

Deep Reinforcement Learning for Combinatorial Optimization: Covering Salesman Problems

Kaiwen Li, Tao Zhang, Rui Wang, Yuheng Wang, and Yi Han

Abstract—This paper introduces a new deep learning approach to approximately solve the Covering Salesman Problem (CSP). In this approach, given the city locations of a CSP instance as input, a deep neural network model is designed to directly output the solution. The model is trained with deep reinforcement learning, without supervision. Specifically, we apply Multi-head Attention to capture the structural patterns of the problem and design a dynamic embedding to handle its dynamic patterns. Once trained, the model generalizes to various CSP tasks (different sizes and topologies) with no need of re-training. Through controlled experiments, the proposed approach shows desirable time complexity: it runs more than 20 times faster than traditional heuristic solvers with only a tiny optimality gap. Moreover, it significantly outperforms the current state-of-the-art deep learning approaches for combinatorial optimization in both training and inference. Compared with traditional solvers, this approach is highly desirable for the challenging tasks in practice that are usually large-scale and require quick decisions.

Index Terms—Covering Salesman Problem, Deep Learning, Attention, Deep Reinforcement Learning.

I. INTRODUCTION

The Traveling Salesman Problem (TSP) is one of the most frequently studied combinatorial optimization problems in the field of operations research. Given a set of cities with their spatial locations, the goal of the TSP is to find a minimum-length tour that visits each city once and returns to the starting city. In this work we focus on the Covering Salesman Problem (CSP), a generalization of the TSP. In the CSP, each city is given a predefined covering distance within which all other cities are covered. The CSP seeks a minimum-length tour over a subset of the given cities such that every city is either visited or covered by at least one city on the tour. The CSP reduces to the TSP when the covering distance of each city is zero; the CSP is therefore NP-hard and more difficult to solve than the TSP. In some practical scenarios, due to limited fuel or manpower, it is hard to guarantee that each city can be traveled exactly

This paper is partially supported by the National Natural Science Foundation of China (No. 72071205 and No. 61773390).

Kaiwen Li, Tao Zhang and Rui Wang (corresponding author) are with the College of Systems Engineering, National University of Defense Technology, Changsha 410073, PR China, and also with the Hunan Key Laboratory of Multi-Energy System Intelligent Interconnection Technology, HKL-MSI2T, Changsha 410073, PR China (e-mail: kaiwenli [email protected], [email protected], [email protected]).

Yuheng Wang is with the Graduate College, National University of Defense Technology, Changsha 410073, PR China.

Yi Han is with the Science and Technology on Parallel and Distributed Processing Laboratory, College of Computer, National University of Defense Technology, Changsha 410073, PR China.

Manuscript received April 19, 2021; revised August 26, 2021.

once, as assumed in the TSP. Hence, the CSP arises in various and heterogeneous real-world applications such as emergency management and disaster planning [1], [2], [3]. For example, in the routing problem of healthcare delivery teams [1], it is not necessary to visit each village, because residents of the villages that are not visited are expected to go to their nearest stop for service.

Traditional approaches to such NP-hard combinatorial optimization problems mainly fall into three categories [4]: exact algorithms, approximate algorithms and heuristic algorithms. Exact algorithms can produce optimal solutions by enumeration or techniques such as branch-and-bound, but require prohibitive computing time on large instances. Approximate algorithms can, in general, obtain near-optimal solutions in polynomial time with theoretical optimality guarantees. However, such algorithms might not exist for all types of optimization problems, and their execution time can still be prohibitive if the time complexity is a higher-order polynomial. Heuristic methods are more favorable in practice since they usually run faster than the above two types of approaches, but they lack theoretical guarantees on solution quality. In addition, as iteration-based approaches, heuristic methods still suffer from long computation time on large instances, since many iterations are required for population updating or heuristic search. Moreover, esoteric domain knowledge and trial-and-error are usually required to design such heuristics, leading to considerable design effort and time.

As most challenging problems in real-world applications are large-scale and usually constrained by execution time, traditional algorithms suffer from specific limitations when applied to them: prohibitive computation time and the need to be revised or re-executed whenever the problem changes. This can be impractical for large-scale tasks in real-world applications. Recent advances in deep learning have shown promise in solving NP-hard decision problems. A well-known example is the inspiring success of deep reinforcement learning on the game of Go [5], a complex discrete decision problem. In the context of these advances, some recent works have focused on using deep learning to solve classical NP-hard combinatorial problems, including the TSP. They replace carefully handcrafted heuristics by a policy parameterized by a deep neural network and learn the policy from data. They provide a new paradigm for combinatorial optimization: the solution is directly output by a deep neural network which is trained on a collection of instances from a


certain type of problem. Such learning-based approaches have shown appealing advantages over traditional solvers in terms of computational complexity and the ability to scale to unseen problems [6], [7].

In the literature, most studies focus on designing different types of heuristic algorithms to tackle the CSP. These carefully designed heuristics can certainly improve performance; however, they are problem-specific and suffer from the aforementioned limitations. In this paper, we propose a new deep learning approach to approximately solve the CSP. The promising idea of leveraging deep learning for combinatorial optimization has been tested on the TSP. However, as a generalization of the TSP, the CSP is harder to address due to its dynamic feature. We therefore propose a powerful deep neural network model, based on Multi-head Attention and a dynamic embedding, that can effectively map the problem input of a CSP to its solution. The model is trained by deep reinforcement learning in an unsupervised way. The proposed approach has empirically shown merits in the following aspects:

• Optimality. On solving the CSP, the proposed approach significantly outperforms the recent state-of-the-art deep learning approaches for combinatorial optimization.

• Execution time. The presented approach runs more than 20 times faster than traditional heuristic solvers with only a small optimality gap. It offers a desirable trade-off between execution time and solution optimality.

• Scalability. Once trained, the model can be used to solve not only one but a collection of CSP instances of different sizes and/or different city locations, as long as they come from the same data generating function. It scales to unseen instances with no need of re-training.

• Generalization. The proposed approach can also generalize to different CSP tasks, e.g., CSPs where each city has a fixed covering radius within which all cities are covered, or CSPs where each city can cover a fixed number of cities. A model trained on only one type of CSP task can solve different types of CSP tasks without re-training.

II. RELATED WORK

A. Covering Salesman Problems

The covering salesman problem (CSP) was first formulated in 1989 by Current and Schilling [1], who presented a straightforward heuristic approach to solve it. The approach has two parts. First, a minimum number of cities that cover all the cities is selected, namely a set covering problem (SCP). The second phase can be regarded as a TSP: the shortest tour is found on the selected cities. Since usually more than one solution can be found for the SCP, the TSP solver is applied to all the found solutions, and the TSP solution with the minimum length is output as the CSP solution.

Since then, a number of good heuristics have been designed to solve the CSP effectively. Golden et al. [8] proposed two local search algorithms (LS1 and LS2), which have been widely used as benchmark approaches in the literature. As local search methods, both start from a random feasible solution and improve it using operators such as destroy, repair and permutation. LS1 first removes a fixed number of nodes from the current solution according to a deletion probability. New nodes are then inserted into the current tour one by one according to an insertion probability until the solution is feasible again. These probabilities are determined by the decrease or increase of the tour length when deleting or adding the node. In contrast, LS2 removes one node at a time from the current solution and then inserts the nodes nearest to the removed node into the solution. The Lin-Kernighan procedure and the 2-opt procedure are also used to improve the solution. A simplified sketch of this destroy-and-repair scheme is given below.
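The following Python pseudocode is an illustrative sketch of the destroy-and-repair skeleton shared by these local searches, not the authors' implementation; the helpers `is_feasible`, `best_insertion_candidate` and `cheapest_position` are hypothetical placeholders for the probability-based rules of [8].

```python
import random

def destroy_and_repair(tour, instance, n_remove=3):
    # Destroy: drop a few nodes from the current tour (LS1 biases this
    # choice by a deletion probability; we sample uniformly here).
    removed = set(random.sample(tour, n_remove))
    tour = [c for c in tour if c not in removed]
    # Repair: re-insert nodes until every city is visited or covered again.
    while not instance.is_feasible(tour):               # hypothetical helper
        city = instance.best_insertion_candidate(tour)  # hypothetical helper
        pos = instance.cheapest_position(tour, city)    # hypothetical helper
        tour.insert(pos, city)
    return tour
```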

Salari et al. [9] developed an Integer Linear Programming (ILP) based heuristic method to solve the CSP. Its main difference from the method of [8] is the idea of applying ILP to improve the solution. The approach first applies similar destroy and repair operations to decrease the tour length, removing some nodes from the current solution and inserting new nodes back to restore feasibility. The solution is further improved by solving an ILP model that minimizes the tour length. Other local search approaches also apply similar destroy and repair operations to the CSP. For example, Shaelaie et al. [10] proposed a variable neighborhood search method and Venkatesh et al. [11] proposed a multi-start iterated local search algorithm.

In addition to local search approaches, various population-based heuristics have been proposed. Salari et al. [12] combined an ant colony optimization (ACO) algorithm with dynamic programming to tackle the CSP, also incorporating local search procedures such as removal, insertion and 3-opt search. Experiments show this approach to be superior on large instances. Tripathy et al. [13] presented a genetic algorithm (GA) with newly designed crossover operators for the CSP; however, its performance fails to beat LS1 and LS2. Different types of GAs [14], [10] have also been developed for the CSP. Additionally, Pandiri et al. [15] proposed an artificial bee colony algorithm for this problem.

In terms of performance, the above heuristic algorithms achieve the same level of optimality as the earlier LS1/LS2 on most instances; some of them outperform LS1/LS2 on several instances. For years, various heuristics have been designed for the CSP with significant specialized knowledge and trial-and-error effort. However, there has been no significant breakthrough, and, to the best of our knowledge, no machine learning method has yet been proposed for the CSP.

B. Deep Learning Approaches for Combinatorial Optimization

The first application of Neural Networks (NNs) to combinatorial optimization is the Hopfield Network [16] proposed in 1985. However, it is trained to solve only one instance at a time, offering little advantage over traditional solvers.

The past five years have seen a surge in applications of deep learning to combinatorial optimization. Deep learning approaches for combinatorial optimization are usually based on the end-to-end learning mode, that is, using a Deep Neural Network (DNN) to directly output the optimal solution. Current state-of-the-art approaches mainly use sequence-to-sequence networks [17] and Graph Neural Networks (GNNs) [18]. Representative works are reviewed as follows.

Vinyals et al. [6] first proposed to use a sequence-to-sequence model for combinatorial optimization, known as the Pointer Network. It uses the attention mechanism to output a permutation of an input sequence. The model is trained offline on pairs of TSP instances and their (near-)optimal tours in a supervised fashion. It successfully solves small TSP instances and reinvigorated this line of work applying deep learning to combinatorial optimization.

Learning from supervised labels can be inapplicable because it is expensive to construct high-quality labeled data, and it prevents the learned model from performing better than the training examples. In this context, Bello et al. [19] proposed to use an Actor-Critic [20] deep reinforcement learning (DRL) algorithm to train the Pointer Network in an unsupervised manner. It takes each TSP instance as a training sample and uses the tour length of the solution obtained by the current policy as the reward, which updates the policy parameters via the policy gradient formula. This approach achieves performance comparable to [6] on small instances and can solve larger TSP instances (up to 100 cities). Nazari et al. [21] replaced the LSTM [22] encoder of the Pointer Network by a simple node embedding, saving up to 60% of the training time on the TSP while maintaining similar performance. This model can also solve the Vehicle Routing Problem (VRP) by adding dynamic elements to the attention mechanism.

Khalil et al. [23] considered the graph structure of the combinatorial optimization problem by adopting a structure2vec GNN model. It consecutively inserts a node into the current partial solution according to node scores parameterized by structure2vec. The model is trained with DQN [24] and is applied to the Minimum Vertex Cover and Maximum Cut problems in addition to the TSP. Mittal et al. [25] replaced the structure2vec of [23] by a Graph Convolutional Network (GCN) and followed the same greedy algorithm, performing better than [23] on large graphs.

Nowak et al. [26] also used a GNN to model the problem, but it outputs an adjacency matrix instead, which is then converted into a feasible solution by beam search. The model is trained with supervision. As this method constructs solutions in a non-autoregressive manner, it performs worse than the above autoregressive approaches on the TSP. Joshi et al. [27] followed the same paradigm as [26] but used a more effective GCN to encode the problem instance. Experiments show superior performance on the TSP when using beam search and a shortest-tour heuristic to construct solutions; however, it requires much longer execution time. Moreover, Li et al. [28] used a GCN to output a probability map that represents the likelihood of each node belonging to the optimal solution; a guided tree search then constructs the solution according to the probability map of all nodes.

Inspired by the Transformer architecture [29], the state-of-the-art model in the field of sequence-to-sequence learning, the authors of [30], [7] applied the Multi-head Attention mechanism of the Transformer to combinatorial optimization. They build TSP solutions autoregressively using attention similar to the Pointer Network. Deudon et al. [30] demonstrated that incorporating a 2-opt local search [31] can improve performance. Kool et al. [7] designed a more effective decoder and proposed to train the model with a new reinforcement learning method using a greedy rollout baseline. This approach achieves state-of-the-art performance among concurrent deep learning approaches for the TSP. Moreover, it can scale to a variety of practical problems such as the VRP and the Orienteering Problem.

In addition, Li et al. [32] adopted deep learning approaches for the multi-objective TSP, outperforming traditional multi-objective optimization solvers in terms of execution time and solution quality. Other combinatorial optimization problems tackled by similar deep learning models include the Maximal Independent Set [33], Graph Coloring [34] and Boolean Satisfiability [34]. Recently, Joshi et al. [35] explored the impact of different training paradigms and revealed favorable properties of reinforcement learning over supervised learning.

III. PRELIMINARIES

A. Problem Definition

In the CSP, we are given a complete graph $G = (V, E)$, where $V = \{1, 2, \dots, n\}$ is the node set representing the cities, $E = \{\{i, j\} : i, j \in V, i < j\}$ is the edge set, and $c_{ij}$ is the cost of edge $\{i, j\}$, usually defined as the shortest distance between nodes $i$ and $j$. Each node $i$ can cover a subset of nodes $S_i$. The CSP asks for a Hamiltonian tour over a subset of the nodes such that every city not on the tour is covered by at least one city on the tour. The objective is to minimize the total tour length.

In this paper, we focus on the Euclidean CSP: the cost of edge $\{i, j\}$ is defined by the Euclidean distance between nodes $i$ and $j$ given their two-dimensional geographical locations.
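For concreteness, the following Python sketch builds a small Euclidean CSP instance under assumed conventions (here a fixed covering radius defines $S_i$; all function and variable names are ours, not the paper's):

```python
import numpy as np

def make_csp_instance(n, cover_radius=0.2, seed=0):
    rng = np.random.default_rng(seed)
    xy = rng.random((n, 2))                    # city locations in [0, 1]^2
    dist = np.linalg.norm(xy[:, None] - xy[None, :], axis=-1)   # costs c_ij
    cover = [np.flatnonzero(dist[i] <= cover_radius) for i in range(n)]  # S_i
    return xy, dist, cover

def tour_length(dist, tour):
    # objective: total length of the cyclic tour over the selected subset
    return sum(dist[tour[k], tour[(k + 1) % len(tour)]] for k in range(len(tour)))
```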

B. Attention Theory for Combinatorial Optimization

For ease of understanding our model, we first introduce the fundamental theory of how deep learning techniques solve a combinatorial optimization problem, the TSP specifically.

Node embedding. The initial node embedding of node $i$ is a mapping of the 2-dimensional input of node $i$, i.e., the two-dimensional coordinates of its location, to a $d_h$-dimensional vector ($d_h = 128$ in this work).

Pointer Network. The model architecture of the Pointer Network is shown in Fig. 1. Its basic structure is the Encoder-Decoder: the encoder encodes the node embeddings into a Vector, while the decoder decodes the Vector into an output sequence. Both the encoder and the decoder are Recurrent Neural Networks (RNNs). While the encoder RNN encodes the inputs, a hidden state $e_i$ is produced for each city $i$; $e_i$ can be interpreted as the feature vector of city $i$. The final hidden state of the encoder RNN is the Vector.

As indicated in Fig. 1, the Vector is used as input to the decoder RNN, and decoding happens sequentially. The Vector also serves as the initial hidden state $d_0$, which is used to select the first city to visit. The next hidden state $d_1$ is obtained by feeding the node embedding of the selected city to the decoder RNN. Specifically, at decoding step $t$, the probability of city selection is calculated from the decoder state $d_{t-1}$ and the feature vector $e_i$ of each city. The RNN reads the node embedding of the selection and outputs the current decoder state $d_t$, which is then used in the next decoding step. The calculation of the probability is namely the attention method and is computed as follows:

$$u_i^t = v^T \tanh\left(W_1 e_i + W_2 d_t\right),\; i \in \{1, \dots, n\}, \qquad P\left(y_{t+1} \mid y_1, \dots, y_t, X\right) = \mathrm{softmax}\left(u^t\right), \tag{1}$$

where $v$, $W_1$, $W_2$ are learnable parameters. The city with the largest probability is selected as the city to be visited at step $t$.

For example, as shown in Fig. 1, to select the first city at decoding step $t = 1$, the scores $u_1^1, u_2^1, u_3^1$ are calculated from $(d_0, e_1)$, $(d_0, e_2)$ and $(d_0, e_3)$ according to Equation (1). The city with the largest $u^1$, i.e., $u_2^1$, is selected as the first city, which is city 2. By feeding the node embedding of city 2 into the decoder RNN, we get $d_1$. The next city is selected according to the probabilities calculated by applying the attention formula on $(d_1, e_1)$, $(d_1, e_2)$ and $(d_1, e_3)$. The procedure then loops until all the cities are selected. After training the parameters $v$, $W_1$, $W_2$ offline, the Pointer Network obtains the ability to output the desired tour of the cities.
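As a minimal PyTorch sketch (our own, not the authors' code), the additive attention scoring of Eq. (1) can be written as:

```python
import torch

def pointer_scores(E, d_t, W1, W2, v):
    # E: (n, dh) encoder states e_i;  d_t: (dh,) current decoder state
    u = torch.tanh(E @ W1.T + d_t @ W2.T) @ v   # (n,) scores u_i^t, Eq. (1)
    return torch.softmax(u, dim=0)              # P(y_{t+1} | y_1..y_t, X)
```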

Fig. 1. A 3-city TSP instance as an example to demonstrate the Pointer Network. Inputs are the embeddings of the city locations. The cities are selected step by step: at each step $t$, the city with the largest $u_i^t$ is selected, where $u_i^t$ for each city $i$ is calculated according to (1).

Self-attention. In the above Pointer Network, an RNN is used to get the feature vector $e_i$ of each city $i$ in the encoder. Self-attention achieves the same functionality. It is the basic component of the Transformer [29], the state-of-the-art model in sequence-to-sequence learning tasks such as machine translation. By applying self-attention in the encoder, the obtained attention value of each input $i$ stores not only its own information but also its relations with the other inputs. Self-attention can thus be more effective in capturing the structural patterns of the problem.

Recall the attention mechanism in (1): $d_t$ at the $t$-th city selection step can be seen as a query. It contains the information of the already visited cities. The representation $e_i$ of each city $i$ can be seen as a key. The $i$-th city is given more attention if its key is more compatible with the query. In (1), the attention value is interpreted as the probability of being selected.

Self-attention works by calculating the attention value of each component in a sequence with respect to all other components, and the obtained attention value of the $i$-th component serves as its representation. The processed embeddings therefore contain more information than the original sequence. Different from the attention calculation in (1), the Scaled Dot-Product Attention is used in the Transformer's self-attention: the compatibility of query $q_t$ and key $k_i$ is simply calculated as their dot product $q_t^T k_i$ and then scaled by a scaling factor. The self-attention mechanism is introduced as follows.

The basic attention is realized by a query and key-value pairs: more attention is given to a value if its key is more compatible with the query. For self-attention in the encoder, the query, key and value are all linear projections of the input node embeddings:

$$Q = W^Q X, \quad K = W^K X, \quad V = W^V X. \tag{2}$$

That is, the query, key and value of the $i$-th node are:

$$q_i = W^Q X_i, \quad k_i = W^K X_i, \quad v_i = W^V X_i. \tag{3}$$

The process of calculating the attention value of the $i$-th node is depicted in Fig. 2. First, the relations of the $i$-th node with the other nodes are computed by a compatibility function of the query of the $i$-th node with the keys of all other nodes: $q_i K^T$. The obtained compatibilities serve as weights and are normalized by the softmax function. The attention value is finally computed as a weighted sum of the values $V$, where the weights are the normalized compatibilities. Thus, the attention value of the $i$-th node is $\mathrm{softmax}(q_i K^T)\, V$, which has the same dimension ($d_h$) as its initial node embedding $X_i$. As a result of the self-attention process, the computed attention value of the $i$-th node stores not only its own knowledge but also its relations with the other nodes.

Fig. 2. The process of calculating the attention value given the query and key-value pairs: compatibilities $F(Q, K)$ of the query with each key are normalized by softmax, and the attention value is the weighted sum of the values.


In practice, we compute the attention values of all nodes simultaneously by matrix multiplication:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V, \tag{4}$$

where $\sqrt{d_k}$ is the scaling factor and $d_k$ is the dimension of the key.

Multi-Head Attention (MHA). As pointed out in [29], it is beneficial to carry out the attention process multiple times and obtain $M$ attention values, which are then linearly projected into a final attention value. Hence, various features can be extracted by the MHA process.
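As an illustration, the following PyTorch sketch (our own, with assumed tensor shapes; $d_h = 128$ and $M = 8$ heads match the paper's configuration) implements Eq. (4) and the multi-head scheme just described:

```python
import torch

def attention(Q, K, V):
    # Eq. (4): scaled dot-product attention
    d_k = K.shape[-1]
    weights = torch.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    return weights @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo, M=8):
    n, dh = X.shape
    def heads(W):  # project the embeddings, then split into M heads
        return (X @ W).reshape(n, M, dh // M).transpose(0, 1)   # (M, n, d_k)
    out = attention(heads(Wq), heads(Wk), heads(Wv))            # (M, n, d_k)
    return out.transpose(0, 1).reshape(n, dh) @ Wo              # concat + project
```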

Overall, the structural patterns of the problem can be extracted effectively by using the MHA in the encoder.

IV. ATTENTION MODEL

We follow the Encoder-Decoder architecture to model the CSP. The encoder constructs the key of the model, which consists of two parts: static embeddings, obtained by the MHA, that capture the structural patterns of the problem; and dynamic embeddings, obtained by a designed guidance vector, that describe the dynamic patterns of the problem. The decoder is designed with an RNN and an additional attention operator.

The model is formally described as follows. Taking a CSP instance $s$ with $n$ cities as an example, the feature $x_i$ of city $i$ is defined by its location, i.e., its x- and y-coordinates. A solution $\pi = (\pi_1, \dots, \pi_k)$, $k \le n$, is a permutation of a subset of the cities; all cities must be visited or covered by traveling along the tour $\pi$. With this formulation of the CSP, the Attention-dynamic model is designed. The model takes the city locations of instance $s$ as input and outputs the sequence $\pi$. More formally, the model defines a policy $p(\pi \mid s)$ for selecting $\pi$ given $s$:

$$p_\theta(\pi \mid s) = \prod_{t=1}^{k} p_\theta\left(\pi_t \mid s, \pi_{1\sim t-1}\right), \quad k \le n. \tag{5}$$

The encoder encodes the original features $x$ and produces the final embeddings (representations) of the cities. The decoder takes the encoder embeddings as input, summarizes the information of the previously selected cities $\pi_{1\sim t-1}$, and then outputs $\pi_t$, one city at a time. $\theta$ represents the model parameters such as $W^Q$, $W^K$ and $W^V$. With an optimal set of parameters $\theta^*$, the model is able to output the optimal tour $\pi^*$ given a problem instance $s$.

A. Encoder

The purpose of the encoder is to produce a representation for each city, that is, to produce the key of each city. Our key consists of two parts: static embeddings and dynamic embeddings.

Construction of static embeddings. The static embedding represents the static feature of a city, i.e., its location. It is constructed similarly to the TSP models in [30], [7]. We employ the introduced MHA to produce the static embeddings, as shown in Fig. 3. Its basic element is self-attention, which produces a representation of each city $i$ that stores not only its own feature but also its relations with the other cities. As depicted in Fig. 3, in the MHA, self-attention is conducted $h$ times and the obtained $h$ embeddings are projected into a final embedding.

The architecture for computing the static embeddings consists of $N$ sequentially connected attention layers with the same structure; that is, the embeddings are processed $N$ times sequentially. Each attention layer has two sub-layers: the first is an MHA layer and the second is a fully connected feed-forward layer. In addition, each sub-layer is wrapped with a residual connection [36] and layer normalization, so the output of each sub-layer becomes $\mathrm{Norm}(x + \mathrm{Sublayer}(x))$.

Overall, computing the static embeddings needs two steps:

• We first map the $d_x$-dimensional city locations ($d_x = 2$) to the $d_h$-dimensional node embeddings ($d_h = 128$):

$$h_i^{(0)} = W^x x_i + b^x, \tag{6}$$

where $W^x$ and $b^x$ are learnable parameters. Assuming there are $n$ cities, the initial node embeddings have shape $n \times d_h$.

• As shown in Fig. 3, the initial node embeddings $h^{(0)}$ are processed and updated through $N$ attention layers. The output embeddings of the $\ell$-th layer are computed as:

$$h_{\mathrm{tmp}} = \mathrm{Norm}^\ell\left(h^{(\ell-1)} + \mathrm{MHA}^\ell\left(h^{(\ell-1)}\right)\right), \quad h^{(\ell)} = \mathrm{Norm}^\ell\left(h_{\mathrm{tmp}} + \mathrm{FF}^\ell\left(h_{\mathrm{tmp}}\right)\right). \tag{7}$$

The output embedding $h^{(N)}$ of the $N$-th layer is the final static embedding, where $h_i^{(N)}$ is the $d_h$-dimensional static feature vector of city $i$.
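A compact PyTorch sketch of one such attention layer, Eq. (7), might look as follows (a minimal sketch under our own naming and the layer normalization described above; it is not the authors' released code):

```python
import torch.nn as nn

class AttentionLayer(nn.Module):
    def __init__(self, dh=128, heads=8, ff_dim=512):
        super().__init__()
        self.mha = nn.MultiheadAttention(dh, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dh, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dh))
        self.norm1 = nn.LayerNorm(dh)
        self.norm2 = nn.LayerNorm(dh)

    def forward(self, h):                         # h: (batch, n, dh)
        h = self.norm1(h + self.mha(h, h, h)[0])  # MHA sub-layer + residual
        return self.norm2(h + self.ff(h))         # feed-forward sub-layer + residual
```

Stacking $N$ such layers (TABLE I uses $N = 3$ encoder layers) on top of the initial embedding of Eq. (6) yields $h^{(N)}$.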

Construction of dynamic embeddings. The static embeddings $h^{(N)}$ can be directly taken as the key for the TSP [7]. However, the states of the cities in a CSP are not static during decoding. For example, once a city is covered, its probability of being visited may be reduced, and it may be reduced further if the city is covered multiple times. It is thus vital to introduce the concept of covering into the construction of the key for CSP optimization. To this end, we define a guidance vector $g$ whose entry $g_i$ indicates the covering state of city $i$; $g_i$ is dynamically updated at each decoding step $t$.

Assume city $\pi_t$ is selected to be visited at decoding step $t$, and let $C(\pi_t)$ denote the set of cities covered by $\pi_t$, with $NC(\pi_t)$ the number of cities in $C(\pi_t)$. Cities in $C(\pi_t)$ are sorted by their distance to $\pi_t$, so each city $i \in C(\pi_t)$ has a sorted index $c_i$ ranging from 1 to $NC(\pi_t)$, where $c_i = 1$ indicates that city $i$ is nearest to $\pi_t$.

At first, $g_i$ is initialized to 1 for all cities:

$$g_i = 1, \quad i = 1, \dots, n. \tag{8}$$

At each decoding step $t$, $g_i$ is then updated dynamically:

$$g_i \leftarrow g_i \times \frac{c_i}{NC(\pi_t)}, \quad i \in C(\pi_t). \tag{9}$$


Fig. 3. The process of calculating the static embeddings in our model: city locations $x$ are mapped to initial node embeddings $h^{(0)}$, which pass through $N$ attention layers, each built from multi-head self-attention (Q, K, V projections, concatenation and a linear projection) with add-and-norm and feed-forward sub-layers, yielding the final embeddings $h^{(N)}$.

According to (9), the dynamic state of city $i$ is represented by the value $g_i$. Supposing that city $\pi_t$ is visited, the city nearest to $\pi_t$ is assigned the smallest value, and $g_i$ is reduced further if city $i$ is covered again by other cities.

Hence, the guidance vector contains the covering information of each city: whether it is covered, and its distance to the city that covers it. It can thus guide the selection of the next city; e.g., a city closer to the already visited cities may have a lower probability of being selected. A sketch of this update is given below.
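As a concrete illustration, a NumPy sketch of the update in Eqs. (8)-(9) could look as follows (our own helper names; `cover[j]` is assumed to hold the index array $C(j)$):

```python
import numpy as np

def update_guidance(g, dist, cover, pi_t):
    covered = cover[pi_t]                               # C(pi_t)
    order = covered[np.argsort(dist[pi_t, covered])]    # sort by distance to pi_t
    for rank, i in enumerate(order, start=1):           # rank = c_i in 1..NC(pi_t)
        g[i] *= rank / len(covered)                     # Eq. (9)
    return g

# g = np.ones(n) initializes the guidance vector as in Eq. (8)
```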

Note that we do not explicitly reduce the selection probability of city $i$ according to its $g_i$ value, because a covered city can still be a good option on some occasions. Instead, we let the model itself learn how to utilize $g_i$ to adjust the probability. To this end, $g_i$ is linearly projected to a $d_h$-dimensional embedding $G_i$ by learnable weights $W^G$:

$$G_i = g_i W^G. \tag{10}$$

As $G_i$ changes dynamically during decoding, $G_i$ is called the dynamic embedding in this study. While the static embedding $h^{(N)}$ indicates the static states of the cities, i.e., their spatial locations, the dynamic embedding $G$ indicates their dynamic covering states. We then construct the key from $h^{(N)}$ and $G$:

$$k_i = W^K h_i^{(N)} G_i, \quad i = 1, \dots, n, \tag{11}$$

where $k_i$ is the key of city $i$ and $W^K$ are learnable weights of size $(d_h \times d_h)$. In this way, the model can capture both the static and dynamic patterns of the problem and thus learns better policies.

B. Decoder

The decoder is composed of two parts: an RNN, whose output $d_t$ is used to compute the query $q_t$ at decoding step $t$; and an Encoder-Decoder Attention, which calculates the probability of city selection from the query $q_t$ and the keys. Intuitively, the city with the largest probability is selected as $\pi_t$ at decoding step $t$. The probability is calculated through the Encoder-Decoder Attention based on the query $q_t$ from the RNN and the keys from the encoder; that is, among all keys $k_i, i = 1, \dots, n$, the one that best matches $q_t$ determines the next city to be visited. This part illustrates how the query is built in the model.

Construction of query. To select the next city to be visited, it is natural to take the previously visited cities as the query. As an RNN is capable of handling such sequential information, GRU [37], a variant of the RNN, is employed to construct the query in the decoder. We set the mean of the encoder embeddings $\bar{h} = \frac{1}{n}\sum_{i=1}^{n} h_i^{(N)}$ as the initial hidden state of the decoder RNN, i.e., $d_0 = \bar{h}$. Meanwhile, we use a learned parameter $v_1$ as the first input to the decoder RNN at step $t = 1$. In the following steps $t > 1$, the input to the RNN is the node embedding $h_{\pi_{t-1}}^{(N)}$ of the previous selection $\pi_{t-1}$:

$$o_t, d_t = f_{\mathrm{GRU}}\left(I_t, d_{t-1}\right), \quad d_0 = \bar{h}, \tag{12}$$

$$I_t = \begin{cases} v_1 & t = 1 \\ h_{\pi_{t-1}}^{(N)} & t > 1. \end{cases} \tag{13}$$

$d_t$ is the hidden state of the decoder RNN at decoding step $t$, and it stores the information of the previously selected cities $\pi_{1\sim t-1}$. As the selection of $\pi_t$ largely depends on $\pi_{1\sim t-1}$, $d_t$ is a natural choice for the query.

However, $\pi_{1\sim t-1}$ is not the only factor that determines $\pi_t$. To construct the query, the relations between the already visited cities and the remaining cities should also be considered; for example, a city closer to the last visited city may have a higher chance of being visited next. Thus we further process $d_t$ with the attention mechanism as follows:

$$q_t = \mathrm{Attention}(d_t, K_1, V_1) = \mathrm{softmax}\left(\frac{d_t K_1^T}{\sqrt{d_k}}\right)V_1, \tag{14}$$


where $K_1$ and $V_1$ are linear projections of the node embeddings $h^{(N)}$ from the encoder. The query $q_t$ for selecting $\pi_t$ is thus defined as the attention value of $d_t$ over the key-value pairs $K_1, V_1$. By adding this extra attention operation to $d_t$, the obtained query contains richer information: the compatibility of the previously selected cities (represented by $d_t$) with all other cities (represented by $h^{(N)}$).
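A minimal PyTorch sketch of this query construction, Eqs. (12)-(14), under our own naming (the GRU cell and the projections $K_1$, $V_1$ are assumed to be built elsewhere):

```python
import torch

def decoder_query(gru_cell, inp, d_prev, K1, V1):
    # inp: embedding of the previous selection (or the learned v_1 at t = 1)
    d_t = gru_cell(inp, d_prev)                  # hidden state of Eq. (12)
    scores = d_t @ K1.T / K1.shape[-1] ** 0.5    # compatibility with h^(N)
    q_t = torch.softmax(scores, dim=-1) @ V1     # Eq. (14)
    return q_t, d_t
```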

C. Compute Probability by Attention

At decoding step $t$, we compute the selection probability of all cities $i = 1, \dots, n$ based on the query-key pair $(q_t, k_i)$ by the Scaled Dot-Product Attention:

$$u_i^t = \frac{q_t^T k_i}{\sqrt{d_k}}, \tag{15}$$

where $\sqrt{d_k}$ is the scaling factor and $d_k = d_h / M$, with $M$ the number of heads in the MHA. In addition, we mask the cities that are already visited to make sure they will not be selected again:

$$u_i^t = \begin{cases} -\infty & \text{if } i \in \pi_{1\sim t-1} \\ \dfrac{q_t^T k_i}{\sqrt{d_k}} & \text{otherwise.} \end{cases} \tag{16}$$

The probability of the already visited cities becomes zero after the softmax operator is applied to $u_i^t = -\infty$. The final probability is computed by scaling $u_i^t$ with the softmax operator:

$$p_i = \frac{e^{u_i^t}}{\sum_{j=1}^{n} e^{u_j^t}}, \quad i = 1, \dots, n. \tag{17}$$

Assuming the greedy strategy is used, the city with the largest probability $p_i$ is selected to be visited at decoding step $t$; a sketch of this selection step is given below.
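The following PyTorch fragment (a sketch with assumed shapes, not the released implementation) combines Eqs. (15)-(17) with greedy decoding:

```python
import torch

def select_city(q_t, keys, visited_mask):
    # keys: (n, d_k) matrix of k_i;  visited_mask: boolean (n,) for pi_{1..t-1}
    u = keys @ q_t / keys.shape[-1] ** 0.5            # Eq. (15)
    u = u.masked_fill(visited_mask, float('-inf'))    # Eq. (16)
    p = torch.softmax(u, dim=-1)                      # Eq. (17)
    return int(p.argmax())                            # greedy strategy
```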

The model is therefore composed of the above encoder, decoder and attention. The encoder takes the features of the CSP instance as input and outputs the encoder embeddings. Decoding happens sequentially from $t = 0$ and stops when all the cities are visited or covered. The selected cities form the final solution $(\pi_0, \pi_1, \dots)$, i.e., a permutation of a subset of the cities.

V. TRAIN WITH REINFORCE ALGORITHM

Given a CSP instance $s$, Section IV defines the model, parameterized by $\theta$, that produces the probability distribution $p_\theta(\pi \mid s)$ through Equations (16) and (17). A solution $\pi \mid s$ can be obtained by sampling from $p_\theta(\pi \mid s)$. Since the goal is to minimize the total length $L(\pi)$ of the cyclic tour, it is suitable to train the model with the REINFORCE algorithm [38], which uses Monte Carlo rollouts to compute the rewards: the whole episode is played through to obtain a complete cyclic tour, and the total tour length gives the reward $-L(\pi)$. The agent can then learn to improve itself from the state-action-reward tuples.

Formally, given the actions $\pi \mid s$ and the corresponding reward/loss $L(\pi)$, the parameters $\theta$ of the model can be updated by gradient descent using the REINFORCE algorithm [38]:

$$\nabla_\theta L(\theta) = \mathbb{E}_{p_\theta(\pi \mid s)}\left[\nabla_\theta \log p_\theta(\pi \mid s)\, L(\pi)\right], \quad \theta \leftarrow \theta + \nabla_\theta L(\theta). \tag{18}$$

Since a number of city-selection actions are sampled from the probability distribution and the total length is computed as the reward, the recorded rewards of different episodes have high variance due to the uncertainty of the sampling. This variance provides conflicting descent directions for the model to learn and hurts the speed of convergence. To reduce the variance, it is common to introduce a baseline $b(s)$ and rewrite the policy gradient in (18):

$$\nabla_\theta L(\theta) = \mathbb{E}_{p_\theta(\pi \mid s)}\left[\nabla_\theta \log p_\theta(\pi \mid s)\left(L(\pi) - b(s)\right)\right]. \tag{19}$$

Loosely speaking, $b(s)$ serves as the average performance: a policy is encouraged if its action performs better than the average. A popular method is to train an additional neural network, the critic network, to approximate $b(s)$; its parameters are learned from pairs $(s, L(\pi))$.

However, as reported in [7], training a policy network and a critic network simultaneously is hard to converge. We therefore follow the greedy rollout baseline introduced in [7], which has proved superior to the critic baseline of the traditional Actor-Critic training algorithm.

Here, no additional neural network is required to approximate $b(s)$. Instead, given a CSP instance $s$, $b(s)$ is computed as the length of the tour from a deterministic greedy rollout of the baseline policy $p_{\theta^*}$. A greedy rollout of a policy means that a solution is built by running the policy greedily, i.e., at each step the city with the largest output probability is selected. The baseline policy is the best model so far during training: at the end of each training epoch, if the trained model is better than the baseline policy, we replace the baseline policy with the current model.

Algorithm 1 outlines the REINFORCE training algorithm with the greedy rollout baseline and the Adam optimizer [39]. In this way, if a sampled solution $\pi$ is better than the greedy rollout of the best model so far, $(L(\pi) - b(s))$ is negative and such actions are reinforced, and vice versa. The model can thus effectively learn to solve the problem, i.e., to minimize the total tour length.

Algorithm 1 Training Algorithm using REINFORCE
Input: training set $\mathcal{X}$, batch size $B$, number of epochs $N_{\mathrm{epoch}}$, number of steps per epoch $N_{\mathrm{steps}}$
1: Initialize model parameters of the current policy and baseline policy: $\theta$, $\theta^* \leftarrow \theta$
2: for epoch $\leftarrow 1 : N_{\mathrm{epoch}}$ do
3:   for step $\leftarrow 1 : N_{\mathrm{steps}}$ do
4:     $s_i \leftarrow \mathrm{RandomInstance}(\mathcal{X})$ for $i \in \{1, \dots, B\}$
5:     $\pi_i \leftarrow p_\theta(s_i)$  ▷ run the policy by sampling
6:     $\pi_i^* \leftarrow p_{\theta^*}(s_i)$  ▷ run the baseline policy greedily
7:     $\nabla L \leftarrow \sum_{i=1}^{B}\left(L(\pi_i) - L(\pi_i^*)\right)\nabla_\theta \log p_\theta(\pi_i)$
8:     Update $\theta$ by Adam according to $\nabla L$
9:   end for
10:  if $p_\theta$ performs better than $p_{\theta^*}$ then
11:    $\theta^* \leftarrow \theta$
12:  end if
13: end for
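For illustration, one inner step of Algorithm 1 could be written in PyTorch roughly as follows (a hedged sketch: `model.sample`, `baseline.greedy` and `tour_len` are assumed interfaces, not the paper's code):

```python
import torch

def train_step(model, baseline, optimizer, batch, tour_len):
    tours, log_p = model.sample(batch)               # pi_i ~ p_theta (line 5)
    with torch.no_grad():
        base_tours = baseline.greedy(batch)          # pi*_i, greedy rollout (line 6)
        advantage = tour_len(tours) - tour_len(base_tours)   # L(pi_i) - L(pi*_i)
    loss = (advantage * log_p).mean()                # REINFORCE gradient (line 7)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                 # Adam update (line 8)
```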


VI. EXPERIMENT SETUP

We investigate the performance of the proposed approach on CSP tasks with 20, 50, 100, 200 and 300 cities.

Baseline Algorithms. The presented approach (Attention-dynamic) is compared against both recent state-of-the-art deep learning techniques and classical non-learned heuristic methods. The most popular deep learning methods for combinatorial optimization of the past few years are included in the comparisons:

• Pointer Network [6] (PN).

• Pointer Network with dynamic embeddings [21] (PN-dynamic).

• Attention model [7] (Attention).

The PN model [6] is the first deep learning model that can effectively tackle combinatorial optimization problems and is a widely used baseline. The PN-dynamic model [21] targets the VRP, whose state changes during the decision process, a dynamic feature similar to the CSP's. The Attention model [7] is the current state-of-the-art deep learning method for various combinatorial optimization problems. These two models are therefore included as competitors.

Before the emergence of deep learning techniques, heuristic methods were the main approaches to solve the CSP. Therefore, in addition to deep learning methods, we also include non-learned heuristics in the comparisons. LS1 and LS2 [8] are commonly used benchmark algorithms for the CSP in the literature [8], [10], [9], [12]. Since all the deep learning models run on the Python platform and no source code is available for LS1/LS2, we implemented LS1/LS2 in Python to make the comparisons fair. The code is written strictly according to [8] and is publicly available¹.

Hyperparameter Configurations. The hyperparameters used for training our model are shown in TABLE I. Across all experiments with our model and the compared neural combinatorial baselines, we use mini-batches of 256 instances, 128-dimensional node embeddings and an Encoder-Decoder with 128 hidden units. Models are trained with the Adam optimizer [39] at a fixed learning rate of $10^{-4}$.

In the training phase, 320,000 CSP instances are generated and used to train for 50 epochs. The training CSP instances are randomly generated: city locations are drawn uniformly from [0,1]×[0,1], and the number of nearest neighbor cities covered by each city (NC) is set to 7. It is worth noting that NC is fixed when training the model but can be set to an arbitrary number at test time; this validates the generalization ability of the model and is further discussed in Section VII-C. We use the same set of hyperparameters and the same training dataset with a fixed random seed for training the other neural combinatorial baselines. For LS1 and LS2, we follow the hyperparameters recommended in [8] for the different problem dimensions. A sketch of the instance generation is given below.
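In sketch form (our own code, matching the description above), training instances can be sampled as:

```python
import numpy as np

def sample_instance(n, nc=7, seed=None):
    rng = np.random.default_rng(seed)
    xy = rng.random((n, 2))          # city locations, uniform over [0,1] x [0,1]
    dist = np.linalg.norm(xy[:, None] - xy[None, :], axis=-1)
    # each city covers its nc nearest neighbors (column 0 is the city itself)
    cover = np.argsort(dist, axis=1)[:, 1:nc + 1]
    return xy, cover
```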

Implementations. All experiments are implemented in Python and conducted on the same machine with one GTX 2080Ti GPU and a 16-core Intel i7-9800X CPU with 64 GB of RAM.

¹ https://github.com/kevin031060/CSP Attention/tree/master/LS

TABLE I
HYPERPARAMETER CONFIGURATIONS

Hyperparameter | Value | Hyperparameter | Value
Batch size | 256 | Hidden dimension | 128
No. of epochs | 50 | No. of heads | 8
No. of instances per epoch | 320000 | No. of encoder layers | 3
Optimizer | Adam | Learning rate | 1e-4

Our code is open source² to reproduce the experimental results and to facilitate future studies.

Evaluation Procedure. We evaluate performance on a held-out test set of 100 randomly generated CSP instances and take the average predicted tour length as the performance indicator. Since the compared LS1 and LS2 require long execution time, we generate only 100 instances for testing. As pointed out in [30], a posterior local search can effectively improve the quality of solutions obtained by a neural combinatorial model; we therefore apply a simple local search to further process the tours obtained by our model. The running time of our method is the sum of the inference time of the neural network model and the execution time of the local search.

VII. RESULTS AND DISCUSSION

The learning, approximation and generalization abilities of the presented approach are demonstrated through the following experimental results.

A. Comparisons against deep learning baselines on training

We first evaluate the learning ability of our method in the training phase against the recent state-of-the-art deep learning baselines: PN, PN-dynamic and the Attention model.

Fig. 4 compares the performance of the approaches during training. The average predicted tour length on the held-out validation set is taken as the cost to evaluate the models in the training phase. As the lines may be too close to distinguish, Fig. 5 shows the final costs of the compared models at the end of the training phase.

We observe that our method clearly outperforms the baselines in terms of both convergence speed and final cost. The PN and Attention models fail to converge to the optimum as they do not take into account the dynamic information of the covering state of the cities. Although the PN-dynamic model considers the dynamic features, it is surpassed by our model: the MHA mechanism we use can effectively extract the features of the CSP task, which helps our model converge significantly faster and reach a lower cost than the PN-dynamic model. Our approach also appears more stable and sample-efficient than the baselines.

Additionally, TABLE II reports the training time of the compared models. Besides their lower solution quality, the Pointer Network-based models are also computationally inefficient during training. Our method saves up to 75% of the training time while achieving a lower cost than the PN-dynamic model.

² https://github.com/kevin031060/CSP Attention


Fig. 4. Held-out validation set cost as a function of the number of training batches for the compared models (Attention-dynamic, Attention, PN, PN-dynamic).

Fig. 5. The final costs of PN-dynamic, Attention and Attention-dynamic at the end of the training phase.

TABLE II
COMPARISONS OF TRAINING TIME FOR 100 BATCHES (IN SECONDS)

Method | CSP50 | CSP100
PN | 85.6 | 176.2
PN-dynamic | 107.5 | 237.8
AM | 26.3 | 45.9
AM-dynamic | 32.6 | 52.6

B. Model performance on test set

To evaluate model performance, both small-scale CSP tasks (CSP20, CSP50, CSP100) and larger-scale tasks (CSP150, CSP200, CSP300) are used for testing. We mainly focus on learning from small-scale training instances and measuring generalization to larger, unseen test instances: the model trained on CSP20 instances is used to solve randomly generated CSP20 and CSP50 test problems; the model trained on CSP100 instances is used to solve CSP100 and CSP150 test problems; and the model trained on CSP200 instances is used to solve CSP200 and CSP300 test problems. We empirically find that performance deteriorates significantly for all deep learning methods when a model trained on CSP20 is used to solve CSP300 instances; therefore, to evaluate model performance fairly, we avoid setups with such a large gap in instance size between training and testing. We use the average predicted tour length as the cost to evaluate the models, and the optimality gap of each approach w.r.t. the best solver is also used as a performance metric.

TABLE III and TABLE IV present the performance of the compared approaches on small-scale CSP tasks (CSP20, CSP50, CSP100) and larger-scale CSP tasks (CSP150, CSP200, CSP300) respectively. Results of the non-learned heuristic methods are shown in the first part of the tables, while the second part shows the results obtained by the pure deep learning models without a posterior local search.

Performance over deep learning baselines. Our method clearly outperforms all deep learning baselines on all CSP tasks. Current state-of-the-art deep learning methods for combinatorial problems fail to solve the CSP task, leaving a large optimality gap w.r.t. the classical heuristic solvers. Our method comfortably surpasses these deep learning baselines with a much smaller gap to the heuristic solvers.

Performance over heuristic baselines. We then apply a simple local search to improve the solutions obtained by our end-to-end model. Results are shown in the third part of TABLE III and TABLE IV. Additionally conducting a local search improves the solutions but leads to longer execution time, consistent with the results reported in [30]. The hybrid method based on our Attention-dynamic model achieves optimality significantly closer (no more than 2% on all tasks except CSP20) to traditional heuristic methods like LS1. Although an optimality gap remains, our method runs dramatically faster than the heuristic methods: about 20 to 40 times faster than LS1/LS2 with the local search, and thousands of times faster without it. Note that our approach is trained on a dataset different from the test instances; the results are obtained by generalizing the model to unseen CSP tasks with different city locations and dimensions. It is thus reasonable that a slight performance gap remains between our method and the heuristic solvers. Also, our goal is not to defeat the specialized, carefully designed heuristic solvers, but to show the fast solving speed and high generalization ability of the proposed approach.

It should be noted that heuristic solvers usually require a stopping criterion, e.g., the cost not changing for a fixed number of epochs. With such extra search procedures, their execution time can be longer than expected. The above experiments were conducted with the heuristic methods tuned to their best. However, it is more interesting to compare running times with all methods executed to the same level of optimality.

In this context, we additionally design controlled experiments to make the comparisons fair and explicit, i.e., comparing execution times by stopping all methods at the same cost. Specifically, the cost obtained by our Attention-dynamic (LS) method is set as the benchmark cost, and the heuristic methods stop immediately once they reach or surpass this cost; their execution time at that point is recorded. Results are shown in TABLE V. Clearly, our approach is more than 10 times faster than the traditional heuristic solvers when all methods reach the same level of optimality, showing the significantly more favorable time complexity of our approach compared with traditional solvers.

C. Validation of Generalization

In this part, we analyze the generalization ability of the proposed approach to different types of CSP tasks. The models are trained on CSP instances with NC=7 (each city covers its seven nearest neighbors) and are then used to solve different types of CSP tasks.

• We first generalize our model to CSP tasks with different NC values. TABLE VI summarizes the results with NC=7, 11 and 15, respectively.

• In the above CSP tasks, all cities cover the same number of nearby cities. In addition, we test CSP tasks in which each city covers a different number of its nearby cities. To this end, we generate CSP instances with a random NC between 2 and 15 for each city. The original model trained on instances with NC=7 is used for this task. TABLE VII presents the average results of 100 runs.

• The same model is then used to solve another category of CSP: each city has a fixed coverage radius within which all other cities are covered. CSP tasks with a fixed coverage radius and with a variable coverage radius for each city are both tested. The coverage radius is set to a fixed value of 0.2 for the first task, and is drawn uniformly from [0, 0.25] for the second task. Results are reported in TABLE VII. (A sketch of how such instances can be generated follows this list.)
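As referenced above, the four instance variants might be generated as follows, assuming city coordinates drawn uniformly from the unit square; the function and parameter names are our own illustration:

import numpy as np

def make_csp_instance(n, mode="fixed_nc", nc=7, radius=0.2, rng=None):
    # Random CSP instance: city coordinates in the unit square plus, for
    # each city, the set of cities it covers.
    #   "fixed_nc"  : every city covers its nc nearest neighbors
    #   "random_nc" : per-city NC drawn uniformly from {2, ..., 15}
    #   "fixed_r"   : every city covers all cities within `radius`
    #   "random_r"  : per-city radius drawn uniformly from [0, 0.25]
    rng = rng or np.random.default_rng()
    coords = rng.random((n, 2))
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    covers = []
    for i in range(n):
        if mode in ("fixed_nc", "random_nc"):
            k = nc if mode == "fixed_nc" else int(rng.integers(2, 16))
            covers.append(set(np.argsort(dist[i])[1:k + 1]))  # index 0 is city i itself
        else:
            r = radius if mode == "fixed_r" else float(rng.uniform(0.0, 0.25))
            covers.append(set(np.flatnonzero(dist[i] <= r)) - {i})
    return coords, covers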

Results in TABLE VI demonstrate that the performance of our approach drops slightly when generalizing to CSP tasks with different NC values: the optimality gap can exceed 3% and 2% on some instances with NC=11 and NC=15, respectively. But our method is still able to produce near-optimal solutions with a tiny optimality gap while requiring significantly less execution time than the traditional heuristic methods. A notably large optimality gap is found on CSP100 instances with NC=15. Their optimal cost is similar to that of CSP20 instances with NC=7, on which our model also performs the worst (6.94%) among the test instances. We therefore tentatively conclude that the proposed method performs worse when the length of the optimal tour is short.

We can see from TABLE VII that our method can successfully generalize to various CSP tasks.

For CSP tasks in which different cities cover different numbers of cities, the model still performs well: the optimality gap of our method stays within 2% on all CSP instances except CSP50. The worse performance on CSP50 is consistent with the above experiments. Fig. 6 visualizes the solutions obtained by our method and LS1 on a CSP100 instance with randomly generated NC values. The two solutions differ in only two cities, which leads to a slightly worse tour length for our method than for LS1.

Notably, our method outperforms LS1 on CSP tasks with a variable cover radius, and achieves an optimality close to LS1 on CSP instances with a fixed cover radius. Although the model was not trained on these types of CSP tasks, it can still tackle them successfully. Fig. 7 and Fig. 8 visualize the solutions on CSP instances with fixed and variable cover radius, respectively. It can be seen that the solutions obtained by the two compared methods are similar.

It is obvious that our approach always produces a comparable solution while requiring significantly less execution time. The model, trained only on CSP instances with NC=7, can be used to tackle various types of CSP tasks. This generalization ability is mainly attributed to the dynamic embeddings that we design: once a city is covered, its probability of being selected is dynamically adjusted through the learned model parameters.
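The following is a minimal, illustrative sketch of this idea rather than the exact layer used in our model: per-city visited/covered state bits are projected by a learned linear map and added to the static node embeddings at every decoding step, so the network can learn to down-weight cities that are already covered.

import torch
import torch.nn as nn

class DynamicEmbedding(nn.Module):
    # Illustrative sketch (names and dimensions are assumptions): a learned
    # projection of each city's current state (visited, covered) is added to
    # the static node embeddings at every decoding step, so the selection
    # probability of a city can drop once it becomes covered.
    def __init__(self, d_model=128):
        super().__init__()
        self.state_proj = nn.Linear(2, d_model)  # two state bits per city

    def forward(self, static_emb, visited, covered):
        # static_emb: (n, d_model); visited, covered: boolean tensors (n,)
        state = torch.stack([visited.float(), covered.float()], dim=-1)
        return static_emb + self.state_proj(state)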

Fig. 6. Solutions of a CSP100 instance where each city covers a random number of cities: (a) Attention-dynamic-LS (tour length 3.03); (b) LS1 (tour length 2.96).

Fig. 7. Solutions of a CSP100 instance with a fixed cover radius for each city: (a) Attention-dynamic-LS (tour length 3.28); (b) LS1 (tour length 3.16).



TABLE III
ATTENTION-DYNAMIC METHOD VS BASELINES ON SMALL-SCALE CSPS. COST AND EXECUTION TIME ARE AVERAGED OVER 100 RANDOMLY GENERATED INSTANCES. THE GAP IS COMPUTED W.R.T. THE BEST-PERFORMING APPROACH.

Method            CSP20                   CSP50                   CSP100
                  Cost  Gap      Time/s   Cost  Gap      Time/s   Cost   Gap      Time/s
LS1               1.87  8.09%    4.24     2.57  0.00%    15.04    3.58   0.00%    88.85
LS2               1.73  0.00%    6.55     2.67  3.89%    21.61    3.69   3.07%    107.98
PN                4.27  146.82%  0.019    7.85  205.45%  0.027    12.05  236.59%  0.042
PN-dynamic        2.15  24.28%   0.011    3.29  28.02%   0.015    4.68   30.73%   0.033
AM                2.23  28.90%   0.014    3.87  50.58%   0.041    5.02   40.22%   0.064
AM-dynamic        1.89  9.25%    0.014    2.75  7.00%    0.032    4.08   13.97%   0.063
AM-dynamic (LS)   1.85  6.94%    0.5      2.64  2.72%    0.71     3.65   1.96%    2.66

TABLE IV
ATTENTION-DYNAMIC METHOD VS BASELINES ON LARGE-SCALE CSPS. COST AND EXECUTION TIME ARE AVERAGED OVER 100 RANDOMLY GENERATED INSTANCES. THE GAP IS COMPUTED W.R.T. THE BEST-PERFORMING APPROACH.

Method            CSP150                   CSP200                   CSP300
                  Cost   Gap      Time/s   Cost   Gap      Time/s   Cost   Gap      Time/s
LS1               4.28   0.00%    331.45   4.86   0.00%    755.77   5.93   0.00%    2637.55
LS2               4.49   4.91%    362.28   5.16   6.17%    695.77   6.34   6.91%    2428.75
PN                16.25  279.67%  0.087    22.54  363.79%  0.117    29.68  400.51%  0.228
PN-dynamic        6.47   51.17%   0.042    8.44   73.66%   0.076    11.17  88.36%   0.128
AM                6.11   42.76%   0.102    8.28   70.37%   0.162    10.52  77.40%   0.291
AM-dynamic        5.19   21.26%   0.071    6.01   23.66%   0.082    7.62   28.50%   0.106
AM-dynamic (LS)   4.34   1.40%    7.99     4.93   1.44%    4.93     5.98   0.84%    85.08

TABLE V
COMPARISON OF THE EXECUTION TIME WHEN ALL THE APPROACHES REACH THE SAME COST.

Method            CSP50            CSP100           CSP150           CSP200           CSP300
                  Cost   Stop/s    Cost   Stop/s    Cost   Stop/s    Cost   Stop/s    Cost   Stop/s
LS1               2.619  2.35      3.632  27.69     4.332  128.64    4.930  327.59    5.980  1295.65
LS2               2.642  11.14     3.742  71.43     4.476  290.66    5.086  559.48    6.266  2697.78
AM-dynamic (LS)   2.641  0.78      3.656  2.71      4.337  8.07      4.932  22.14     5.972  85.25

TABLE VI
ATTENTION-DYNAMIC METHOD VS HEURISTIC BASELINES ON CSP TASKS WITH DIFFERENT NC VALUES. THE AVERAGE RESULTS OF 100 RUNS ARE PRESENTED.

NC  Method            CSP100                  CSP150                  CSP200                  CSP300
                      Cost  Gap     Time/s    Cost  Gap     Time/s    Cost  Gap     Time/s    Cost  Gap     Time/s
7   LS1               3.58  0.00%   88.85     4.28  0.00%   331.45    4.86  0.00%   755.77    5.93  0.00%   2637.55
    LS2               3.69  3.07%   107.98    4.49  4.91%   362.28    5.16  6.17%   695.77    6.34  6.91%   2428.75
    AM-dynamic (LS)   3.65  1.96%   2.66      4.34  1.40%   7.99      4.93  1.44%   4.93      5.98  0.84%   85.08
11  LS1               2.94  0.00%   69.60     3.51  0.00%   230.38    4.01  0.00%   557.82    4.93  0.00%   1015.65
    LS2               3.03  3.06%   55.92     3.69  5.13%   144.77    4.22  5.24%   343.93    5.22  5.88%   810.56
    AM-dynamic (LS)   3.00  2.04%   2.69      3.61  2.85%   5.92      4.15  3.49%   8.29      5.02  1.83%   28.45
15  LS1               2.58  0.00%   57.88     3.09  0.00%   170.93    3.53  0.00%   214.22    4.25  0.00%   1379.24
    LS2               2.64  2.33%   35.69     3.21  3.88%   96.37     3.65  3.40%   425.66    4.44  4.47%   720.74
    AM-dynamic (LS)   3.02  17.05%  1.12      3.17  2.59%   4.22      3.60  1.98%   12.32     4.33  1.88%   35.26


TABLE VII
ATTENTION-DYNAMIC METHOD VS HEURISTIC BASELINES ON VARIOUS CSP TASKS. THE AVERAGE RESULTS OF 100 RUNS ARE PRESENTED.

CSP task         Method            CSP50                  CSP100                 CSP150                 CSP200
                                   Cost  Gap    Time/s    Cost  Gap    Time/s    Cost  Gap    Time/s    Cost  Gap    Time/s
Variable NCs     LS1               2.34  0.00%  5.62      2.96  0.00%  18.76     3.65  0.00%  93.25     4.14  0.00%  175.68
                 LS2               2.36  0.85%  5.33      3.02  2.03%  22.53     3.77  3.29%  67.65     4.27  3.14%  128.65
                 AM-dynamic (LS)   2.54  8.55%  0.36      3.01  1.69%  1.56      3.70  1.37%  2.82      4.20  1.45%  6.52
Fixed radius     LS1               2.94  0.00%  17.63     2.96  0.00%  49.10     2.93  0.00%  119.34    2.93  0.00%  237.54
                 LS2               3.13  6.46%  21.40     3.04  2.70%  57.58     3.01  2.73%  87.78     2.96  1.02%  124.57
                 AM-dynamic (LS)   2.96  0.68%  1.52      2.99  1.01%  1.48      2.99  2.05%  2.42      2.97  1.37%  4.55
Variable radius  LS1               3.71  7.85%  17.84     3.41  0.59%  53.09     3.29  0.00%  81.28     3.17  0.63%  115.38
                 LS2               3.76  9.30%  20.53     3.62  6.78%  47.24     3.56  8.21%  63.33     3.46  9.84%  85.21
                 AM-dynamic (LS)   3.44  0.00%  1.43      3.39  0.00%  9.56      3.31  0.61%  6.35      3.15  0.00%  3.45

Fig. 8. Solutions of a CSP100 instance where each city has a random covering radius: (a) Attention-dynamic-LS (tour length 3.26); (b) LS1 (tour length 3.41).

VIII. CONCLUSION

Covering salesman problems arise in numerous application domains such as disaster planning. However, as the problem dimension increases in real-world applications, traditional approaches suffer from prohibitive execution times. Moreover, considerable development effort is required to design the heuristics. In this context, we introduce a new deep learning approach for the CSP in this study. Different from traditional solvers, in this approach a deep neural network is used to directly output a permutation of the input. The model is trained with the REINFORCE algorithm using a greedy rollout baseline, in an unsupervised manner. As an end-to-end approach, this method shows the desirable properties of fast solving speed and the ability to generalize to unseen instances. There is still an optimality gap between our method and traditional solvers. However, in situations where the problem is large-scale and requires quick decisions, the proposed approach is highly desirable as it requires significantly less execution time. Moreover, esoteric domain knowledge and a trial-and-error process are needed to design traditional heuristic methods. Years of effort have been devoted to engineering heuristic rules for CSPs, yet no real breakthrough has been accomplished, and most of the designed heuristics are problem-specific and need to be revised once the problem setting changes. The proposed deep learning approach is therefore attractive, as it learns the heuristic from data by itself and automates the process of decision-making.
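For reference, the gradient estimator behind REINFORCE with a greedy rollout baseline takes the standard form [7], [38]:

\nabla_\theta J(\theta \mid s) = \mathbb{E}_{\pi \sim p_\theta(\cdot \mid s)} \big[ (L(\pi) - b(s)) \, \nabla_\theta \log p_\theta(\pi \mid s) \big],

where L(\pi) is the length of the tour \pi and b(s) is the length of the tour obtained by a greedy rollout of the baseline policy on instance s.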

The proposed method provides a desirable trade-off between solution quality and execution time. It also significantly outperforms the current state-of-the-art deep learning approaches for combinatorial optimization. However, better solution quality remains the significant advantage of traditional solvers, and improving the solution quality of the deep learning approach is a pressing direction for future research. The main challenge is how to handle the dynamic feature of the problem more effectively: better mechanisms should be developed to enable the agent to understand how the context changes when a city is visited and others are covered, and how this dynamic change affects the policy.

REFERENCES

[1] J. R. Current and D. A. Schilling, "The covering salesman problem," Transportation Science, vol. 23, no. 3, pp. 208–213, 1989.
[2] S. R. Shariff, N. H. Moin, and M. Omar, "Location allocation modeling for healthcare facility planning in Malaysia," Computers & Industrial Engineering, vol. 62, no. 4, pp. 1000–1010, 2012.
[3] D. Reina, S. T. Marin, N. Bessis, F. Barrero, and E. Asimakopoulou, "An evolutionary computation approach for optimizing connectivity in disaster response scenarios," Applied Soft Computing, vol. 13, no. 2, pp. 833–845, 2013.
[4] N. Vesselinova, R. Steinert, D. F. Perez-Ramirez, and M. Boman, "Learning combinatorial optimization on graphs: A survey with applications to networking," IEEE Access, vol. 8, pp. 120388–120416, 2020.
[5] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., "Mastering the game of Go without human knowledge," Nature, vol. 550, no. 7676, pp. 354–359, 2017.
[6] O. Vinyals, M. Fortunato, and N. Jaitly, "Pointer networks," in Advances in Neural Information Processing Systems, 2015, pp. 2692–2700.
[7] W. Kool, H. van Hoof, and M. Welling, "Attention, learn to solve routing problems!" arXiv preprint arXiv:1803.08475, 2018.
[8] B. Golden, Z. Naji-Azimi, S. Raghavan, M. Salari, and P. Toth, "The generalized covering salesman problem," INFORMS Journal on Computing, vol. 24, no. 4, pp. 534–553, 2012.
[9] M. Salari and Z. Naji-Azimi, "An integer programming-based local search for the covering salesman problem," Computers & Operations Research, vol. 39, no. 11, pp. 2594–2602, 2012.
[10] M. H. Shaelaie, M. Salari, and Z. Naji-Azimi, "The generalized covering traveling salesman problem," Applied Soft Computing, vol. 24, pp. 867–878, 2014.
[11] P. Venkatesh, G. Srivastava, and A. Singh, "A multi-start iterated local search algorithm with variable degree of perturbation for the covering salesman problem," in Harmony Search and Nature Inspired Optimization Algorithms. Springer, 2019, pp. 279–292.
[12] M. Salari, M. Reihaneh, and M. S. Sabbagh, "Combining ant colony optimization algorithm and dynamic programming technique for solving the covering salesman problem," Computers & Industrial Engineering, vol. 83, pp. 244–251, 2015.
[13] S. P. Tripathy, A. Tulshyan, S. Kar, and T. Pal, "A metameric genetic algorithm with new operator for covering salesman problem with full coverage," International Journal of Control Theory and Applications, vol. 10, no. 7, pp. 245–252, 2017.
[14] V. Pandiri, A. Singh, and A. Rossi, "Two hybrid metaheuristic approaches for the covering salesman problem," Neural Computing and Applications, pp. 1–21, 2020.
[15] V. Pandiri and A. Singh, "An artificial bee colony algorithm with variable degree of perturbation for the generalized covering traveling salesman problem," Applied Soft Computing, vol. 78, pp. 481–495, 2019.
[16] J. J. Hopfield and D. W. Tank, ""Neural" computation of decisions in optimization problems," Biological Cybernetics, vol. 52, no. 3, pp. 141–152, 1985.
[17] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," Advances in Neural Information Processing Systems, vol. 27, pp. 3104–3112, 2014.
[18] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, "The graph neural network model," IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2008.
[19] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio, "Neural combinatorial optimization with reinforcement learning," arXiv preprint arXiv:1611.09940, 2016.
[20] V. R. Konda and J. N. Tsitsiklis, "Actor-critic algorithms," in Advances in Neural Information Processing Systems, 2000, pp. 1008–1014.
[21] M. Nazari, A. Oroojlooy, L. V. Snyder, and M. Takac, "Deep reinforcement learning for solving the vehicle routing problem," arXiv preprint arXiv:1802.04240, 2018.
[22] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[23] E. Khalil, H. Dai, Y. Zhang, B. Dilkina, and L. Song, "Learning combinatorial optimization algorithms over graphs," in Advances in Neural Information Processing Systems, 2017, pp. 6348–6358.
[24] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[25] A. Mittal, A. Dhawan, S. Manchanda, S. Medya, S. Ranu, and A. Singh, "Learning heuristics over large graphs via deep reinforcement learning," arXiv preprint arXiv:1903.03332, 2019.
[26] A. Nowak, S. Villar, A. S. Bandeira, and J. Bruna, "A note on learning algorithms for quadratic assignment with graph neural networks," in Proceedings of the 34th International Conference on Machine Learning (ICML), vol. 1050, 2017, p. 22.
[27] C. K. Joshi, T. Laurent, and X. Bresson, "An efficient graph convolutional network technique for the travelling salesman problem," arXiv preprint arXiv:1906.01227, 2019.
[28] Z. Li, Q. Chen, and V. Koltun, "Combinatorial optimization with graph convolutional networks and guided tree search," in Advances in Neural Information Processing Systems, 2018, pp. 539–548.
[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[30] M. Deudon, P. Cournut, A. Lacoste, Y. Adulyasak, and L.-M. Rousseau, "Learning heuristics for the TSP by policy gradient," in International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research. Springer, 2018, pp. 170–181.
[31] G. A. Croes, "A method for solving traveling-salesman problems," Operations Research, vol. 6, no. 6, pp. 791–812, 1958.
[32] K. Li, T. Zhang, and R. Wang, "Deep reinforcement learning for multiobjective optimization," IEEE Transactions on Cybernetics, 2020.
[33] K. Abe, I. Sato, and M. Sugiyama, "Solving NP-hard problems on graphs by reinforcement learning without domain knowledge," Simulation, vol. 1, pp. 1–1, 2019.
[34] E. Yolcu and B. Poczos, "Learning local search heuristics for Boolean satisfiability," in Advances in Neural Information Processing Systems, 2019, pp. 7992–8003.
[35] C. K. Joshi, T. Laurent, and X. Bresson, "On learning paradigms for the travelling salesman problem," arXiv preprint arXiv:1910.07210, 2019.
[36] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[37] K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[38] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3-4, pp. 229–256, 1992.
[39] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.