23
Efficient Indexing and Query Processing for Secure Dynamic Cloud Storage Qian Wang 1 , Minxin Du 1 , Kui Ren 2 , Meiqi He 3 , and Jian Weng 4 1 Dept. of CS, Wuhan University 2 Dept. of CSE, University at Buffalo, SUNY 3 Dept. of CS, The University of Hong Kong 4 Dept. of CS, Jinan University [email protected], [email protected] Abstract. With the increasing popularity of cloud-based data services, data owners are highly motivated to store their huge amount of potentially sensitive personal data files on remote servers in encrypted form. Clients later can query over the encrypted database to retrieve files of interest while preventing database servers from learning private information about the contents of files and queries. Prior solutions, however, either have high search time complexity and storage cost with privacy guarantee un- der weaker security models, or cannot support efficient data operations and parallel computations, or cannot achieve forward privacy and leak a lot of additional infor- mation. In this paper, we investigate new and novel privacy-preserving indexing and query processing protocols which meet a number of desirable properties, including the multi-keyword query processing with conjunction and disjunction logic queries, practically high privacy guarantee with non-adaptive/adaptive chosen keyword at- tack (CKA1/CKA2) security and forward privacy, highly efficient query processing, and the support of dynamic data operations etc. Compared to previous schemes, our solution is highly compact, efficient and flexible. Its performance and security are carefully characterized by rigorous analysis. Experimental evaluations conducted over a large representative dataset demonstrate that our solutions can achieve high search time efficiency, and they are practical for use in large-scale encrypted database systems. Key words: Cloud storage, searchable encryption, multi-keyword query, homomor- phic encryption 1 Introduction In the cloud computing paradigm, providing database-as-a-service (DaaS) allows a third party service provider to host database as a service, providing its customers seamless mech- anisms to create, store, and access databases at cloud with adequate storage resource, con- venient data access and reduced management and infrastructure costs [1]. But database outsourcing also raises data confidentiality and privacy concerns due to data owner’s loss of physical data control. To provide privacy guarantees for sensitive data such as personal identities, health records and financial data etc., a straightforward approach is to encrypt the sensitive data locally before outsourcing [1–3]. While providing strong end-to-end privacy, encryption becomes a hindrance to data computation or utilization, i.e., it is hard to retrieve data files based on their content as in the plaintext search domain. In addition, clients are also concerned about their query privacy, expecting that not only the data content but also the query purposes in plaintext form not to be leaked to the remote database server. To allow a client to retrieve encrypted data files of interest, many searchable symmet- ric encryption (SSE) solutions have been proposed [4–15]. Previous SSE solutions, how- ever, either have high search time complexity and privacy guarantee under weaker security model [4, 5], or cannot support efficient data operations and parallel computations [7, 8], or introduce relatively large storage cost and low search efficiency [10,11], or cannot achieve

Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

  • Upload
    others

  • View
    18

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

Efficient Indexing and Query Processing for SecureDynamic Cloud Storage

Qian Wang1, Minxin Du1, Kui Ren2, Meiqi He3, and Jian Weng4

1Dept. of CS, Wuhan University2Dept. of CSE, University at Buffalo, SUNY3Dept. of CS, The University of Hong Kong

4Dept. of CS, Jinan [email protected], [email protected]

Abstract. With the increasing popularity of cloud-based data services, data ownersare highly motivated to store their huge amount of potentially sensitive personal datafiles on remote servers in encrypted form. Clients later can query over the encrypteddatabase to retrieve files of interest while preventing database servers from learningprivate information about the contents of files and queries. Prior solutions, however,either have high search time complexity and storage cost with privacy guarantee un-der weaker security models, or cannot support efficient data operations and parallelcomputations, or cannot achieve forward privacy and leak a lot of additional infor-mation. In this paper, we investigate new and novel privacy-preserving indexing andquery processing protocols which meet a number of desirable properties, includingthe multi-keyword query processing with conjunction and disjunction logic queries,practically high privacy guarantee with non-adaptive/adaptive chosen keyword at-tack (CKA1/CKA2) security and forward privacy, highly efficient query processing,and the support of dynamic data operations etc. Compared to previous schemes,our solution is highly compact, efficient and flexible. Its performance and securityare carefully characterized by rigorous analysis. Experimental evaluations conductedover a large representative dataset demonstrate that our solutions can achieve highsearch time efficiency, and they are practical for use in large-scale encrypted databasesystems.

Key words: Cloud storage, searchable encryption, multi-keyword query, homomor-phic encryption

1 Introduction

In the cloud computing paradigm, providing database-as-a-service (DaaS) allows a thirdparty service provider to host database as a service, providing its customers seamless mech-anisms to create, store, and access databases at cloud with adequate storage resource, con-venient data access and reduced management and infrastructure costs [1]. But databaseoutsourcing also raises data confidentiality and privacy concerns due to data owner’s lossof physical data control. To provide privacy guarantees for sensitive data such as personalidentities, health records and financial data etc., a straightforward approach is to encrypt thesensitive data locally before outsourcing [1–3]. While providing strong end-to-end privacy,encryption becomes a hindrance to data computation or utilization, i.e., it is hard to retrievedata files based on their content as in the plaintext search domain. In addition, clients arealso concerned about their query privacy, expecting that not only the data content but alsothe query purposes in plaintext form not to be leaked to the remote database server.

To allow a client to retrieve encrypted data files of interest, many searchable symmet-ric encryption (SSE) solutions have been proposed [4–15]. Previous SSE solutions, how-ever, either have high search time complexity and privacy guarantee under weaker securitymodel [4, 5], or cannot support efficient data operations and parallel computations [7, 8], orintroduce relatively large storage cost and low search efficiency [10, 11], or cannot achieve

Page 2: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

2 Qian Wang1, Minxin Du1, Kui Ren2, Meiqi He3, and Jian Weng4

forward privacy and leak a lot of additional information [9,12,14]. In recent years, researchershave put great effort into the practicality of SSE constructions. Kamara et al. [9] designeda two-level dynamic encrypted index based on the inverted index approach. As a followingwork, Kamara et al. [12] further proposed a parallelizable and dynamic SSE scheme basedon the red-black tree data structure, achieving stronger privacy guarantee for data updates.In [13], Cash et al. proposed a new SSE construction, supporting sublinear multi-keywordBoolean queries with relatively high computational burden on the client. Hahn et al. [14]recently proposed a similar construction which also used link lists to construct the invertedindex. However, a data structure history which stores previously-used search tokens needsto be maintained at the client side, and the server has to traverse a sub-index with timecost linear to the number of document-keyword pairs. Note that the above dynamic DSSEschemes cannot achieve forward privacy, which means that the database server cannot learnwhether an updated data file contains a keyword w that have been searched or not in thepast. In fact, besides forward privacy, additional information such as the hashes/search to-kens of keywords that have not been queried are leaked as well. To address this problem,Stefanov et al. [15] recently introduced a new SSE scheme that achieves forward privacy.Their main idea is to store document-keyword pairs in a hierarchical structure of logarithmiclevels by using algorithmic techniques in ORAM. Unfortunately, their search time cost isextremely large with the increase of the number of document-keyword pairs, and the updatesof data files requires several rounds of communication between the server and the client withnon-negligible communication and computation costs.

In this paper, we propose a suite of new and novel SSE index designs for processing queriesover encrypted cloud storage. Our construction carefully makes a trade-off between queryefficiency and privacy, achieving flexible query functionalities. In particular, our solutionhas a very compact index construction with practical privacy guarantee (including forwardprivacy) and supports multi-keyword query (including both conjunction and disjunctionlogic queries), scalability and highly efficient query processing and data updates. Our maintechnical contributions are as follows.

– We propose a bucket-encrypting index structure (BEIS) with random generator, namedBEIS-I. On the basis of probabilistic inverted index coding structure, we efficientlyencrypt data identifier vectors (DIVs) using random numbers. BEIS-I naturally allowshighly efficient one-round multi-keyword query with practical privacy guarantee, i.e.,CKA1-security. Due to the decomposable index structure, BEIS-I naturally supportsefficient and secure dynamic data updates.

– We propose a bucket-encrypting index structure (BEIS) with homomorphic generator,named BEIS-II, to further enhance the index privacy and achieve CKA2-security inthe random oracle model. BEIS-II encrypts the basic probabilistic index structure usingadditively homomorphic encryption and PRFs. Due to its support of efficient summationoperations, query processing over BEIS-II can also achieve efficient and secure batch dataupdates by fully leveraging the cloud capability with moderate communication costs.

– We present both correctness analysis and formal security analysis of our schemes, show-ing that our index construction enjoys security against non-adaptive/adaptive chosen-keyword attacks and also achieves forward privacy. To demonstrate the practicality ofour solutions, we conduct experimental evaluations using large representative data sets.

2 Related Work, Preliminaries and Definitions

2.1 Related Work

Following the idea of trading security for efficiency, the first notable SSE scheme was pro-posed by Song et al. [4] with weaken security, and the search time of the scheme is linear inthe length of the file collection. By associating an encrypted index to each file, the SSE con-struction proposed by Goh [5] achieves search time that is proportional to #d, the number

Page 3: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

Sample Title 3

of files in the collection. In particular, a formal security model (IND-CKA) with respect toindex indistinguishability was first introduced to prove the semantic security of indexes. Oneappealing property of this solution is that it handles dynamic data efficiently and securely(without revealing any private information about the contents of data and keywords). In2006, Curtmola et al. [7] generalized security definitions of SSE (CKA1 and CKA2) andproposed a new encrypted index structure that associates with the entire file collection. Theresulting search time is optimal and sublinear in #dw, the number of files that contain key-word w. Kamara et al. [9] extended the inverted index approach [7] and designed a two-leveldynamic encrypted index which allows for both addition and removal of data files. Thisconstruction achieves the overall best performance in single-core architectures due to itshigh query efficiency and dynamic index. Motivated by advances in multi-core architectures,Kamara et al. [12] proposed a parallelizable and dynamic SSE scheme based on red-blacktree data structure. The search executes in parallel O(#dw

p log#d) time, where p and #dare the number of the available processors and the total data files, respectively.

Recently, a similar construction which used link lists to construct the inverted index, wasproposed by Hahn et al. [14]. The client in their scheme need to maintain a data structurehistory σ which stores previously-used search tokens in order to get rid of the time costof O(|d| · |σ|), where |d| is the number of keywords in newly-added file d and |σ| is thenumber of link lists in the γw during file additions. The good thing is that if the tokenswere submitted before during the search, then the server can return the matched resultsimmediately according to the index γw. However, in any case the server has to traverse thewhole index γf and this time cost is linearly to the number of document-keyword pairs. Moreseriously, the above dynamic DSSE schemes such as [9,12,14] cannot achieve forward privacy,and they also leak a lot of additional information during the updates, e.g., the keywordhashes shared between data files besides the keyword hashes of the queried keywords. Therecently introduced SSE scheme of Stefanov et al. [15] achieves forward privacy. It storeddocument-keyword pairs in a hierarchical structure of logarithmic levels, which is reminiscentof algorithmic techniques used in the ORAM literature. From the perspective of performance,its worst search complexity is O(mlog3N), where m is the number of documents containingquery keyword and N is the number of document-keyword pairs. Although the theoreticalcost is sub-linear, the actual search time cost will be extremely large with the increase ofN . On the other side, each update in their schemes induces O(log(N)) bandwidth and takesO(log2(N)) time in the worst-case. When one level needs to be rebuilt during update, it mustensure that the client has O(Nα) local storage for 0 < α < 1, and it also induces severalrounds of communication between the server and the client. Moreover, it needs to resizethe whole data structure when the number of deleted entries equals N/2, otherwise it willcause a blow-up in the space of the data structure. Besides, from the perspective of searchfunctionality, one common limitation of the above SSE solutions is that they only supportsingle-keyword search, the direct extension to multi-keyword search requires to compute theintersection of disjunctive queries for each keyword.

The multi-keyword search problem over encrypted data was investigated by Cao et.al [11]and a multi-keyword ranked search scheme was proposed using the secure kNN computationtechnique [16]. However, their approach is limited by the large index storage cost due to itsassociated large dictionary, and the low search efficiency due to its high vector computationcost introduced in the one-to-one search. Based on similar approaches, Wang et al. [17]proposed a privacy-preserving multi-keyword fuzzy search scheme over encrypted data byusing locality sensitive hashing (LSH) and secure kNN. Still, like [11], this scheme alsolacks formal and thorough analysis under a well-defined and rigorous security model asacknowledged by [16]. In the recent literature [13], the authors designed the first SSE schemecalled OXT which supports sublinear multi-keyword Boolean queries by providing trade-offsbetween performance and privacy. A key assumption made in their work is that the clientor the server maintains a state called s-term, which is the conjunctive term chosen as theestimated least frequency keywords among the query keywords. The choice of s-terms highly

Page 4: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

4 Qian Wang1, Minxin Du1, Kui Ren2, Meiqi He3, and Jian Weng4

depends on general statistics about attributes (in a rational database) or keyword/termfrequency for free texts. For each query, their schemes have to execute at most three roundsof interactions for the security concern, and put too much computational burden on theclient (the time complexity of the trapdoor generation in terms of token is O(#q#dw),where a token can be roughly expressed as gx and #q is the number of query keywords. Aswith most of the existing dynamic SSE schemes, OXT also fails to achieve forward privacy.

2.2 Problem Statement and Notations

In cloud-based database systems, the data owner outsources a large-scale collection of datafiles d = (d1, . . . , d#d) to the remote database server in encrypted form c = (c1, . . . , c#c).The data files d can represent text files or records in a relational database. To enable thequery service over c for effective information retrieval, the data owner needs to build anencrypted index γ based on d and outsources (γ, c) to the remote database server. Later, aclient (for ease of exposition, in the following statements when we mention a client, we refer itas the data owner or an authorized user) can use a multi-keyword query q = {w1, . . . , w#q}(note that for a relational database, keywords are attribute-value pairs) to query and retrievedata files of interest in the database. To this end, the client generates a search token τ andsends it to the remote cloud server. After receiving τ , the cloud executes the query on γand returns all the encrypted data files cq or their corresponding encrypted IDs that satisfythe query. Each of cq contains all the query keywords in q for the ‘AND’ logic query, or itcontains the ranked encrypted data files for the ‘OR’ logic query based on a specific similaritymetric. Finally, the client locally decrypts cq and obtain data contents in plaintext form.

The database system is a dynamic one. That is, at any time the data owner or authorizedusers may add or remove a set of data files (batch file updates) from the remote database.To do so, the client generates an encrypted update token τa/τd. Given τa/τd, the databaseserver can execute secure updates on γ and c. Formally, a searchable encrypted databasesystem is defined as below.

Definition 1. (A Searchable Encrypted Database) A searchable encrypted database consistsof a sequence of polynomial-time algorithms (KeyGen, BuildIndex, Trapdoor, Query, Recovery,UpdateTokenGen, Update) such that:

K ← KeyGen(1k): is a probabilistic key generation algorithm run by the data owner. It takesas input a security parameter k and outputs the key K.

(γ, c) ← BuildIndex(K,d): is a (possibly probabilistic) algorithm run by the data owner tobuild the encrypted index. It takes as input the key K and data collection d and outputs anencrypted index γ and the ciphertexts c.

τ ← Trapdoor(K,q): is a (possibly probabilistic) algorithm run by the data owner or autho-rized users to generate a search token. It takes as input K and a given query request q, andoutputs the trapdoor τ .

cq/r ← Query(τ ,γ, c): is a deterministic algorithm that run by the database server. It takesas input the search token τ , the encrypted index γ and c and outputs a sequence of encrypteddata files cq or their corresponding encrypted file IDs r.

id(cq) ← Recovery(K, r): is a deterministic algorithm that run by the data owner or autho-rized users. It takes as input the key K and the encrypted file IDs r, and outputs file IDs ofcq.

τa/τd ← UpdateTokenGen(K, dj): is a (possibly probabilistic) algorithm that takes as inputthe key K and a data file dj to be added or deleted, and outputs an update token τa/τd.

(γ′, c′) ← Update(τa/τd,γ, c): is a deterministic algorithm run by the database server. Ittakes as input the encrypted index γ and a update token τa/τd and outputs the updatedencrypted data collection c′ and index γ′.

Note that in UpdateTokenGen, the input dj can be replaced by a set of data files and thusthe output update token τa/τd is used for batch file updates. We use id(dj) to denote

Page 5: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

Sample Title 5

the identifier of data file dj and id(d) to denote the identifiers of all data files in d. Thekeyword universe (which contains all distinct keywords) extracted from d is denoted byw = (w1, . . . , w#w). In our construction, we use div ∈ {0, 1}#d to denote a data identifiervector (DIV), where the jth entry of div is 1 if dj is included; 0 otherwise. By this defini-tion, given wi, we can use divi or divwi to denote all data files that contain wi (i.e., thecorresponding entries in divi are 1’s.). Different from most previous SSE constructions, weconsider multi-keyword query q = {w1, . . . , w#q}. Note that, for ease of understanding,when #q = 1 we use w to denote the single-keyword query, i.e., q = {w}, in the follow-ing discussion. We use dq to denote all the data files that contain q for the conjunctionquery logic (for simplicity, we also abuse the notation dq to denote the data files for thedisjunction logic query). Similarly, we use dw to denote all the data files that contain w.Standard bitwise Boolean operations are defined on binary vectors: bitwise OR (union) “

∨”

and bitwise AND (intersaction) “∧”, which also take the same notations for single bit OR

and AND operations, respectively. Given a vector v, we refer its jth element as v[j] or vj .

2.3 Security Model and Definitions

We follow the widely-accepted security definitions of SSE in [5, 7–9], which are properlytuned for our construction.

Definition 2. (Search Pattern π). Given a search query q at time t, the search patternπ(d, q) is defined as a binary vector of length t with a 1 at location i if the search at timei ≤ t was for q; 0 otherwise.

The search pattern reveals whether the same search was performed in the past or not. Inthe above definition, for the single-keyword query, q = {w}.

Definition 3. (Access Pattern Ap). Given a search query q at time t, the access pattern isdefined as Ap(d, q) = {id(cq)}.

Specifically, in the above definition, Ap(d,q) = id(dq) = id(cq). For the single-keywordquery, Ap(d,q) = {id(cw)}. Note that, in a one-round query the access pattern explicitlyincludes id(cq) while in a two-round query the access pattern implicitly includes intermedi-ate results, e.g., the encrypted div’s extracted from the hit buckets. We emphasize that theencrypted div’s and their encrypted aggregation (see Fig. 1) do not reveal additional infor-mation beyond id(cq) since they are encrypted intermediate results before the recovery ofid(cq).

Definition 4. (CKA1/CKA2-security for A Searchable Encrypted Database). A searchableencrypted database is said to be non-adaptively secure or CKA1-secure (adaptively secure orCKA2-secure), if for any PPT stateful adversary A, there exists a PPT stateful simulatorS such that for the following probabilistic experiments:

RealBEIS,A(k) : the challenger runs KeyGen(1k) to generate the key K. A outputs d andreceives (γ, c)← BuildIndexK(d) from the challenger. A makes a polynomial number ofnon-adaptive (adaptive) queries q’s. For each query q, A receives from the challenger asearch token τ such that τ ← TrapdoorK(q). Finally, based on (γ, c) and the receivedsearch tokens τ ’s, A returns a bit b that is output by the experiment.

IdealBEIS,A,S(k) : A outputs d and sends it to S. Given L1(d), S generates and sends (γ, c)to A. A makes a polynomial number of non-adaptive (adaptive) queries q’s. For eachquery q, S is given L2(d,q), and it returns an appropriate a search token τ . Finally,based on (γ, c) and the received search tokens τ ’s, A returns a bit b that is output bythe experiment.

Page 6: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

6 Qian Wang1, Minxin Du1, Kui Ren2, Meiqi He3, and Jian Weng4

|Pr[RealBEIS,A(k) = 1]− Pr[IdealBEIS,A,S(k) = 1]| ≤ negl(k), (1)

where L1 and L2 are stateful leakage algorithms.Note that security against non-adaptive chosen-keyword attacks (CKA1) guarantees se-

curity when the client’s queries are independent of the encrypted index and the results ofprevious queries, i.e., the queries are performed in a batch. In contrast, security againstadaptive chosen-keyword attacks (CKA2) provides security when the query keywords arechosen as a function of the encrypted index and the results of previous queries. That meansthe simulator should be able to simulate the encrypted index (consistent with future un-known queries) before it receives any search results (i.e., leakage).

3 Our Constructions

3.1 Probabilistic Coding Data Structure: The Basic Construction

Based on Definition 1, we present the basic structure of our index construction.

KeyGen(1k): Given a security parameter k, generate two random k-bit strings sk1 and λ.Call sk2 ← SKE.KeyGen(1k, λ), where SKE is a PCPA-secure symmetric encryption scheme.Output K = (sk1, sk2).

BuildIndex(K,d): Extract w = {w1, . . . , w#w} from d and generate {div1, . . . , div#w}. Con-struct a data structure γ consisting of m buckets. Each bucket of γ is initially set to beempty. Select a collection of r independent pseudo-random functions hi=[1,r] = {hi|i ∈[1, r]}, where hi : {0, 1}∗ × {0, 1}k → {0, 1}k. For each wi (i ∈ [1,#w]), compute (x1 =h1(wi, sk1), . . . , xr = hr(wi, sk1)). For each xj (j ∈ [1, r]), insert divi into bucket γ[xj ] ac-cording to the following rule: If the bucket γ[xj ] is empty, store divi in the bucket, denoted byγ[xj ] = divi; otherwise update the bucket by storing the “union” of divi and the “old” DIVpreviously stored in bucket γ[xj ], denoted by γ[xj ] = γ[xj ]

∨divi. Let c← SKE.Enc(sk2,d).

Output (γ, c).

Trapdoor(sk,q): For each wi (i ∈ [1,#q]), compute (x1 = h1(wi, sk1), . . . , xr = hr(wi, sk1)).Generate the union of all positions as τ . Output τ .

Query(γ, c, τ ): Execute τ over γ according to the client-specified query logic (‘AND’ or‘OR’.):1) Conjunction logic query. The server computes divτ from the buckets at the all positionsin τ as divτ =

∧y∈τ γ[y]. Recover data identifiers from divτ and output the corresponding

encrypted data files. To state the correctness, let τw be the sub-trapdoor for a single keywordw in q. Thus, τw ⊂ τ and

divτ =∧w∈q

∧y∈τw

γ[y] =∧

w∈q,y∈τw

γ[y] =∧y∈τ

γ[y]. (2)

It is easy to see that the execution of conjunction logic query should return data files thatcontain all keywords in the multi-keyword query q. In the following, we also define dis-junction logic query, the execution of which will return data files that contain as many askeywords in the multi-keyword query q, based on a pre-defined similarity metric.2) Disjunction logic query. The server computes the similarity score, denoted by score(dj ,q),between data file dj and query q. We define the score for file dj as

score(dj ,q) = r ·1(∧

y∈{τ [1],...,τ [r]}

γ[y][j] = 1)+ · · ·+r ·1(∧

y∈{τ [#qr−r+1],...,τ [#qr]}

γ[y][j] = 1),

(3)where 1(·) denotes the indicator function, and it equals to 1 when the condition in 1(·)holds; 0 otherwise. Output the ranked encrypted data files (e.g., score(dj ,q) ≥ r means atleast one keyword in q should be contained in each returned data file).

Page 7: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

Sample Title 7

Remarks. As can be seen from Eq. (2), for multi-keyword conjunction queries, the queryresult of the probabilistic coding data structure (i.e., our basic index structure) is equivalentto that of the post-processing of search results of a sequence of single-keyword queries. Formulti-keyword disjunction queries, a keyword contained in both the data dj (during theindex construction) and the query q will be mapped to the same set of buckets in the index.Therefore, score(dj ,q) is simply the number of common buckets mapped by the keywordsin dj and q. Intuitively, the larger number of common buckets, the higher similarity betweenthe data dj and the query q. If a data file dj contains at least one query keyword, there mustbe score(dj ,q) ≥ r. Due to this property, the server can send back all ranked encrypteddata items with score(dj ,q) ≥ r or the ranked encrypted data items with the top-κ highestscores, where κ could be user-specified.

3.2 BEIS: Bucket-Encrypting Index Structure

On the basis of probabilistic coding data structure, we present a bucket-encrypting indexstructure (BEIS), enabling efficient query processing over encrypted data and providing se-mantic security against non-adaptive or adaptive adversaries. BEIS inherits the bucket-basedindex structure (the probabilistic coding data structure) to gain efficient query performanceand encrypts all information in the index structure. Our analysis and experimental evalua-tion show that BEIS enjoys very good performance while guaranteeing practical and highprivacy guarantee.BEIS-I: BEIS with Random Generator. The centerpiece of BEIS is to design a newnon-trivial and non-deterministic encryption function which encrypts the bits of the dataidentifier vector stored in the indexing buckets and a search algorithm which outputs dataidentifiers of the target encrypted data files.

KeyGen(1k): Given a security parameter k, choose a non-zero l-bit integer α and two k-bit strings sk1, λ uniformly at random. Generate sk2 ← SKE.Gen(1k, λ), where SKE is aPCPA-secure symmetric encryption scheme. Output K = (α, sk1, sk2).

Let F : {0, 1}k × {0, 1}∗ → {0, 1}l be a pseudo-random function, which is a polynomial-time computable function that cannot be distinguished from a random function by anypolynomial-time adversaries.

BuildIndex(K,d): After constructing the basic probabilistic coding data structure γ fromdata files d, we encrypt γ as follows: Assume m is the number of buckets, for each γ[y][j](y ∈ [1,m], j ∈ [1,#d]), i.e., the j-th bit of the vector stored in the y-th bucket of γ:

– Let b = γ[y][j] denote the original bit stored in γ[y][j]. We store the encrypted form ofb into γ[y][j] as

γ[y][j] = αb+ ζyj , (4)

where ζyj = F (sk1, y||j||ctr), and ctr denotes a counter that starts from 1. The defaultvalue is ζyj = F (sk1, y||j||1). However, there are two cases we manage differently fromgeneral cases: i) if b = 0 and for position j, ζyj = F (sk1, y||j||1) = 0l, then we increase ctruntil a non-zero ζyj is obtained; ii) if b = 1 and for position j, ζyj = F (sk1, y||j||1) = 1l,then we increase ctr until a non-maximized value of ζyj is obtained. For both cases, theclient records the tuple (y, j, ctr). However, this storage is negligible as our experimentalresults show that these two case rarely if ever occur when l is carefully chosen.

The client outsources the encrypted data files c encrypted under sk2 and γ to the cloudserver.

Trapdoor(K,q): The search token τ consists of two components. The first component con-tains bucket positions mapped by all keywords in a multi-keyword query, denoted by Pq ={hj(wi, sk1)|j ∈ [1, r] and i ∈ [1,#q]}. With Pq and sk1, the client recoveries ζyj foreach y ∈ Pq and j ∈ [1,#d]. Another component of the search token is denoted bystq = (stq,1, . . . , stq,#d), which is generated under Pq according to different query log-ics:

Page 8: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

8 Qian Wang1, Minxin Du1, Kui Ren2, Meiqi He3, and Jian Weng4

1) Conjunction logic query. Compute the search token for the ‘AND’ logic

stq,j =∑y∈Pq

(α+ ζyj). (5)

2) Disjunction logic query. Compute the search token for the ‘OR’ logic

stq,j = (∑

y∈{τ [1],...,τ [r]}

ζyj + rα, . . . ,∑

y∈{τ [#qr−r+1],...,τ [#qr]}

ζyj + rα). (6)

Then the client sends the trapdoor τ = {Pq, stq} and the query logic (‘AND’ or ‘OR’)to the cloud server.

Query(γ, c, τ ): After receiving the trapdoor τ = {Pq, stq}, the cloud server executes thequery according to the user-specified query logic. In the following discussion, we use ID todenote the set of identifiers of the returned data files.1) Conjunction logic query. For each j ∈ [1,#d], add id(dj) into ID if∑

y∈Pq

γ[y][j] = stq,j . (7)

2) Disjunction logic query. For each j ∈ [1,#d], compute the similarity score score(dj ,q)

score(dj ,q) = r · 1(∑

y∈{τ [1],...,τ [r]}

γ[y][j]− stq,j [1] = 0), . . . ,

+r · 1(∑

y∈{τ [#qr−r+1],...,τ [#qr]}

γ[y][j]− stq,j [#q] = 0). (8)

Rank the identifiers in ID based on similarity scores and add id(dj) into ID if score(dj ,q)is larger than or equal to a pre-defined threshold. Finally, the cloud server returns thecorresponding encrypted data files.

The following theorem shows the effectiveness of BEIS, i.e., the returned data files andtheir similarity scores in BEIS are the same as those of the plaintext search over the basicindex structure.

Theorem 1. Searching over BEIS with random generator is equivalent to searching overthe basic index structure.

Proof. See Appendix A.

Before formally proving the non-adaptive security of BEIS construction with randomgenerator, we first introduce three auxiliary definitions which characterize the informationwe allow to leak in our SSE construction. We follow [9] to parameterize our definition witha set of leakage functions that capture precisely what is being leaked by the encryptedindex, the ciphertexts and the query process. Beyond these leakage, no more informationis disclosed. During the query process, the interaction between the client and server isdetermined by file collection d and a sequence of keywords that the client wants to query.An instantiation of such an interaction is called a history [7].

Definition 5. (History H). A g-query history over d is a tuple H = (d,Q) that includes dand a vector of g queries Q = {q1, . . . , qg}.

Definition 6. (Leakage function L1). Given d, L1(d) = {id(c1), . . . , id(c#c), |c1|, . . . , |c#c|, c,γ}.

Definition 7. (Leakage function L2). Given H, L2(H) = {π(d,Q), Ap(d,Q)}, where Q ={q1, . . . , qg}.

Page 9: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

Sample Title 9

In Definition 5, each query qi (i ∈ {1, . . . , g}) is a multi-keyword query. For ease of ex-position, in the following security analysis we consider qi as a single-keyword query, i.e.,qi = {wi}. Note that this does not affect our security guarantee in any way. For the non-adaptive security, it means that the adversary generates the history at once, and before ithas finished generating the history the adversary is not allowed to see the index and thesearch tokens (trapdoors) of any query it chooses.

Theorem 2. The BEIS construction with random generator is secure under the CKA1model.

Proof. See Appendix B.

BEIS-II: BEIS with Homomorphic Generator. The above BEIS construction schemewith the random generator only achieves CKA1 security, and the component stq of thesearch token scales with the number of data files. To enhance the security strength andreduce the communication cost of the query, we next present another BEIS constructionwith the homomorphic generator used for generating the encrypted index in BuildIndex.With the homomorphic generator, the client only needs to use Pq as the search token. Afterreceiving the search token, the cloud server can homomorphically compute the summationof encrypted div’s being hit by Pq and return to the client. By doing so, our scheme achievesa more practical value by migrating the computational task to the cloud and eliminatingthe large communication cost in both search token generation and file ID recovery. Beforeintroducing our protocol, we first briefly give an overview of homomorphic encryption.

Homomorphic Encryption allows certain computations to be carried out on cipher-texts to generate an encrypted result which matches the result of operations performedon the plaintext after being decrypted. Given a message a, we write the encryption of aas JaKpk, or simply JaK, where pk is the public key. Although homomorphic encryption forgeneral operations is prohibitively slow, it is efficient for summation operations. To supportefficient summations, in our paper we use Paillier cryptosystem [18] which is additively ho-momorphic. In the Paillier cryptosystem, if the public key is the modulus n = pq and thebase g ∈ Z∗

n2 , where p and q are two large prime numbers (of equivalent length) chosenrandomly and independently. Then the encryption of a message x ∈ Zn is JxK = gx · rnmod n2, for some random r ∈ Z∗

n. Correspondingly, the decryption of the ciphertext isx = L(JxKφ(n) mod n2) · φ−1(n) mod n, where L(u) = u−1

n . The homomorphic propertyof the Paillier cryptosystem is given by Jx1K · Jx2K = (gx1 · rn1 ) · (gx2 · rn2 ) = gx1+x2(r1r2)

n

mod n2 = Jx1 + x2K. In Paillier cryptosystem, the public (encryption) key is pkp = (n, g),and the private (decryption) key is skp = (φ(n), φ(n)−1 mod n).

The query processing over BEIS construction with homomorphic generator is describedin Fig. 1. In this new index scheme, the basic probabilistic index structure is encryptedusing homomorphic encryption, which is CPA-secure. While the ciphertext length is 2048bits long for each 1024-bit plaintext data value, we carefully insert a certain number of 0’sbetween each bit of div, packing bit values from different rows in a div into one 1024-bitvalue at different bit offsets. This is to ensure that the summation of bit values in one row ofdiv’s in r different buckets will not affect the aggregated values of other rows. Now for eachquery keyword, the server only needs to compute a ciphertext corresponding to the sum ofthe plaintext values and return it to the client. At the client side, individual rows then canbe aggregated by extracting the sum of the desired rows from the appropriate offset in thehomomorphically-computed plaintext. With the help of this design, we can benefit from thehigh security level of homomorphic encryption while maintaining reasonable communicationcost during the query and recovery processes.

Fig. 2 gives an illustrating example of building an encrypted index using homomorphicencryption. Assume that 5 PRFs are used in our scheme, i.e., r = 5, we need to insert 2(i.e., ⌈log2(5 + 1)− 1⌉ = 2) 0’s before each bit of div. As can be seen from the figure, eachoriginal bit in a div is extended to 3 bits. To prepare the div for homomorphic encryption,the bit vector needs to be packed into blocks of 1024-bit. Note that (⌈log2(r + 1)⌉) is not

Page 10: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

10 Qian Wang1, Minxin Du1, Kui Ren2, Meiqi He3, and Jian Weng4

KeyGen(1k): Generate (skp, pkp) for the Paillier cryptosystem. Choose sk1, λ$← {0, 1}k. Generate

sk2 ← SKE.Gen(1k, λ). Output K = (skp, pkp, sk1, sk2).BuildIndex(K,d): % construct the encrypted index based on the basic index structure1. For each γ[i] (i ∈ [1,m]),

– Extract div stored in the γ[i];

– Generate d̂iv by inserting (⌈log2(r + 1)⌉ − 1)-bit 0’s before each bit of div;

– Divide d̂iv into blocks of size ⌊ 1024⌈log2(r+1)⌉⌋ × ⌈log2(r + 1)⌉;

– Pad each block into 1024-bit by attaching 0’s to the end;– Encrypt each 1024-bit plaintext block into a 2048-bit ciphertext block using homomorphic

encryption J·K;– Store the new encrypted div= Jblocki,1K|| . . . ||Jblocki,vK into γ[i], where v denotes the maximum

number of blocks.

2. Compute c← SKE.Enc(sk2,d).3. Output (γ, c).Trapdoor(K,q):1. For each wt (t ∈ [1,#q]), compute {h1(wt, sk1), . . . , hr(wt, sk1)}.2. Generate Pq = {hs(wt, sk1)|t ∈ [1,#q] and s ∈ [1, r]}.3. Output τ = Pq.Query(γ, c, τ ):1. For each wt (t ∈ [1,#q]),

– For each s ∈ [1, r], extract the encrypted div in bucket position hs(wt, sk1) and divide it into asequence of 2048-bit blocks;

– Compute

θt =

r∏s=1

Jblockhs(wt,sk1),1K|| r∏s=1

Jblockhs(wt,sk1),2K|| . . . || r∏s=1

Jblockhs(wt,sk1),vK. (9)

2. Output Θ = {θ1, . . . , θ#q}.Recovery(K,Θ):1. For each θt (t ∈ [1,#q]), parse θt into blocks and decrypt each block using skp.2. Execute the query based on the query logic (‘AND’ or ‘OR’):

1) Conjunction logic query. For each data file dj (j ∈ [1,#d]) associated with the

(⌊ (j−1)·log2(r+1)

1024⌋+ 1)-th block of each θt ∈ Θ, extract each bit bl (l ∈ [1, ⌈log2(r + 1)⌉])

bl = blockj,⌊ (j−1)·log2(r+1)

1024⌋+1]

[cl], (10)

where cl = (j − 1− ⌊ (j−1)·⌈log2(r+1)⌉1024

⌋ × 1024⌈log2(r+1)⌉ )× ⌈log2(r + 1)⌉+ l.

Convert the binary value to the decimal value

Bt = (b1b2 . . . b⌈log2(r+1)⌉)10. (11)

Add id(dj) into ID if and only if for each query keyword wt (t ∈ [1,#q]), Bt = r.2) Disjunction logic query. For each data file dj (j ∈ [1,#d]), compute score(dj ,q) = r ·1(B1 =

r) + · · ·+ r ·1(B#q = r). Add id(dj) into ID if score(dj ,#q) ≥ r. Rank the identifiers in ID basedon the similarity scores.

Fig. 1: The query processing over BEIS with homomorphic encryption.

necessarily a divisor of 1024, e.g., 1024 is not a multiple of 3. Thus, if we simply split thevector, the 1024-th bit, which we denote it by color yellow, will be assigned to the first block.However, it should belong to the next purple bit’s extended block actually. To address thisproblem, we split div into blocks of size ⌊ 1024

⌈log2(r+1)⌉⌋×⌈log2(r+1)⌉ and then pad each block

into 1024 bits (in the figure, we use color red to denote the padding 0’s). Finally, Pailliercryptosystem is applied to generate encrypted div’s.

Page 11: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

Sample Title 11

1

1

1

0

0

0

1

0

0

0

0

0

1

0

0

1

0

0

0

0

0

0

0

0

1

1

0

0

Insert 0's Pad each block into 1024-bit

div!div

Homomorphic encryption

1

0

0

0

0

0

1

0

0

1

0

0

0

0

0

0

0

0

1

0

0

0

New div

2048-bit1024-bit

1024-bit1024

log2 (r +1)

× log2 (r +1) − bit

log2 (r +1) − bit

Fig. 2: An example of building index.

Discussions. In practice, the number of keywords in a multi-keyword query, i.e., #q, isusually limited. Motivated by this fact, in BuildIndex of Fig. 1, for the conjunction querylogic we can insert (⌈log2(#qr + 1)⌉ − 1)-bit 0’s before each bit of div to accommodate thesummation of #qr 1’s. Then, the output of Query can be aggregated into a single encryptedvalue by computing Θ = θ1θ2 · · · θ#q. Finally, in Recovery, it is sufficient to add id(dj) intoID by checking if the corresponding decrypted value is equal to #qr.

Theorem 3. Searching over BEIS with homomorphic generator is equivalent to searchingover the basic index structure.

Proof. See Appendix C.

In the following security analysis, we also use leakage functions that capture preciselywhat is being leaked by the encrypted index, the ciphertexts and the query process. Notethat i) different from BEIS with random generator, BEIS with homomorphic generatorreturns encrypted file IDs to the client in the first round of interaction and ii) the adversaryadaptively generates the history such that the view of it (i.e., the index, the ciphertextsand the search tokens) can only be simulated by the leakage functions. Thus, we redescribeleakage function L2 as follows.

Definition 8. (Leakage function L2). Given d and q, L2(d, q) = {π(d, q), Ap(d, q)}, whereAp(d, q) implicitly includes the intermediate results Ediv(wi) = {ei,1, . . . , ei,r} (i = {1, . . . ,#q})before the recovery of id(cw).

Note that, Ediv(wi) (i = {1, . . . ,#q}) denotes the encrypted div extracted from buckeths(wt, sk1) (s ∈ [1, r]) (see Step 1 of Query). In the single keyword query, we simply writeEdiv(w) = {e1, . . . , er}.

The BEIS with homomorphic generator can be easily proven CKA1-secure. However,because the encryptions in the simulated index γ̃ are replaced by encryptions of randomstrings and the PRFs are replaced by random values, it is difficult for the simulator tosimulate in advance an index that will be consistent with future unknown queries (for theadversary). To overcome this problem, we slightly modify the previous construction and

Page 12: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

12 Qian Wang1, Minxin Du1, Kui Ren2, Meiqi He3, and Jian Weng4

make use of random oracles, allowing the simulator to keep track of dependencies (betweeninformation revealed by the search) and modify the outcome of search during the adversary’sexecution of the query algorithm.

The modified BEIS with homomorphic generator, which is an adaptively secure construc-

tion, is presented as follows. In KeyGen, additionally choose sk3$← {0, 1}k and output K =

(skp, pkp, sk1, sk2, sk3). In the final step of BuildIndex, store the new div= (Jblocki,1K|| . . . ||Jblocki,vK)⊕H(i, sk3) into γ[i], where H : {0, 1}∗ → {0, 1}∗ is a random oracle. In Step 1 of Trapdoor, foreach wt (t ∈ [1,#q]), compute {h1(wt, sk1), . . . , hr(wt, sk1)} and {H(h1(wt, sk1), sk3), . . . , H(hr(wt, sk1), sk3)}.In Step 2, generate Pq and Dq = {H(hs(wt, sk1), sk3)|t ∈ [1,#q] and s ∈ [1, r]}. In Step3, output τ = {Pq, Dq}. In Step 1 of Query, for 1 ≤ s ≤ r, extract div in bucket posi-tion hs(wt, sk1), compute div′ = div ⊕H(hs(wt, sk1), sk3), and divide it into a sequence of2048-bit blocks. In the following Theorem, we prove the modified BEIS with homomorphicgenerator is CKA2-secure in the random oracle model with respect to the leakage definedabove (for ease of exposition, we discuss the single-keyword query case where q = {w}, theproof of the multi-keyword query case follows the same proof strategy).

Theorem 4. The modified BEIS construction with homomorphic generator is adaptivelysecure under the CKA2 model.

Proof. See Appendix D.

4 Supporting Data Dynamics

In practice, supporting efficient data dynamics is highly required by the encrypted index,i.e., data addition and data deletion operations are supported without needing to eitherre-index the entire data collection or make use of generic and expensive dynamization tech-niques. In previous SSE solutions, the update operations may reveal a non-trivial amountof information, e.g., the single data file addition operation in both [8] and [9] will leak thenumber of keywords contained in the data file and the corresponding search tokens/kyewordhashes.

Our constructions can achieve highly privacy-preserving index updates and easily fulfillmore efficient data addition and deletion operations. Fig. 3 and Fig. 4 show the file additionprocess and the file deletion process over BEIS with random generator, respectively. Toadd data file dj , it is only required to build a sub-index γj for dj and simply merge it toγ. To delete data file dj , it is sufficient to generate a sequence of random values ζij ’s andreplace γ[i][j] in γ with it. Fig. 5 and Fig. 7 show the file addition process and the filedeletion process over BEIS with homomorphic generator, respectively. To add data file dj ,it is required to append or merge the sub-index γj for dj in the right position by checkingif #d mod (⌊ 1024

⌈log2(3r+1)⌉⌋) = 0, and carefully update γ. Note that, to support dynamic

data operations, additional 0’s should be inserted between each bit due to the fact that ourmethod for deleting files requires adding an integer t to each bit to indicate that the file isdeleted. In this case, we need to insert/reserve (⌈log2((t + 1)r + 1)⌉ − 1) 0’s in the processof building index. Also note that, t must be larger than 1 for the reason that if we chooset = 1, the bit used to be 0 will be changed into 1. Then, during searching, even if we findBt = r, it cannot be determined whether the file does contain the keyword or the r positionsin the token of the file are all 0’s before deletion. Obviously, inserting more than two 0 bitswill simply bring more computation and communication costs. Therefore, in our scheme, weset t = 2 and (⌈log2(3r + 1)⌉ − 1) 0’s will be inserted before each bit of div.

Fig. 6 shows the process of deleting a document. Suppose that the client wants to deletefile d1, which is in encrypted in the first block of each div with offset 0. To do so, the clientinitializes a 1024-bit vector λ = {0, 0, 1, 0, . . .}, encrypts λ and uses it as the DelToken. Afterreceiving the DelToken, the server multiplies the DelToken with the first block of each divto complete the update. The client can also delete several documents in a batch if they are

Page 13: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

Sample Title 13

encrypted in the same block, for example deleting files d1 and d2. Similarly, the client canalso generate the vector λ = {0, 0, 1, 0, 0, 0, 1, 0, . . .} from appropriate offset.

We next discuss the security of supporting data dynamics. It is easy to prove the non-adaptive security of a dynamic BEIS with random generator when supporting dynamicdata updates. However, to prove a dynamic BEIS with homomorphic generator is adap-tively secure is non-trivial. In the following discussion, we show how to enable a dynamicBEIS with homomorphic generator to achieve adaptive security. To do so, it is required toimplement the random oracle in a different manner. We describe the key differences as fol-

lows. In KeyGen, additionally choose sk3$← {0, 1}k and output K = (skp, pkp, sk1, sk2, sk3).

In the final step of BuildIndex, store the new div= (Jblocki,1K|| . . . ||Jblocki,vK) ⊕ Hi intoγ[i], where Hi = H(ctr1, i, 1, sk3)|| · · · ||H(ctrv, i, v, sk3), H : {0, 1}∗ → {0, 1}∗ is a ran-dom oracle, and ctr denotes a counter that starts from 1. We increase ctr when we needto generate a new H(ctr′, i, j, sk3) and the client needs to record the tuple (i, j, ctr). InStep 1 of Trapdoor, for each wt (t ∈ [1,#q]), compute {h1(wt, sk1), . . . , hr(wt, sk1)} and{Hh1(wt,sk1), . . . , Hhr(wt,sk1)}. In Step 2, generate Pq andDq = {Hhs(wt,sk1)|t ∈ [1,#q] and s ∈[1, r]}. In Step 3, output τ = {Pq, Dq}. In Step 1 of Query, for 1 ≤ s ≤ r, extract divin bucket position hs(wt, sk1), compute div′ = div ⊕ Hi, and divide it into a sequence of2048-bit blocks. In the process of adding file, if #d mod (⌊ 1024

⌈log2(3r+1)⌉⌋) = 0, compute

γj ⊕H(ctrv+1, hs(wt, sk1), v+1, sk3), append γj to γ and update H ′i = Hi||H(ctrv+1, i, v+

1, sk3); otherwise the client downloads the corresponding block firstly, compute Jblocki,pK′ =Jblocki,pK⊕H(ctrp, hs(wt, sk1), p, sk3) execute the update operations according to the rulesin Fig. 5. Finally, Jblocki,pK′ is masked with a new H(ctr′p, hs(wt, sk1), p, sk3) and uploadedto the server. The process of deleting file is carried out similarly. In the following theorem,we prove that the above dynamic BEIS with homomorphic generator is CKA2-secure withrespect to the leakage defined before.

To prove the dynamic BEIS is secure under CKA2 model, we first modify Definition 4 toaccommodate update queries. In RealBEIS,A(k), the adversary makes a polynomial numberof adaptive queries dj and for each query, receives from the challenger either a add token τa ={cj ,γj , id(dj)} or a delete token τd = {λ, id(dj)} such that τa/τd ← UpdateTokenGen(K, dj).In IdealBEIS,A,S(k), the adversary makes a polynomial number of adaptive queries dj andfor each query, the simulator is given L1(dj). The simulator returns an appropriate tokenτ̃a/τ̃d.

Theorem 5. The above dynamic BEIS construction with homomorphic generator is adap-tively secure under the CKA2 model.

Proof. See Appendix E.

Remarks. Compared with most existing SSE solutions, the salient feature of our BEISconstructions is the forward privacy guarantee. Due to the decomposable index structure,the client can build for the data files to be added an encrypted subindex which is then mergedto the main index at the database server. Obviously, the server cannot learn whether or notthe newly-added files contain a keyword that has been searched in the past, let alone thesearch tokens/hashes of keywords shared between data files during the updates. In practice,there is always a trade-off between high privacy guarantee and high search/update efficiency.In our experimental results, we illustrate this trade-off in terms of forward privacy and dataaddition efficiency in Fig. 11.

5 Theoretical Analysis and Experiments

5.1 Query Accuracy Analysis

In our BEIS constructions, there exist a small number of false positives due to probabilisticencodings/decodings. Fortunately, we show that the probability of false decodings can be

Page 14: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

14 Qian Wang1, Minxin Du1, Kui Ren2, Meiqi He3, and Jian Weng4

AddTokenGen(K, dj):1. Compute ca ← Enc(sk2, dj).2. Build the plaintext sub-index γj for dj .3. For each i ∈ [1,m], extract b = γj [i]. Encrypt b and store its encrypted version into γj [i] as

γj [i] = αb+ ζij ,

where ζij = F (sk1, i||j||ctr) by increasing ctr following the same rule in the construction of thestatic index in Section3.2.4. Generate the update token as τa as {cj ,γj , id(dj)}.Update(τa, γ, c):1. Merge/append γj to γ.2. Add cj into c.3. Output {γ′, c′}.

Fig. 3: The file addition process over BEIS with random generator

DelTokenGen(K, dj):1. For dj , re-compute ζij = F (sk1, i||j||ctr) by increasing ctr = ctr + 1. Record the updated tuple(i, j, ctr). Let γj [i] = ζij for all i ∈ [1,m].3. Generate the update token τd as {γj , id(dj)}.Update(τd, γ, c):1. For dj , replace γ[i][j] in γ with γj [i] for all i ∈ [1,m].2. Delete cj from c.3. Output {γ′, c′}.

Fig. 4: The file deletion process over BEIS with random generator

reduced to almost a negligible level by properly choosing system parameters m, r. Formally,the false positive rate of a query can be naturally defined as the ratio of the amount offalsely returned encrypted data files to the number of all encrypted data files returned fromthe cloud server. We have the following theorem to characterize the false positive rate ¬pq.

Theorem 6. The false positive rate of a query q is

¬pq =

{ expϕ(

∧wi∈q divi)+exp , for the conjunction logic

expϕ(

∨wi∈q divi)+exp , for the disjunction logic,

(13)

where ϕ(∧

wi∈q divi) and ϕ(∨

wi∈q divi) denote the number of “1”s in∧

wi∈q divi and∨

wi∈q divi,respectively, and exp denotes the expectation of the number of the falsely returned data filesfor a given query q.

Proof. See Appendix F.

Fig. 8 shows the experimental evaluation of the false positive rate of the query results.In the experiment, we set #d = 105 with #w = 8000. Then, we first build different indicesby using different values of m and r. For each index, we execute two types of queries overthe index and each data point in the figure is the average of 103 queries. As shown from theresults, the false positive rate is extremely small or even negligible by choosing suitable mand r. Note that, to obtain a low and acceptable false positive rate, it is required to properlychoose different settings of m and r when #d and #w vary much. To get rid of rescaling thewhole index during the update process, we conservatively set the parameters m and r to berelatively large values at the cost of minor variation in false positive rate. From a practicalpoint of view, it implies our index constructions achieve desirable privacy guarantee withoutsacrificing the query accuracy and space efficiency much.

Page 15: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

Sample Title 15

AddTokenGen(K, dj):1. Compute cj ← Enc(sk2, dj).2. Build the plaintext sub-index γj for dj .3. For each i ∈ [1,m], insert (⌈log2(3r + 1)⌉ − 1) 0’s before the bit stored in γj [i].4. If #d mod (⌊ 1024

⌈log2(3r+1)⌉⌋) = 0, for each i ∈ [1,m],

– Pad the bit vector in γj [i] to be 1024 bits by attaching 0’s to the end;– Encrypt it using Paillier cryptosystem;

Else, for each i ∈ [1,m],

– Generate a 1024-bit zero vector;– Set the (z + t − 1)-th bit to be γj [i][t] for each t ∈ [1, ⌈log2(3r + 1)⌉], where z = (#d

mod (⌊ 1024⌈log2(3r+1)⌉⌋))×⌈log2(3r+1)⌉+1 denotes the first ‘invalid’ 0 in the last block Jblocki,vK

(see BuildIndex in Fig. 1).

Encrypt the m vectors using Paillier cryptosystem.6. Generate the update token as τa as {cj ,γj , id(dj)}.Update(τa, γ, c):1. If #d mod (⌊ 1024

⌈log2(3r+1)⌉⌋) = 0,

– Append γj to γ.

Else,

– Merge γj to γ by computing Jγj [i]K · Jblocki,vK for each i ∈ [1,m].

2. Add cj into c.3. Output {γ′, c′}.

Fig. 5: The file addition process over BEIS with homomorphic encryption

5.2 Experimental Performance Evaluation

In this section, we implement and evaluate our SSE schemes using a large representativedataset. We first summarize in Table 1 the differences between our schemes and other no-table searchable encryption schemes in the existing literature. Our BEIS schemes can executemulti-keyword boolean search while achieving forward privacy. Note that, forward privacyimplies a strong privacy guarantee during file additions such that the server cannot learnwhether or not the new file contains a keyword we searched for in the past and the searchtokens/keyword hashes of all keywords in the new file are not leaked. In BEIS-II, we cantrade-off the time efficiency of file additions for forward privacy. As shown in Fig. 11, ourexperiments show that if forward privacy is not required, much more efficient file additionsare obtained. As for file deletions, both BEIS-I and BEIS-II do not leak additional infor-mation. To the best of our knowledge, no other schemes achieve all desirable propertiessimultaneously. Note that, BEIS-I and BEIS-II achieve CKA1 and CKA2 security, respec-tively. We emphasize that, as we discussed before the BEIS-I’ storage for ctr informationat the client side can be negligible if the value of ζyj is properly chosen. The theoreticalcomputation cost of our schemes are both O(#d) during the search process. While the costsare linear in the number of documents, the actual time cost of add operations of randomstrings in BEIS-I is almost negligible and that of Paillier encrypt operations in BEIS-II areacceptable in practice. For example, assuming that the variable #d is 500000, the numberof buckets m is 20000, and the number of keywords in a multi-keyword query #q = 5,the space costs of the encrypted index of BEIS-I and that of BEIS-II (Paillier modular is2048bit) are 100GB and 7.5GB, respectively. Besides, nearly 5MB and 1.88MB will be in-troduced by BEIS-I and BEIS-II respectively in terms of bandwidth cost. In both add anddelete operations of BEIS-I and BEIS-II, the computation complexity is O(m), where m is

Page 16: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

16 Qian Wang1, Minxin Du1, Kui Ren2, Meiqi He3, and Jian Weng4

Homomorphic encryption

1

0

0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

1

0

0

0

0

0

Homomorphic encryption{t = 2

Multiply

DelToken

Index on the cloud New index

Server side

Data owner side

2048-bit10 4-bit

(3r +1) − bit

2048-bit

Fig. 6: An example of deleting file.

DelTokenGen(K, dj):1. For dj , generate a 1024-bit zero vector and set the (c⌈log2(3r+1)⌉−1)-th bit to 1, where

ct = (j − 1− ⌊ (j − 1) · ⌈log2(3r + 1)⌉1024

⌋ × 1024

⌈log2(3r + 1)⌉ )× ⌈log2(3r + 1)⌉+ t (12)

for t ∈ [1, ⌈log2(3r + 1)⌉].2. Encrypt the vector using homomorphic encryption to obtain a 2048-bit ciphertext denoted by λ.3. Generate the update token τd as {λ, id(dj)}.Update(τd, γ, c):

1. For each i ∈ [1,m], update γ by computing JλK · Jblocki,pK, where p = ⌊ (j−1)·⌈log2(3r+1)⌉1024

⌋+ 1.2. Delete cj from c.3. Output {γ′, c′}.

Fig. 7: The file deletion process over BEIS with homomorphic encryption

Fig. 8: The false positive rate of the query results.

Page 17: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

Sample Title 17

the number of buckets. As for the communication cost, only 0.2MB will be introduced foreach added/deleted document when m is 20000 in BEIS-I. Note that, although one moreround of communication is needed in BEIS-II during the file delete operation, as many as341 (Paillier modular is 2048) documents can be packed in a packet to be updated in onetime, and thus the communication bandwidth is only 10.24MB when m equals 20000.

Scheme Multi-KW Search Client Index Update Update Forward SecuritySearch Time State Info. Size Cost Leakage Privacy Strength

RBT-SSE [12] No O(#dw · log#d) No O(#d ·#w) O(#w · log#d) ID(w) No CKA2

OXT [13] Yes O(#dw) S-terms O(N +#w) - - No CKA2

DSSE [9] No O(#dw) No O(N +#w) O(|d|) ID(w) No CKA2

SP14 [15] No O(#dw · log3 N) O-sort O(N) O(|d| · log2 N) No Yes CKA2

HK14 [14] No O(N) History O(N +#w) O(|d|) ID(w) No CKA2

BEIS-I Yes O(#d) No O(#d ·m) O(m) No Yes CKA1

BEIS-II Yes O(#d) No O(#d ·m) O(m) No Yes CKA2

Table 1: An overview of complexities for the state-of-the-art searchable encryption schemes in asingle-thread and single-keyword query setting. #dw is the number of data files that contain thekeyword w, #d is the number of data files in the file collection, N is the number of total document-keyword pairs, #w is the number of keywords in the keyword collection, |d| is the number of uniquekeywords in data file d, m is the number of buckets, ID(w) is a deterministic identifier of the keywordw, i.e., the keyword hashes/search tokens of keywords in the file to be updated. Note that, RBT-SSE [12], DSSE [9] and HK14 [14] require the one-round query, OXT [13] and SP14 [15] require themulti-round query due to the use of matching protocol and the o-sort operation, respectively. Forour schemes, BEIS-I requires the one-round query, while BEIS-II requires the two-round query.

The following experiments are implemented in Java 7. Either the operations by the server,or operations performed by the client, were executed on an Intel Core i5-4440M 2.80GHzCPU with 12GB RAM running onWindows 8. To minimize the I/O access time, we randomlygenerate #d #w-bit {0,1} vectors to simulate the database in the main memory during theexperiments. As #d denotes the number of data files (vectors) in the database, the i -th bitof the vector accounts for keyword wi (i ∈ {1, . . . ,#w}). Specifically, if the bit at positioni of a vector is 1, it means this file contains the keyword wi. Note that, all the real-worlddatasets can be pre-processed to such format. In our experiment, we choose #d rangingfrom 213 to 217, and we choose #w ranging from 6000 to 10000. In the following discussion,we use BEIS-I and BEIS-II to denote the dynamic BEIS with random generator and theBEIS with homomorphic generator, respectively.

In the experiments, we set ζ’s to be 80-bit random values to encrypt the index in BEIS-Iand Paillier homomorphic encryption system (with Paillier Modulus 1024 bits or 2048 bits)to encrypt the index in BEIS-II, respectively. In addition, we use the HMAC-SHA1 for PRFsand Random Oracle and the NIST 224p elliptic curves for group operations. For simplicity,we omitted time measurements for encrypting and decrypting the original data files, sincethe time cost of such operations are well studied.

Fig. 9 shows the time costs of index construction process, which is one-time cost. In theleft figure, we set #w = 8000 while in the right figure, we set #d = 215. In BEIS-I, duringthe index construction, we use a 80-bit random number to mask each entry in div’s, whichleads to only a small computation cost, while during the index construction of BEIS-II, theuse of Paillier homomorphic encryption system to encrypt each div leads relatively highercomputation cost.

Fig. 10 shows the query performance of our BEIS schemes. Note that, the time costof query process consists of the time cost of generating trapdoors (tokens) on the clientside and the time cost of file searching on the server side. Note that our BEIS schemes cannaturally support multi-keyword search, without needing to post-process of all results of

Page 18: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

18 Qian Wang1, Minxin Du1, Kui Ren2, Meiqi He3, and Jian Weng4

single-keyword queries. In the left figure, we evaluate the single keyword query performancewith the increase of #d. In the right figure, we evaluate the performance of the multi-keyword query. Since our solutions have nearly the same computation cost for disjunctionlogic and conjunction logic, and thus for simplicity, we only evaluate the performance ofmulti-keyword query in ‘AND’ logic.

Fig. 11 shows the time costs of addition and deletion operations over BEIS-I and BEIS-II. Obviously, the update costs of file updates for BEIS-I increase with the number of datafiles to be updated, and both addition and deletion operations are very efficient. However,due to the unique encryptions of index in BEIS-II, the time costs of file updates remaininvariant. That implies that our BEIS-II better applies to updates of a bunch of data filessimultaneously. During the process of file addition, we measure the time costs of BEIS-IIwith and without achieving forward privacy respectively. For the latter case, we do not needto generate m encrypted value for every buckets. In fact, for each file to be inserted, theclient can only compute a single encrypted value of a 1024-bit vector and send the 2048-bitciphertext to the server together with the position information of the buckets to be changed.We can greatly reduce the time costs for generating adding token if forward privacy is notrequired in practice.

2^13 2^14 2^15 2^16 2^170

2

4

6

8

10

12

14

16x 10

4

Number of data files

Inde

x co

nstr

uctio

n tim

e (s

)

BEIS−IBEIS−II 1024BEIS−II 2048

#w = 8000

6000 7000 8000 9000 100000

0.5

1

1.5

2

2.5

3

3.5

4

4.5x 10

4

Number of keywords

Inde

x co

nstr

uctio

n tim

e (s

)

BEIS−IBEIS−II 1024BEIS−II 2048

#d =215

Fig. 9: The average time cost of index construction.

2^13 2^14 2^15 2^16 2^170

1000

2000

3000

4000

5000

6000

7000

8000

9000

Number of data files

Tim

e co

st o

f sin

gle−

keyw

ord

quer

y (m

s)

BEIS−IBEIS−II 1024BEIS−II 2048

#w = 8000

1 2 3 4 50

2000

4000

6000

8000

10000

12000

Number of query keywords in a multi−keyword query

Tim

e co

st o

f mul

ti−ke

ywor

d qu

ery

(ms)

BEIS−IBEIS−II 1024BEIS−II 2048

#d =215

Fig. 10: The average time cost of query process.

6 Conclusion

In this paper, we introduced a suite of new and novel SSE index designs for processing queriesover large-scale encrypted databases. Our index constructions made trade-offs between query

Page 19: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

Sample Title 19

1 10 20 30 40 5010

0

101

102

103

104

105

106

Number of files to be added

Tim

e co

st o

f file

add

ition

(m

s)

BEIS−IBEIS−II 2048 (without forward privacy)BEIS−II 2048 (with forward privacy)

#d =215#w = 8000

1 10 20 30 40 500

100

200

300

400

500

600

700

800

Number of files to be deleted

Tim

e co

st o

f file

del

etio

n (m

s)

BEIS−IBEIS−II 2048

#w = 8000

#d =215

Fig. 11: The average time cost of file updates.

efficiency and query privacy, with flexible and comprehensive query functionalities. Throughrigorous security analysis under strong security model and extensive experiments on real-word datasets, we demonstrated the effectiveness and practicality of our constructions.

References

1. Hacigms, H., Lyer, B., Li, C., Mhrotra, S.: Executing sql over encrypted data in the database-server-provider model. In: SIGMOD, ACM (2002) 216–227

2. Kamara, S., Lauter, K.: Cryptographic cloud storage. In: Financial Cryptography Workshops.(2010) 136–149

3. Gonzalez, L.M.V., Rodero-Merino, L., Caceres, J., Lindner, M.A.: A break in the clouds:Towards a cloud definition. Computer Communication Review (CCR) 39(1) (2009) 50–55

4. Song, D.X., Wagner, D., Perrig, A.: Practical techniques for searches on encrypted data. In:IEEE Symposium on Security and Privacy, IEEE (2000) 44–55

5. Goh, E.J.: Secure indexes. IACR (2003)6. Chang, Y., Mitzenmacher, M.: Privacy preserving keyword searches on remote encrypted data.

In: ACNS. (2005) 442–4557. Curtmola, R., Garay, J.A., Kamara, S., Ostrovsky, R.: Searchable symmetric encryption: Im-

proved definitions and efficient constructions. In: ACM CCS, ACM (2006) 79–888. van Liesdonk, P., Sedghi, S., Doumen, J., Hartel, P.H., Jonker, W.: Computationally efficient

searchable symmetric encryption. In: Workshop on Secure Data Management, ACM (2010)87–100

9. Kamara, S., Papamanthou, C., Roeder, T.: Dynamic searchable symmetric encryption. In:ACM CCS, ACM (2012) 965–976

10. Li, J., Wang, Q., Wang, C., Cao, N., Ren, K., Lou, W.: Fuzzy keyword search over encrypteddata in cloud computing. In: INFOCOM, IEEE (2010) 441–445

11. Cao, N., Wang, C., Li, M., Ren, K., Lou, W.: Privacy-preserving multi-keyword ranked searchover encrypted cloud data. In: INFOCOM, IEEE (2011) 829–837

12. Kamara, S., Papamanthou, C.: Parallel and dynamic searchable symmetric encryption. In:Financial Cryptography. (2013) 258–274

13. Cash, D., Jarecki, S., Jutla, C.S., Krawczyk, H., Rosu, M.C., Steiner, M.: Highly-scalablesearchable symmetric encryption with support for boolean queries. In: CRYPTO. (2013) 353–373

14. Hahn, F., Kerschbaum, F.: Searchable encryption with secure and efficient updates. In: Proc.of ACM CCS’14. (2014) 310–320

15. Stefanov, E., Papamanthou, C., Shi, E.: Practical dynamic searchable encryption with smallleakage. In: Proc. of NDSS’14. (2014)

16. Wong, W.K., Cheung, D.W.L., Kao, B., Mamoulis, N.: Secure knn computation on encrypteddatabases. In: SIGMOD, ACM (2009) 139–152

17. Wang, B., Yu, S., Lou, W., Hou, Y.T.: Privacy-preserving multi-keyword fuzzy search overencrypted data in the cloud. In: Proc. of IEEE INFOCOM’14. (2014)

18. Paillier, P., Pointcheval, D.: Efficient public-key cryptosystems provably secure against activeadversaries. In: Proc. of ASIACRYPT’99. (1999) 165–179

Page 20: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

20 Qian Wang1, Minxin Du1, Kui Ren2, Meiqi He3, and Jian Weng4

A Proof of Theorem 1

Proof. i) For the conjunction logic query, as shown in Eq. (2), divτ [j] = 1 if and only ifγ[y][j] = 1 for each y ∈ Pq. Similarly, in BEIS, for ∀j ∈ [1,#d], if there exists y ∈ Pq

such that the original plaintext bit value of γ[y][j] is ‘0’, then∑

y∈Pqγ[y][j] < stq,j since

γ[y][j] = α · 0 + ζyj < α · 1 + ζyj . Therefore, id(dj) is added to ID if∑

y∈Pqγ[y][j] = stq,j ,

i.e., γ[y][j] = 1 ∀y ∈ Pq.ii) For the disjunction logic query, let byj be the original bit value stored in γ[y][j] ∀y ∈ Pq

and ∀j ∈ [1,#d]. ∀j ∈ [1,#d], we have

score(dj ,q) = r · 1(∑

y∈{τ [1],...,τ [r]}

γ[y][j]− stq,j [1] = 0), . . . ,

r · 1(∑

y∈{τ [#qr−r+1],...,τ [#qr]}

γ[y][j]− stq,j [#q] = 0)

= r · 1(∑

y∈{τ [1],...,τ [r]}

(αbij + ζyj)− (∑

y∈{τ [1],...,τ [r]}

ζyj + rα) = 0), . . . ,

r · 1(∑

y∈{τ [#qr−r+1],...,τ [#qr]}

(αbij + ζyj)− (∑

y∈{τ [#qr−r+1],...,τ [#qr]}

ζyj + rα) = 0)

= r · 1(∑

y∈{τ [1],...,τ [r]}

bijα− rα = 0), . . . ,

r · 1(∑

y∈{τ [#qr−r+1],...,τ [#qr]}

bijα− rα = 0)

(14)

We can conclude that score(dj ,q) is proportional to the similarity score shown in Eq. (3).This completes the proof.

B Proof of Theorem 2

Proof. We describe a polynomial-time simulator S such that for any PPT adversary A,the outputs of RealBEIS,A(k) and IdealBEIS,A,S(k) are computationally indistinguishable.Consider the simulator S that simulates an encrypted index γ̃, a ciphertext collection c̃ andany search token τ̃ as follows:

– (Setting up internal data structure) Given L1(d), let iγ be an empty bucket array ofsize m, where m is the number of buckets in γ. Let ϕs be a bijection mapping bucketpositions in iγ to positions in the simulated index γ̃.

– (Simulating γ̃) Given L1(d) and L2(H), S generates an empty bucket array of size m,and randomly chooses a non-zero l-bit integer α̃. For each query w ∈ Q = {w1, . . . , wg},S randomly selects r buckets in iγ and marks them with id(w) (for ease of exposition,we assume each keyword has a unique ID, and a bucket may be marked with multiplekeyword IDs due to the probabilistic mapping). For each bucket i ∈ [1,m] and each

data file j ∈ [1,#d] select a l-bit ζ̃ij uniformly at random following the same rule inthe construction of the static index in Section3.2. If the bucket i is marked with anykeyword’s id(w), and we can know from the access pattern Ap(d,Q) that data file j

contains this keyword. Then we set γ̃[i][j] = α̃+ ζ̃ij . Otherwise, set γ̃[i][j] = ζ̃ij . Outputγ̃.

– (Simulating τ̃ ) Given L2(H), S generates the simulated search token as

τ̃ = {P̃w = ϕs(id(w)), s̃tw},

Page 21: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

Sample Title 21

where

s̃tw,j =∑y∈P̃w

ζ̃yj + rα̃.

– (Simulating c̃) given L1(d), S generates sk2 ← SKE.Gen(1k, λ′), where λ′ is a randomk-bit string. For all i ∈ {1, . . . ,#d}, it computes c̃i ← SKE.Enc(sk2, {0, 1}|ci|).

It follows by the above construction that searching on γ̃ using τ̃ will yeild the expectedsearch outcomes as searching over γ using τ .

In summary, the indistinguishability of γ̃ from γ follows from the from the pseudo-randomness of F . The indistinguishability of τ̃ from τ follows from the pseudo-randomnessof h1, . . . , hr. The indistinguishability of c̃ from c follows from the CPA-security of SKE.

C Proof of Theorem 3

Proof. i) For the conjunction logic query, as shown in Eq. (11), for query keyword wt (t ∈[1,#q]), we consider data file dj (j ∈ [1,#d]). It is easy to see that Bt = r if and only ifγbasic[s][j] = 1 for each bucket position s ∈ {τ [t][1], . . . , τ [t][r]} because

Bt =∑

s∈{τ [t][1],...,τ [t][r]}

γbasic[s][j] (15)

Then, if for all query keywords wt (t ∈ [1,#q]), Bt = r, we conclude that data file djcontains all the keywords in query. Thus, id(dj) is added to ID.ii)For the disjunction logic query, we have

score(dj ,q) = r · 1(B1 = r) + · · ·+ r · 1(B#q = r)

= r · 1(∑

s∈{τ [1],...,τ [r]}

γ[s][j] = r) + · · ·+ r · 1(∑

s∈{τ [#qr−r+1],...,τ [#qr]}

γ[s][j] = r)

= r · 1(∧

s∈{τ [1],...,τ [r]}

γ[s][j] = 1) + · · ·+ r · 1(∧

s∈{τ [#qr−r+1],...,τ [#qr]}

γ[s][j] = 1)

(16)

We can conclude that score(dj ,q) is equal to the similarity score shown in Eq. (3). Thiscompletes the proof.

D Proof of Theorem 4

Proof. We describe a polynomial-time simulator S such that for any PPT adversary A,the outputs of RealBEIS,A(k) and IdealBEIS,A,S(k) are computationally indistinguishable.Consider the simulator S that adaptively simulates an encrypted index γ̃, a ciphertextcollection c̃ and any search token τ̃ as follows:

– (Setting up internal data structure) given L1(d) = {id(c1), . . . , id(c#c), |c1|, . . . , |c#c|, c,γ},let iγ be an empty bucket array of sizem. Let ϕs be a bijection mapping bucket positionsin iγ to bucket positions in the simulated index γ̃.

– (Simulating γ) given L1(d), S generates an empty bucket array of size m. It thengenerates (sk′p, pk

′p) for the Paillier cryptosystem and, for each bucket, it chooses v

random strings {0, 1}1024, encrypts each of them using homomorphic encryption under(sk′p, pk

′p) and stores the encrypted blocks in the bucket. Finally, for each bucket, it

chooses a random string of size (v× 2048)-bit and updates the ciphertexts in the bucketby XORing them with the random string. Output γ̃.

Page 22: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

22 Qian Wang1, Minxin Du1, Kui Ren2, Meiqi He3, and Jian Weng4

– (Simulating c) given L1(d), S generates sk4 ← SKE.Gen(1k, λ′), where λ′ is a randomk-bit string. For all i ∈ {1, . . . ,#d}, it computes c̃i ← SKE.Enc(sk4, {0, 1}|ci|).

– (Simulating τ ) given L2(d, w) = {π(d, w), Ap(d, w)} (note that q = {w} for the singlekeyword query), S works as follows. Assume each query keyword w has a unique ID,i.e., id(w). There are two possible cases:Case 1. If the keyword w has never appeared before, in which case w/id(w) has notappeared in the search pattern π(d, w), S updates its iγ to mark a sequence of positionsfor w/id(w). To do this, it checks if any element of Ediv(w) in the access pattern hasbeen stored in a bucket of iγ. If the answer is Yes, it adds and marks w/id(w) to theposition of this bucket. Assume there are δ such overlapping buckets, the simulatorchooses (#Ediv(w)− δ) unused buckets in iγ at random, marks them with w/id(w) andstores in each of these buckets an encrypted div from Ediv(w) in the access pattern.Case 2. If the keyword w has appeared before, S searches iγ and locates for bucketsmarked with id(w) (note that, a bucket may be marked with multiple keyword IDs) andretrieves the encrypted div’s previously stored for w.

For both cases, the simulator returns the simulated search token τ̃ in the following form:

τ̃ =(ϕs(id(w)),H(γ̃[ϕs(id(w))], s̃k3)

),

where ϕs(id(w)) denotes bucket positions in γ̃ mapped from bucket positions markedwith id(w) in iγ. Thus, γ̃[ϕs(id(w))] denotes the set of simulated encrypted div’s in bucket

positions ϕs(id(w)) of γ̃. Here, s̃k3 is a k-bit random string. Given query (γ̃[ϕs(id(w))], s̃k3)on the random oracle, the simulator programs the response of the random oracle andreturns

H(γ̃[ϕs(id(w))], s̃k3) = γ̃[ϕs(id(w))]⊕ iγ[id(w)],

where iγ[id(w)] denotes the set of encrypted div’s extracted from Ediv(w) in the accesspattern.In summary, the indistinguishability of γ̃ from γ follows from the CPA-security of Pail-lier’s cryptosystem. The indistinguishability of c̃ from c follows from the CPA-securityof SKE. The indistinguishability of τ̃ from τ follows from the pseudo-randomness ofh1, . . . , hr.

E Proof of Theorem 5

Proof. – (Simulating add tokens τ̃a) given L1(dj), S generates sk4 ← SKE.Gen(1k, λ′),where λ′ is a random k-bit string. Computes c̃j ← SKE.Enc(sk4, {0, 1}|cj |). S generatesan empty bucket array of size m. For each bucket, it chooses a random strings {0, 1}1024,encrypts each of them using homomorphic encryption under (sk′p, pk

′p) and stores the

encrypted blocks in the bucket. Finally, for each bucket, it chooses a random stringof size 2048-bit and updates the ciphertexts in the bucket by XORing them with therandom string. Output τ̃a = (c̃j , γ̃j , id(dj)).After returning the token, the simulator updates its index as follows.1. If #d mod (⌊ 1024

⌈log2(3r+1)⌉⌋) = 0,

• Append γ̃j to γ̃.Else,• Substitute the last block with γ̃j for each i ∈ [1,m].

2. Add c̃j into c̃.3. Output {γ̃′, c̃′}.

– (Simulating delete tokens τ̃d) given L1(dj), S chooses a random strings {0, 1}1024, en-crypts each of them using homomorphic encryption under (sk′p, pk

′p) and stores the

encrypted blocks for each bucket. Finally, for each bucket, it chooses a random string

Page 23: Efficient Indexing and Query Processing for Secure Dynamic ...nisplab.whu.edu.cn/paper/SSE-TR-2015.pdf · single-keyword search, the direct extension to multi-keyword search requires

Sample Title 23

of size 2048-bit and updates the ciphertexts in the bucket by XORing them with therandom string. Output τ̃d = (λ̃, id(dj)).

After returning the token, the simulator updates its index by substituting Jb̃locki,pK withλ̃. Finally, S deletes c̃j from c̃.

F Proof of Theorem 6

Proof. We compute exp by first discussing the tricky case where q ⊂ w. Essentially, thefalse returning of each encrypted data is independent of other encrypted data files. Thus,for each data dj ∈ d, we use pj to denote the probability of that dj is falsely returned.

For ease of presentation, we start from a single keyword wi (wi ∈ q but wi ̸∈ dj). Sincewi ∈ w, divi should have been encoded into the r buckets γ[hi(wi, sk1)] for i ∈ [1, r].Based on our bucket based encoding/decoding method, The false decoding of divi hap-pens if and only if all r buckets are encoded by data identifier vectors of other key-words besides that of wi. Without loss of generality, we assume there are r additionaldata identifier vectors {d̂iv1, . . . , d̂ivr} which are repeatedly encoded into the r buckets{γ[h1(wi, sk1)], . . . ,γ[hr(wi, sk1)]}, respectively. Here, each d̂ivi might be the joint of mul-tiple data identifier vectors. In the decoding process, divτ is decoded from the r buckets{γ[h1(wi, sk1)], . . . , γ[hr(wi, sk1)]} = {divi

∨d̂iv1, . . . , divi

∨d̂ivr}. We denote p̂j as the

probability of wi which is falsely considered as a membership of dj . Let nj be the numberof distinct keywords in dj , we have

p̂j = (1− (1− 1

m)rnj )r ≈ (1− e

−rnjm )r. (17)

Now we discuss pj as follows: 1) The conjunction logic query. For ∀dj ∈ d, dj will befalsely returned if and only if all keywords in q are hit by dj while there exists at least onekeyword in q which is actually not contained in dj . By contradiction, dj will not be falselyreturned if there always exists a keyword in q which is not contained and hit by dj . Thus,we have pj = 1 − (1 − p̂j) = p̂j . 2) The disjunction logic query. Similarly, for ∀dj ∈ d, djwill be falsely returned if there exists at least one keyword in q that is hit by dj , where allkeywords in q are not contained in dj . By contradiction, dj will not be falsely returned ifnone of the keywords in q is hit by dj . Thus, we have pj = 1− (1− p̂j)

#q.Then we can compute the expectation as

exp =∑dj∈d̂

pj , (18)

where d̂ denotes the collection of data files which might be falsely returned. In the conjunc-tion logic query, #d̂ = #d−ϕ{

∧wi∈q divi}, and #d̂ = #d−ϕ{

∨wi∈q divi} in the disjunction

logic query. When considering the case that wi ̸∈ w and the false positive decoding happens,the returning of any data files is a false decoding. In this case, we compute exp over d, i.e.,d̂ = d. Based on exp, we can easily compute ¬pq as Eq. (13).