Page 87 - Read Online
P. 87
Page 80 Salmani et al. J Surveill Secur Saf 2020;1:79–101 I http://dx.doi.org/10.20517/jsss.2020.16
1 INTRODUCTION
With the advent of cloud computing, data owners are motivated to outsource their personal data from lo-
cal repositories to the cloud to increase flexibility, convenience and to reduce the burden of the local data
maintenance costs. However, the privacy of the data must be preserved against unwanted, unauthorized, and
malicious accesses from outside attackers and unapproved insiders, including systems such as cloud servers.
To access the data, unlike external attackers that must develop smart and subtle ways to circumvent security
firewalls and access-control mechanisms, the cloud accesses data directly, which is the reason it is considered
a primary privacy threat.
Encrypting the data before it is transferred to the cloud protects the data to some extent, but at the cost of
efficiency. Posing queries on the encrypted data so it is searchable is one the current open challenges. Even
though Searchable Symmetric Encryption (SSE) enables search on ciphertext, these schemes support only
Boolean keyword search, i.e., whether a keyword exists in a document or not [1–7] . Moreover, these schemes
[8]
must strike a balance between security and efficiency . Consequently, the documents retrieved may be in-
correct or incomplete which consumes more time, computation power, and network bandwidth. Conversely,
disclosing more information to the cloud to increase accuracy and query efficiency, leads to further privacy
exposure.
Multi-keyword ranked search over encrypted cloud data (MRSE) schemes have been proposed [9–11] to (1)
acquire result relevance ranking, instead of returning undifferentiated results; and (2) improve search result
accuracy by supporting multiple keywords search instead of single keyword search which often yields in unac-
ceptably coarse results. Consequently, each keyword refines the results further. Moreover, the cloud is able to
rank the results based on their relevance and returns the top-k most relevant data items which causes less net-
workbandwidthconsumption, increaseddatauser’ssatisfaction, andishighlydesirableinthe“pay-as-you-use”
[9]
cloud paradigm .
Although MRSE schemes [9–11] are helpful and allow a user to search and retrieve the documents of interest,
they suffer from private information leakage. Query algorithms for existing MRSE schemes are mostly deter-
ministic, which means the same keyword can be used for the same type of queries. Thus, the attacker is able to
determine whether the keywords retrieved by two queries are the same [12] (search pattern attack). Moreover,
some words often co-occur with other words so the attacker can determine keywords with similar term of
distribution (co-occurrence attack) [11] . For instance, the bigram “of the” occurs much more frequently than
any other bigrams in English language [13] ; and possessing some auxiliary knowledge can disclose more infor-
mation about the co-occurring terms. For example, the term “united” is very likely to co-occur with “states”
in White House official paperwork. Thus, identifying a term is not difficult when the attacker knows the corre-
sponding co-occurring term in the ciphertext. Further, documents in the same category share a considerable
overlap of terms and keywords so information leakage in one document can lead to privacy violation in other
documents. In addition, during each search, tracking the keywords and the corresponding retrieved docu-
ments can leak more information (access pattern attack). Thus, to preserve the data privacy, this information
must be hidden from untrusted, unapproved, and unauthorized parties.
This paper tackles the problem of leakless privacy preserving multi-keyword ranked search over encrypted
cloud data (LRSE), and we solve the problem of search pattern, and co-occurrence (for the first time among
multi-keywords SSE schemes) private information leakage. To capture the relevance between documents and
thesearchquery, weemploysecureinner-productsimilaritythatprovidessufficientsearchaccuracy [14] . Inour
approach, the documents and the search queries are described as binary vectors where each bit represents the
existence of the corresponding keyword in the document/search query. Thus, the similarity can be measured
bytheinner-productofthequeryanddocumentvector [14] . Thedistributionofthekeywordsinthedocuments
is not uniform and decreases uncertainty (entropy) of the accessed documents. To address this problem, the