Page 99 - Read Online
P. 99
Page 92 Salmani et al. J Surveill Secur Saf 2020;1:79–101 I http://dx.doi.org/10.20517/jsss.2020.16
5.2 Privacy attacks
In Section 3 we identified two private information leakage (privacy attacks). In this section we explain how
LRSE prevents search pattern and co-occurrence attack. Our schemes are not designed to preserve the user
privacy against access pattern attack and we leave it as future work.
InLRSE with each query the data user is able to randomlychoose a portion of the ciphertextsfor each keyword.
Thus, even if consecutive queries share some of their keywords, the cloud is not able to find a pattern between
the queries due to using multiple versions of ciphertexts in each query. Further, co-occurring terms appear
with different ciphertexts, so, finding the co-occurring terms becomes significantly more difficult for the cloud
(see Section 4).
Assume, φ i is the number of available ciphertexts for keyword w i, and in each query the query generator uses
( )
β percent of the φ i. The number of possible permutations Γ that the query generator can employ is: Γ = φ i .
β×φ i
For example, if we have φ i = 10 available ciphertexts for keyword w i, and the query generator employs 40
( )
10
percent of the ciphertexts in each query (β = 40%) the number of possible permutation is: Γ = = 210.
4
This means, there are 210 distinct possible choices for the query algorithm to ask for the same keyword. In
other words, the probability of having 2 queries with same permutation of the ciphertexts for w i is 0.047% ( 1
210
). Note that the main idea is cloud sees all of the keywords are queried with the same probability even if the
user starts requesting for the same keyword multiple times.
5.3 Result completeness
To apply the LRSE scheme we first extract keywords from documents in corpus. Technically, keywords with
highfrequencygetextractedconsideringsomelinguisticknowledgetoavoidstopwordssuchas“the”, “for”, “if”,
and etc. [23] . We then generate the φ vector based on the minimum value among the average of each keyword
in the entire corpus. Further, we assume the random number generator is fair.
Considering all of the aforementioned, there is still a rare chance of incomplete results on the paper. Assume
the term frequency of the keyword w i in document D is f i, and the number of the available ciphertexts for the
corresponding keyword is φ i. As long as f i ≥ φ i there is no problem (case 1), because the corresponding C
(encrypted D) contains all of the available ciphertexts and no matter which ciphertext versions are employed
in the user query, C (and consequently D) will be placed among the possible results.
If f i ≤ φ i the user query may contain encrypted versions of w i that do not exist in D (case 2). Recall that
along with the query, the user also sends the parameter k to retrieve only the top-k documents. In case 2 if k
is relativity less than the number of documents that includes w i there is still no problem because D’s relevance
score to the query is very low and even if D had all of the encrypted versions of w i, it would not be placed
among the top-k files (case 3). Note that number of documents that f i ≤ φ i should be very low, otherwise the
keyword extraction algorithm would not choose w i as a keyword in the first place.
The problem occurs when the user asks for all of the documents that includes keyword w i (case 4). In this
condition, the results may be incomplete, because the user employs a portion of the available ciphertexts which
may not be used in D. For example, consider we have a corpus with four documents (d 1 , d 2 , d 3 , d 4), and the
term frequency of keyword w i in each document is (25,32,6,29). Also, consider that the number of available
ciphertexts φ i = 10 and the portion is set to forty percent for query q (e.g., forty percent of the available
ciphertexts are employed to generate q by the query generator). Thus, the query generator randomly selects
four versions of available ciphertexts (forty percent of φ i). No matter what versions of the ciphertexts are
employed in the query q, d 1, d 2, and d 4 will be placed among the possible results because the term frequency of
keyword w i in those documents are greater than φ i, so those documents possess all of the available ciphertexts