Page 99 - Read Online
P. 99

Page 92                 Salmani et al. J Surveill Secur Saf 2020;1:79–101  I http://dx.doi.org/10.20517/jsss.2020.16



               5.2  Privacy attacks
               In Section 3 we identified two private information leakage (privacy attacks). In this section we explain how
               LRSE prevents search pattern and co-occurrence attack. Our schemes are not designed to preserve the user
               privacy against access pattern attack and we leave it as future work.


               InLRSE with each query the data user is able to randomlychoose a portion of the ciphertextsfor each keyword.
               Thus, even if consecutive queries share some of their keywords, the cloud is not able to find a pattern between
               the queries due to using multiple versions of ciphertexts in each query. Further, co-occurring terms appear
               with different ciphertexts, so, finding the co-occurring terms becomes significantly more difficult for the cloud
               (see Section 4).


               Assume, φ i is the number of available ciphertexts for keyword w i, and in each query the query generator uses
                                                                                                     (   )
               β percent of the φ i. The number of possible permutations Γ that the query generator can employ is: Γ =  φ i  .
                                                                                                      β×φ i
               For example, if we have φ i = 10 available ciphertexts for keyword w i, and the query generator employs 40
                                                                                                 ( )
                                                                                                  10
               percent of the ciphertexts in each query (β = 40%) the number of possible permutation is: Γ =  = 210.
                                                                                                  4
               This means, there are 210 distinct possible choices for the query algorithm to ask for the same keyword. In
               other words, the probability of having 2 queries with same permutation of the ciphertexts for w i is 0.047% (  1
                                                                                                       210
               ). Note that the main idea is cloud sees all of the keywords are queried with the same probability even if the
               user starts requesting for the same keyword multiple times.
               5.3  Result completeness
               To apply the LRSE scheme we first extract keywords from documents in corpus. Technically, keywords with
               highfrequencygetextractedconsideringsomelinguisticknowledgetoavoidstopwordssuchas“the”, “for”, “if”,
               and etc. [23] . We then generate the φ vector based on the minimum value among the average of each keyword
               in the entire corpus. Further, we assume the random number generator is fair.

               Considering all of the aforementioned, there is still a rare chance of incomplete results on the paper. Assume
               the term frequency of the keyword w i in document D is f i, and the number of the available ciphertexts for the
               corresponding keyword is φ i. As long as f i ≥ φ i there is no problem (case 1), because the corresponding C
               (encrypted D) contains all of the available ciphertexts and no matter which ciphertext versions are employed
               in the user query, C (and consequently D) will be placed among the possible results.


               If f i ≤ φ i the user query may contain encrypted versions of w i that do not exist in D (case 2). Recall that
               along with the query, the user also sends the parameter k to retrieve only the top-k documents. In case 2 if k
               is relativity less than the number of documents that includes w i there is still no problem because D’s relevance
               score to the query is very low and even if D had all of the encrypted versions of w i, it would not be placed
               among the top-k files (case 3). Note that number of documents that f i ≤ φ i should be very low, otherwise the
               keyword extraction algorithm would not choose w i as a keyword in the first place.

               The problem occurs when the user asks for all of the documents that includes keyword w i (case 4). In this
               condition, the results may be incomplete, because the user employs a portion of the available ciphertexts which
               may not be used in D. For example, consider we have a corpus with four documents (d 1 , d 2 , d 3 , d 4), and the
               term frequency of keyword w i in each document is (25,32,6,29). Also, consider that the number of available
               ciphertexts φ i = 10 and the portion is set to forty percent for query q (e.g., forty percent of the available
               ciphertexts are employed to generate q by the query generator). Thus, the query generator randomly selects
               four versions of available ciphertexts (forty percent of φ i). No matter what versions of the ciphertexts are
               employed in the query q, d 1, d 2, and d 4 will be placed among the possible results because the term frequency of
               keyword w i in those documents are greater than φ i, so those documents possess all of the available ciphertexts
   94   95   96   97   98   99   100   101   102   103   104