
Page 96                 Salmani et al. J Surveill Secur Saf 2020;1:79–101  I http://dx.doi.org/10.20517/jsss.2020.16


               results. In MRSE, the standard deviation (σ) plays an important role and can affect the result accuracy:
               the bigger σ is, the lower the result accuracy, and the more difficult it becomes for the cloud server to
               obtain information about the user data. Although σ is not a parameter to be set up in LRSE, we calculate the
               standard deviation of the document vectors after applying LRSE. The average σ for document vectors in
               LRSE is 1.34, with a minimum of 0.97. Compared to MRSE (when σ is 1), LRSE is 7% more accurate than
               MRSE, and this difference would increase if the average LRSE σ decreased to 1.

               In both LRSE and MRSE, as top-k increases, the result accuracy decreases. In MRSE, this is because
               dummy keywords can affect the similarity scores (dummy keywords may reduce the scores of some documents
               that belong in the real top-k results, or increase the scores of some documents outside the real top-k results).
               In LRSE, as top-k increases, documents with less relevance to the query are placed in the result set. The
               frequency of the queried keywords in some documents is insufficient to cover all of the available ciphertexts for
               the corresponding keyword; thus, when the query asks for the missed ciphertext versions, those documents
               do not enter the result set even though they contain the required keywords. In Section 5.3 we propose
               injecting the missed ciphertext versions to prevent this problem. However, the results show that LRSE loses less
               than 1% accuracy from top-3 to top-30, which is tolerable. More importantly, this happens only to documents
               that are less relevant to the query.


               6.2  Document Vectors
               6.2.1  Entropy of Document Vectors
               We employed Shannon entropy to calculate the entropy of the original and LRSE document vectors:
               H(V) = −∑_{i=1}^{n} p_i log_2(p_i), where V is the document vector, n is the number of keywords, and
               p_i is the probability of keyword i.
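
               The entropy computation above can be sketched as follows. This is an illustrative implementation, not the authors' code; the example vectors are hypothetical.

```python
import math

def shannon_entropy(freqs):
    """H(V) = -sum(p_i * log2(p_i)) over a keyword-frequency vector.

    p_i is the keyword's frequency divided by the total frequency;
    zero-frequency keywords contribute nothing to the sum.
    """
    total = sum(freqs)
    probs = [f / total for f in freqs if f > 0]
    return -sum(p * math.log2(p) for p in probs)

# A vector dominated by one keyword has lower entropy than a balanced one,
# which is exactly what makes it easier for an attacker to fingerprint.
skewed = [72, 3, 2, 1]
balanced = [20, 19, 21, 18]
print(shannon_entropy(skewed) < shannon_entropy(balanced))  # True
```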

               To measure the entropy progress of LRSE, we define the entropy improvement H_imp = (H(V_li) − H(V_oi)) / H(V_oi),
               where H(V_li) is the entropy of the document i vector in LRSE and H(V_oi) is the entropy of the original
               vector of the same document.
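
               A minimal sketch of the H_imp measure, using hypothetical original and LRSE vectors (the vectors below are illustrative, not taken from the paper's dataset):

```python
import math

def entropy(freqs):
    """Shannon entropy of a keyword-frequency vector."""
    total = sum(freqs)
    return -sum((f / total) * math.log2(f / total) for f in freqs if f > 0)

def entropy_improvement(original, lrse):
    """H_imp = (H(V_li) - H(V_oi)) / H(V_oi), the relative entropy gain."""
    h_o, h_l = entropy(original), entropy(lrse)
    return (h_l - h_o) / h_o

# Hypothetical example: LRSE splits the dominant frequency 72 into
# parts near the average, flattening the vector and raising its entropy.
orig_vec = [72, 25, 24, 23]
lrse_vec = [22, 24, 26, 25, 24, 23]
print(f"H_imp = {entropy_improvement(orig_vec, lrse_vec):.2f}")
```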


               In Section 5.1 we proved that the entropy of the document vectors generated by LRSE is greater than
               or equal to the entropy of the original vectors. The simulation results support our theorem and show at least a
               25% entropy improvement in all of the documents, and around 90% in some documents. Figure 6 demonstrates
               the entropy improvement of the first 20 documents.

               Note that some documents, such as “Document 6” in Figure 6, may possess a high frequency of certain keywords
               because they specifically discuss a special topic. For example, legal terminologies are used heavily in congress
               documents, which increases their frequencies, drastically reduces the entropy of the document vectors,
               and threatens owner/user privacy. In LRSE, we break these high-frequency occurrences down into a couple of
               frequencies in the average frequency range (see Section 4.1). For example, assume the frequency of keyword w_i
               in document D_j is 72 and the threshold is τ = 25; then φ_i = ⌈72/25⌉ = 3. Hence, LRSE divides the frequency
               of w_i into 3 smaller parts (say 22, 24, 26) that are close to the average frequency (25), and then generates
               3 ciphertexts using the chaining notion for each part. Because of this LRSE feature, our simulation results
               show a 90% entropy improvement in some of the documents.
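
               The frequency break-down step can be sketched as follows. This is a simplified illustration of the splitting rule φ_i = ⌈freq/τ⌉, not the paper's implementation; the even-split strategy is an assumption (the paper's example uses the slightly uneven parts 22, 24, 26).

```python
import math

def split_frequency(freq, tau):
    """Split a high keyword frequency into phi = ceil(freq / tau) parts
    close to the average frequency, so the parts sum back to freq.
    Sketch of LRSE's break-down of dominant frequencies."""
    phi = math.ceil(freq / tau)
    base, rem = divmod(freq, phi)
    # Spread the remainder over the first `rem` parts so every part
    # is within 1 of the average and the total is preserved.
    return [base + (1 if i < rem else 0) for i in range(phi)]

print(split_frequency(72, 25))  # → [24, 24, 24] (3 parts summing to 72)
```

               Each of the resulting parts would then be encrypted separately (one ciphertext per part, chained as described in Section 4.1), which is what flattens the frequency distribution and raises the vector's entropy.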

               6.2.2  Standard Deviation of Document Vectors
               A low standard deviation (σ) indicates that most of the keyword frequencies are very close to the average,
               and consequently to each other. Note that the keywords can be deduced or identified in a strong attack model
               in which the cloud server is equipped with additional knowledge, such as the term frequency statistics of the
               document collection [16]. For example, the frequency of economic terminologies is much higher than that of
               other keywords in a budget document. Thus, the closer the keyword frequencies become to each other, the more difficult