
Page 96                 Salmani et al. J Surveill Secur Saf 2020;1:79–101  I http://dx.doi.org/10.20517/jsss.2020.16


               results. In MRSE, the standard deviation (σ) plays an important role and can affect the result accuracy:
               the bigger σ is, the lower the result accuracy, and the more difficult it becomes for the cloud server to
               obtain information about the user data. Although σ is not a parameter to be set up in LRSE, we calculate the
               standard deviation of the document vectors after applying LRSE. The average σ for document vectors in
               LRSE is 1.34, with a minimum of 0.97. Compared to MRSE (when σ is 1), LRSE is 7% more accurate than
               MRSE, and this difference would increase if the average LRSE σ decreased to 1.

               In both LRSE and MRSE, as top-k increases, the result accuracy decreases. In MRSE, this is because
               dummy keywords can affect the similarity scores (dummy keywords may reduce the scores of some documents
               that belong in the real top-k results, or increase the scores of some documents outside the real top-k results).
               In LRSE, as top-k increases, documents with less relevance to the query are placed in the result set. The
               frequency of the queried keywords in some documents is insufficient to cover all of the available ciphertexts for
               the corresponding keyword; thus, when the query asks for the missed ciphertext versions, those documents
               do not enter the result set even though they contain the required keywords. In Section 5.3 we propose
               injecting the missed ciphertext versions to prevent this problem. However, the results show that LRSE loses less
               than 1% accuracy from top-3 to top-30, which is tolerable. More importantly, this happens only to documents
               that are less relevant to the query.


               6.2  Document Vectors
               6.2.1  Entropy of Document Vectors
               We employed Shannon entropy to calculate the entropy of the original and LRSE document vectors:
               H(V) = −∑_{i=1}^{n} p_i log_2(p_i), where V is the document vector, n is the number of keywords, and
               p_i is the probability of keyword i.
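
               The entropy computation above can be sketched as follows. This is an illustrative implementation, not the authors' code; the example vectors are hypothetical.

```python
import math

def shannon_entropy(freqs):
    """H(V) = -sum(p_i * log2(p_i)) over a keyword-frequency vector.

    p_i is the keyword's frequency divided by the total frequency;
    zero-frequency keywords contribute nothing to the sum.
    """
    total = sum(freqs)
    probs = [f / total for f in freqs if f > 0]
    return -sum(p * math.log2(p) for p in probs)

# A vector dominated by one keyword has lower entropy than a balanced one,
# which is exactly what makes it easier for an attacker to fingerprint.
skewed = [72, 3, 2, 1]
balanced = [20, 19, 21, 18]
print(shannon_entropy(skewed) < shannon_entropy(balanced))  # True
```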

               To measure the entropy progress of LRSE, we define the entropy improvement H_imp = (H(V_li) − H(V_oi)) / H(V_oi),
               where H(V_li) is the entropy of the document i vector in LRSE and H(V_oi) is the entropy of the original
               vector of the same document.
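
               A minimal sketch of the H_imp measure, using hypothetical original and LRSE vectors (the vectors below are illustrative, not taken from the paper's dataset):

```python
import math

def entropy(freqs):
    """Shannon entropy of a keyword-frequency vector."""
    total = sum(freqs)
    return -sum((f / total) * math.log2(f / total) for f in freqs if f > 0)

def entropy_improvement(original, lrse):
    """H_imp = (H(V_li) - H(V_oi)) / H(V_oi), the relative entropy gain."""
    h_o, h_l = entropy(original), entropy(lrse)
    return (h_l - h_o) / h_o

# Hypothetical example: LRSE splits the dominant frequency 72 into
# parts near the average, flattening the vector and raising its entropy.
orig_vec = [72, 25, 24, 23]
lrse_vec = [22, 24, 26, 25, 24, 23]
print(f"H_imp = {entropy_improvement(orig_vec, lrse_vec):.2f}")
```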


               In Section 5.1 we proved that the entropy of the document vectors generated by LRSE is greater than
               or equal to the entropy of the original vectors. The simulation results support our theorem and show at least a
               25% entropy improvement in all of the documents, and around 90% in some documents. Figure 6 demonstrates
               the entropy improvement of the first 20 documents.

               Note that some documents, such as “Document 6” in Figure 6, may possess a high frequency of certain keywords
               because they specifically discuss a special topic. For example, legal terminologies are used heavily in congress
               documents, which increases their frequencies, drastically reduces the entropy of the document vectors,
               and threatens owner/user privacy. In LRSE, we break these high-frequency occurrences down into a couple of
               frequencies in the average frequency range (see Section 4.1). For example, assume the frequency of keyword w_i
               in document D_j is 72 and the threshold is τ = 25; then φ_i = ⌈72/25⌉ = 3. Hence, LRSE divides the frequency
               of w_i into 3 smaller parts (say 22, 24, 26) that are close to the average frequency (25), and then generates
               3 ciphertexts using the chaining notion for each part. Because of this LRSE feature, our simulation results
               show a 90% entropy improvement in some of the documents.
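
               The frequency break-down step can be sketched as follows. This is a simplified illustration of the splitting rule φ_i = ⌈freq/τ⌉, not the paper's implementation; the even-split strategy is an assumption (the paper's example uses the slightly uneven parts 22, 24, 26).

```python
import math

def split_frequency(freq, tau):
    """Split a high keyword frequency into phi = ceil(freq / tau) parts
    close to the average frequency, so the parts sum back to freq.
    Sketch of LRSE's break-down of dominant frequencies."""
    phi = math.ceil(freq / tau)
    base, rem = divmod(freq, phi)
    # Spread the remainder over the first `rem` parts so every part
    # is within 1 of the average and the total is preserved.
    return [base + (1 if i < rem else 0) for i in range(phi)]

print(split_frequency(72, 25))  # → [24, 24, 24] (3 parts summing to 72)
```

               Each of the resulting parts would then be encrypted separately (one ciphertext per part, chained as described in Section 4.1), which is what flattens the frequency distribution and raises the vector's entropy.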

               6.2.2  Standard Deviation of Document Vectors
               A low standard deviation (σ) indicates that most of the keyword frequencies are very close to the average,
               and consequently to each other. Note that the keywords can be deduced or identified in a strong attack model
               in which the cloud server is equipped with additional knowledge, such as the term frequency statistics of the
               document collection [16]. For example, the frequency of economic terminologies is much higher than that of
               other keywords in a budget document. Thus, the closer the keyword frequencies become to each other, the more difficult