Page 88 Salmani et al. J Surveill Secur Saf 2020;1:79–101 I http://dx.doi.org/10.20517/jsss.2020.16
asymmetric inner-product mechanism [14], which prevents the cloud from learning the relevance score between two encrypted documents; the cloud is only able to calculate the relevance score between the encrypted query vector Q̂ and the encrypted document vectors T̂. The following describes Scheme II in detail:
• Setup. This part of the scheme consists of the same steps as Scheme I. We first extract the corpus keyword dictionary ∆_d from the entire corpus D and then generate the document-term matrix γ. Afterwards, we generate the required-ciphertext vector φ. In contrast with Scheme I, we do not need to actually apply Algorithm 1 and generate multiple ciphertexts for each keyword w_i. Instead, we apply the chaining notion to the document vectors and represent high-frequency keywords with more than one element in the document vectors. For example, if φ_i = 5, keyword w_i is represented by 5 positions in the document vectors. This approach incurs less system cost and improves the system's efficiency. Then we call the KeyGen algorithm to generate a secret key K that will be used to encrypt the documents.
• KeyGen. The key generation algorithm is slightly different from that of the previous scheme. Besides a secret key K, KeyGen should create a (d′ × d′) invertible random matrix m, where d′ is the sum of all elements in φ (i.e., the total number of needed encrypted versions). The data owner runs the KeyGen(1^λ, d′) algorithm. The LRSE KeyGen algorithm takes in a security parameter λ and a number d′, and it returns a secret key K and an invertible (d′ × d′) matrix m. This matrix will be used to encrypt the query and document vectors, so, like the secret key, it must be protected.
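The KeyGen step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the choice of a random byte string for K, and rejection sampling for the invertible matrix are our assumptions.

```python
import numpy as np

def keygen(security_param, d_prime, seed=None):
    """Sketch of KeyGen(1^lambda, d'): return a symmetric key K and an
    invertible (d' x d') random matrix m. Illustrative only."""
    rng = np.random.default_rng(seed)
    # Symmetric key for document encryption: security_param random bits
    # (in practice this would be, e.g., an AES key).
    K = rng.integers(0, 256, size=security_param // 8, dtype=np.uint8).tobytes()
    # Sample random matrices until one is (numerically) invertible.
    while True:
        m = rng.standard_normal((d_prime, d_prime))
        if abs(np.linalg.det(m)) > 1e-6:
            return K, m

K, m = keygen(128, d_prime=6, seed=0)
m_inv = np.linalg.inv(m)  # the inverse is what makes scoring recoverable
```

Because m (and its inverse) can undo the vector encryption, it must be kept secret along with K, as noted above.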
• BuildIndex. The LRSE BuildIndex(D, φ) takes in a document corpus and a required-ciphertext vector φ and outputs the plain index I. This algorithm first generates a normalized document vector T_i for each document D_i based on the tf-idf mechanism [19]. Note that each keyword w_i is represented by multiple positions in the document vector according to the value of φ_i. Then, using the document vector set T, the algorithm builds a plain index. We adopt a tree-based index structure called the keyword balanced binary (KBB) tree proposed by Xia et al. [16]. Their secure index structure uses a "Greedy Depth-First Search" (GDFS) algorithm to find the nodes (documents) most relevant to the query.
In the KBB tree, each node u consists of 5 elements: a node ID, two pointers to the left and right children of u, a document ID (set to null if u is an internal node), and a vector T_i that holds the normalized tf × idf values of the keywords in document D_i. If u is an internal node, each element of T_i takes the maximum value of the corresponding keyword among u's child nodes.
In the index construction phase, we generate a node for each document in the corpus; these nodes become the index tree's leaves. The internal nodes are created by generating a parent node for every two nodes (the same strategy used to build a balanced binary tree). For each element of its T_i vector, the parent node takes the highest value among its children.
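The bottom-up construction described above can be sketched as follows, on plaintext vectors. The class and function names are ours; the node fields (node ID, child pointers, document ID, vector T) and the element-wise-maximum rule for internal nodes follow the description.

```python
import numpy as np

class Node:
    """One KBB-tree node: node ID, child pointers, document ID
    (None for internal nodes), and the tf-idf vector T."""
    def __init__(self, nid, left=None, right=None, doc_id=None, T=None):
        self.nid, self.left, self.right = nid, left, right
        self.doc_id = doc_id
        if T is None:
            # Internal node: element-wise maximum over the children.
            T = np.maximum(left.T, right.T if right is not None else left.T)
        self.T = T

def build_kbb(doc_vectors):
    """Build a balanced binary index tree bottom-up: the documents become
    the leaves, and each parent node covers a pair of nodes."""
    nodes = [Node(f"f{i+1}", doc_id=i, T=np.asarray(v, dtype=float))
             for i, v in enumerate(doc_vectors)]
    nid = 0
    while len(nodes) > 1:
        parents = []
        for j in range(0, len(nodes), 2):
            left = nodes[j]
            right = nodes[j + 1] if j + 1 < len(nodes) else None
            nid += 1
            parents.append(Node(f"n{nid}", left=left, right=right))
        nodes = parents
    return nodes[0]  # the root of the index tree
```

By construction, the root's vector holds the column-wise maximum over all document vectors, which is what allows subtree pruning during search.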
Figure 3 provides an example of the index tree applied in our approach. The corpus in this example consists of 6 files (documents) f_1, ..., f_6 and 4 keywords (the cardinality of the dictionary is d = 4). Assume the query vector Q equals (0, 0.83, 0, 0.24) and the parameter k is set to 3 (i.e., the number of documents returned to the data user). The figure shows the search process, which starts from the root node n, calculates the relevance scores of n_11 (1.07) and n_12 (0.428) with respect to the query vector, and moves to n_11 because of its higher relevance score. The search continues and reaches leaf node f_4 with a relevance score of 1.07, and then reaches f_3 and f_2 with relevance scores of 1.127 and 0.498, respectively. Afterwards, the search algorithm gets to f_1 with a relevance score of 0.524, which replaces f_2 in the result list (because we should return the top-3 relevant files and f_1's relevance score is higher than f_2's). Finally, the search algorithm goes back to node n_12 (following the depth-first search strategy) and stops there, because the relevance score of n_12 (0.428) is less than the minimum relevance score (0.524) in the result list. Note that in practice the T and Q vectors are encrypted using the secret key K and matrix m. We refer the reader to Xia et al. [16] for more information about the index structure.
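The GDFS traversal walked through above can be sketched as follows, again on plaintext vectors (in the actual scheme the scores are computed over the encrypted T̂ and Q̂). The node layout and names are our illustrative assumptions; the pruning rule relies on internal-node vectors upper-bounding every leaf in their subtree.

```python
from dataclasses import dataclass
import heapq
import numpy as np

@dataclass
class Node:
    T: np.ndarray          # tf-idf vector (element-wise max for internal nodes)
    doc_id: object = None  # None for internal nodes
    left: "Node" = None
    right: "Node" = None

def gdfs_search(root, query, k):
    """Greedy depth-first search: descend into the higher-scoring child
    first, keep a min-heap of the top-k leaves seen so far, and prune any
    subtree whose (upper-bound) score cannot beat the current k-th best."""
    query = np.asarray(query, dtype=float)
    topk = []  # min-heap of (score, doc_id)
    def visit(node):
        score = float(node.T @ query)  # inner-product relevance score
        if len(topk) == k and score <= topk[0][0]:
            return  # prune: even the subtree's upper bound is too low
        if node.doc_id is not None:    # leaf: a real document
            heapq.heappush(topk, (score, node.doc_id))
            if len(topk) > k:
                heapq.heappop(topk)    # drop the current minimum
            return
        children = [c for c in (node.left, node.right) if c is not None]
        # Greedy step: explore the higher-scoring child first.
        children.sort(key=lambda c: float(c.T @ query), reverse=True)
        for c in children:
            visit(c)
    visit(root)
    return sorted(topk, reverse=True)  # (score, doc_id) pairs, best first
```

As in the Figure 3 walkthrough, the search terminates as soon as a sibling subtree's score falls below the minimum score in the current result list.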