Page 95 - Read Online
P. 95

Page 88                 Salmani et al. J Surveill Secur Saf 2020;1:79–101  I http://dx.doi.org/10.20517/jsss.2020.16


               asymmetricinner-productmechanism [14] whichpreventsthecloudfromlearningrelevancescorebetweentwo
               encrypted documents, and is only able to calculate the relevance score between the encrypted query vector Q ˆ
               and encrypted document vectors T. The following describes scheme II in detail:
                                           ˆ

                • Setup.This part of the scheme consists of the same steps as Scheme I. We first extracts the corpus keyword
                  dictionary ∆ d from the entire corpus D and then generates the document-term matrix γ. Afterwards, we
                  generate the required-ciphertext vector φ. In contrast with Scheme I we do not need to actually apply
                  Algorithm 1 and generate multiple ciphertexts for each keyword w i. Instead, we apply the concept of the
                  chaining notion on the document vectors and represent high frequency keywords with more than one
                  element in the document vectors. For example, assume φ i = 5, so keyword w i is represented with 5 positions
                  in the document vectors. This property has less system cost and improves the system efficiency. Then we
                  call KeyGen algorithm to generate a secret key K that will be used to encrypt the documents.

                • KeyGen. The key generation algorithm is slightly different from the previous scheme. Beside a secret key
                  K, the KeyGen should create a (d ×d ) invertible random matrix m where d is the sum of all elements in φ
                                                 0
                                             0
                                                                                0
                                                                            λ
                  (i.e. all of needed encrypted versions). The data owner runs KeyGen(1 , d ) algorithm. The LRSE KeyGen
                                                                               0
                  algorithm takes in a security parameter λ and a number d and it returns a secret key K and an invertible
                                                                  0
                  (d × d ) matrix m. This matrix will be used to encrypt the query and document vectors, so like the secret
                       0
                    0
                  key, it should be protected.
                • BuildIndex. The LRSE BuildIndex(D, φ) takes in a document corpus and a required-ciphertext vector φ
                  and outputs the plain index I. This algorithm first generates a normalized document vector T i for each doc-
                  ument D i based on t f-idf mechanism [19] . Note that each keyword w i is presented with multiple positions
                  in the document vector based on value of φ i. Then using the document vector set T, the algorithm builds
                  a plain index. We adopt a tree-based index structure called keyword balanced binary (KBB) tree proposed
                  by Xia et al. [16] . Their secure index structure uses a “Greedy Depth-First Search(GDFS)” algorithm to find
                  the most related nodes (documents) to the query.

                  In the KBB tree, each node u consists of 5 elements: node ID, two pointers to the left and right child of
                  node u, document ID (set to null in case u is an internal node), and the last element is a vector T i that
                  denotes the normalized “t f ×idf” values of the keywords in the document D i. If u is an internal node, each
                  element in vector T i gets the maximum value of the corresponding keyword among u’s child nodes.

                  In the index construction phase, we generate a node for each document in the corpus. These nodes are the
                  index tree’s leaves. The internal nodes are generated by generating a parent node for each two nodes (same
                  strategy we apply to build a balanced binary tree). The parent node gets the highest value from its children
                  for each element in its T i vector.

                  Figure 3 provides an example of the index tree that is applied in our approach. The corpus in this example
                  consists of 6 files(documents) f 1 ,. . ., f 6 and 4 keywords(cardinality of the dictionary d = 4). Assuming the
                  query vector Q is equal to (0,0.83,0,0.24) and parameter k is set to 3 (i.e. number of documents returned
                  to the data user). The figure shows the search process. The search process starts from the root node n and
                  calculates the relevance score of the n 11 (1.07) and n 12 (0.428) to the query vector and moves to the n 11
                  due to its higher relevance score. The search continues and reaches leaf node f 4 with relevance score of
                  1.07 and then reaches f 3 and f 2 with 1.127 and 0.498 relevance scores, respectively. Afterwards, the search
                  algorithm gets to the f 1 with 0.524 relevance score and replace f 3 in the result list (because we should return
                  the top-3 relevant files and f 1 relevance score is higher than f 3 score). Finally, the search algorithm goes
                  back to n 12 node (based on Depth First Search algorithm) and stops there because the relevance score of
                  the n 12 (0.428) is less than the minimum relevance score (0.524) in the result list. Note that in practice T
                  and Q vectors are encrypted using secret key K and matrix m. We refer the reader to Xia et al. [16]  for more
                  information about the index structure.
   90   91   92   93   94   95   96   97   98   99   100