Keywords: Geospatial documents, multilevel hashing, clustering, semantic, feature, ranking.
The past decade has witnessed rapid improvement in the field of information technology and computing. This improvement enables us to present spatial information in digitized format. The growth and boom of the internet allow its users to access huge amounts of spatial information. However, the variety of data sources and the diversity of spatial data pose a great challenge for users who wish to discover knowledge from a large geospatial database, because such a database contains much redundant and irrelevant information. Therefore, it is important to have intelligent as well as efficient information retrieval techniques.
To meet this requirement, this paper proposes a new SFAIR framework for fast and effective information retrieval. Figure 1 shows the overall flow of the SFAIR framework.
Figure 1: Overall flow of SFAIR
The documents collected from various sources are clustered and indexed for fast retrieval. Clustering is a technique to identify inherent groupings of the spatial text documents. Normally, clustering of text documents depends on term frequency to detect the importance of a term in a document. Though term frequency is widely used, it is not always reliable: two or more terms in a document may have the same frequency, yet one term may contribute more to the meaning of the sentence than the others. Hence, a context-based clustering technique named CQW is used to cluster the documents. CQW captures the semantic structure of each term present within the sentence as well as in the document. The following terms are widely used in this article.
Verb argument structure: Consider the statement "The old man helped the young man". Here the old man and the young man are arguments of the verb "helped".
Label: The arguments present in a sentence are referred to as labels. In the above example, "The old man" is the agent, i.e. the subject, and "the young man" is the object.
Terms: Terms can be either an argument or verb. It can also be either a phrase or a word.
Concept: A term that is labeled through the semantic role labeler
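The definitions above can be illustrated with a small sketch. The data layout, role names, and field names below are illustrative assumptions, not the authors' actual semantic role labeler output format.

```python
# A hand-labeled verb argument structure for the example sentence
# "The old man helped the young man". The dictionary layout is a
# hypothetical representation for illustration only.

def label_sentence():
    """Return the labeled verb argument structure of the example sentence."""
    return {
        "sentence": "The old man helped the young man",
        "verb": "helped",
        "arguments": [
            {"label": "agent", "term": "The old man"},      # the subject
            {"label": "object", "term": "the young man"},   # the object
        ],
    }

structure = label_sentence()
# The labeled arguments are the concepts this sentence contributes.
concepts = [arg["term"] for arg in structure["arguments"]]
print(concepts)  # ['The old man', 'the young man']
```

Each labeled term (word or phrase) becomes a concept for the later CTF computation.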
A term that contributes more meaning to the sentence is identified through a semantic role labeler. Such terms are referred to as concepts; they can be either words or phrases, depending entirely on the sentence's semantic structure. In addition, a context-based similarity measure is introduced to detect document similarity, on which document clustering depends. Though the clustering process narrows the search area for retrieval, finding the location of the documents is still difficult. Hence, store-aware document maintenance is required for fast and relevant retrieval of information. To meet this requirement, this article implements a multilevel hashing technique to assist the fast retrieval of the clustering technique. The keywords in the documents of the corpus, their associated URLs, and the frequency of each keyword are the input for the hashing technique.
The clustering and indexing process makes the corpus ready for retrieval. On receiving a query, the information retrieval component triggers the FPD technique and finds the feature set that corresponds to the keywords present in the user query. The keywords in the query are used to detect the cluster. The selected features and the keywords together act as the search terms. These search terms are looked up in the hash table, and the corresponding document URLs are determined. The documents at those URLs are processed to detect keyword density, which is the percentage of the number of times a keyword appears in the document relative to the total number of words in the document. In addition to keyword density, the probability of the occurrence of the keyword in the document is also used to retrieve the relevant documents. These two values are used to estimate the document weight. The documents having the highest weight are regarded as the most related documents. The weighted documents are retrieved and ranked for display to the user.
The SD technique ranks the retrieved documents. The ranking is based purely on the semantics of the context. SD computes a score to prioritize the documents in the corpus. An effective ranking algorithm in a search engine supports an efficient retrieval process. The principle-based probability approach is a traditional technique for ranking documents. This article uses lexical similarity together with synonym, hypernym, hyponym, holonym, and meronym determination to compute the score for each document. Lexical similarity measures the degree to which two word sets are similar. Synonym determination analyzes the exact meaning of the user's concern. The broad meanings of the words in the documents are analyzed and resolved through hypernym determination. Hyponym determination gives a more specific meaning for a particular word. The whole of which a concept forms a part is detected through holonym determination, while meronym determination discovers a part or member of a concept. All these determinations are incorporated to find the scores for the retrieved documents. The documents with the highest scores are prioritized and displayed to the user.
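The way these determinations could combine into one score can be sketched as follows. The relation sets and the equal weighting of the five relations are illustrative assumptions; the article's actual weighting is given later by equations (10)-(12).

```python
# A sketch of combining lexical relation determinations into one document
# score: count how many members of each relation set appear in the
# document and sum the counts. Equal weighting is an assumption.

def relation_score(doc_words, relation_sets):
    """Sum, over all relation sets, the number of set members found
    in the document's word set."""
    doc = set(doc_words)
    return sum(len(doc & rel) for rel in relation_sets.values())

# Hypothetical relation sets for the query word "mountain".
relations = {
    "synonyms": {"mount", "peak"},
    "hypernyms": {"landform"},
    "hyponyms": {"volcano", "alp"},
    "holonyms": {"range"},
    "meronyms": {"summit", "slope"},
}

doc = "the volcano rose above the range its summit capped with snow".split()
print(relation_score(doc, relations))  # 3 (volcano, range, summit)
```

In practice the relation sets would come from a lexical resource such as WordNet rather than being hand-listed.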
The rest of this article is structured as follows: Section 2 discusses various works proposed in different articles that are related to the SFAIR framework. The four components of the SFAIR framework are discussed in detail in Section 3. Experimental results of the proposed system are presented in Section 4. Finally, Section 5 concludes the proposed work with a summary along with suggestions for future work.
This section deals with existing works by various researchers related to the components of the proposed SFAIR framework. It discusses works associated with text document clustering, indexing, similarity measures, and ranking. The following subsections give the corresponding details.
For clustering text documents,  used frequent item sets, where clusters are indicated through the mutual overlap of frequent sets with respect to their supporting documents. That article proposed the following two techniques for clustering documents.
Frequent Term-based Clustering (FTC): This follows a bottom-up approach and produces a flat clustering
Hierarchical Frequent Term-based Clustering (HFTC): This method generates a hierarchical document clustering
Similarly,  proposed a text clustering measure named Term Contribution (TC), which incorporated an unsupervised feature selection technique. The Iterative Feature Selection (IFS) method of that paper performed feature selection and clustering iteratively within a unified framework. This technique resolves problems faced by supervised feature selection methods, such as the unavailability of the needed class label information. Another technique, found in , combined the following two well-known approaches for document clustering.
Synsets extended the vector space representation to explore the semantics
The BiSec-k-Means algorithm was used for clustering based on the similarity between documents, exploited through the cosine measure
Conceptual descriptions extracted from the formed clusters were used for cluster arrangement
The article  proposed a phrase-based document similarity through the Suffix Tree Document (STD) model for text clustering. The phrases of a document are mapped from a suffix tree into an M-dimensional Vector Space Document (VSD) model, which connects the similarity between documents. The vector space based model suffers from the disadvantage that it ignores the meaning of natural language, since it eliminates word sequence. To overcome this problem,  introduced two novel clustering algorithms.
Clustering based on Frequent Word Sequence (CFWS)
Clustering based on Frequent Word Meaning Sequences (CFWMS)
These algorithms depend on frequent word sequences discovered through a Generalized Suffix Tree (GST). An effective concept-based clustering technique in  enhanced the clustering of text documents using document similarity. In addition, that paper proposed a concept-based document similarity technique, which measures similarity at the sentence, document, and corpus levels.
Paper  introduced fuzzy self-constructing feature clustering (FFC), an incremental clustering algorithm, to cluster text documents. The similarity is defined by considering both the mean and the variance of the cluster. Another fuzzy-based clustering method was proposed in , which operates on data in the form of a square matrix of pairwise similarities, with the data represented as a graph. This algorithm operates within the Expectation-Maximization framework.
The majority of clustering techniques were focused on and proposed for centralized environments. Though a few clustering techniques existed for clustering documents in distributed environments, they failed to scale to highly disseminated settings. A peer-based approach compared the documents without losing significant clustering quality.
An enormous number of indexing techniques have been proposed over the previous decades. All those techniques assumed that the text to be indexed contains either little or no structure; therefore, they became unsuitable for structured documents. With this in mind, the following two approaches were discussed in  for accessing documents.
The positions of words and elements, i.e. word offsets, are used to evaluate the query. This allows slightly smaller indexes
The path model, another approach to effectively estimate and resolve the given query
Automatic indexing of terms was proposed in  to locate the content location. Hashing is the most discussed technique for indexing documents. Active clustering, a method proposed in , indexes documents by identifying the major and most informative points to label and building labeled pairs. That article used data uncertainty to measure the informative points and a batch mode algorithm to speed up the selection. Another technique, proposed in , assisted the information retrieval process by projecting a novel semi-supervised hashing method. This method exploited information from both labeled and unlabeled data, using the square of the Euclidean distance to compute and measure the Hamming distance.
Most of the techniques discussed so far computed content relevance for indexing. These relevance determinations are normally based on general theory rather than on objective experiments. As a solution to this,  proposed a technique that finds the final index weight during the index weight update process according to the term weight.
Usually, queries regarding geographic information are composed of keywords and a location. In response to a geographic query, the search engine retrieves the set of documents that are most spatially or textually relevant to the keywords and location of the given query. Effective indexing is therefore needed for search engines to handle both the textual and the spatial aspects of documents and to answer geospatial queries. To solve this problem,  proposed an effective geographical search engine named Global Atlas, which builds on the cartographic paradigm and provides natural support for indexing. As an alternative to the above technique, documents were indexed according to geographic relevance in . For document indexing, that paper followed the steps given below.
Detect the geographically related documents
Identify the location identifiers associated with the document
Detect the locations within a predetermined range associated with the document
Index the document as determined in the previous step
Figure 2: Architecture of SFAIR Framework
In addition to the above methods,  designed an IR-Tree for indexing. This technique performs the following tasks: spatial filtering, textual filtering, relevance computation, and ranking. This enables effective searching of documents for a given query.
The core problem of clustering is measuring the similarity among documents. Traditional algorithms used similarity measures in a very limited way. Five different techniques, namely (1) the dictionary-based approach, (2) the Roget-structured thesauri based approach, (3) the WordNet and other semantic network based approach, (4) taxonomic path length, and (5) the content-based integrated approach, were compared in  to determine lexical similarity. The comparison results in  showed that the content-based similarity measure performs better than the other four techniques.
Though the content-based technique was found to be superior, it needed to be made flexible enough to access complicated data set structures. To handle such structures,  proposed a multiple similarity technique. This measure derived the concept of local modifications, combined them with repeated modifications of local similarity measurements, and generated a hierarchical clustering. Measuring similarity requires discovering keywords, features, and other credential information. To find those details,  proposed a technique that incorporates the vector space model and language modeling methods.
A semantic similarity among tags was proposed in , which studies the pairwise relationships between tags, resources, and users. An observation made in  is that information retrieval techniques can be enhanced semantically; therefore, that paper suggested the use of a semantics-based similarity measure.
Various methodologies have been proposed in past decades to improve and enhance the information retrieval process. Among those techniques, ranking is the most promising method to enhance the search operation. Measuring the closeness of the retrieved sites for a given query was discussed, and a new approach based on ontology concepts was proposed, in . Following these techniques, a flexible yet simple transductive meta-algorithm was proposed in  with the idea of adapting a training procedure to the documents that require ranking. An empirical study in  measured the influence that document ranking performance has on ranking candidate experts.
A technique for optimizing web search was proposed in , which tremendously reduces the need for long search iterations; that paper mainly focused on query disambiguation. Two novel ranking algorithms proposed in  depend on reinforcement learning concepts; a score for each web page is determined through a state value function to rank the page. A hash-based ranking technique called Rank Hash Similarity (RHS), proposed in , makes the similarity search process efficient. A new notation for web page ranking was introduced in , which uses an information content approach based on user web logs.
The corpus holds an enormous number of documents collected from various sources, and retrieving documents for queries from such a corpus is a daunting task. Therefore, the authors of this paper propose a technique named SFAIR, which helps to retrieve the most relevant documents in a short duration. The architecture of the proposed technique is shown in Figure 2. The overall architecture consists of the following components: (1) Clustering, (2) Indexing, (3) Retrieval, and (4) Ranking.
The subsequent subsections describe the process of each component present in the architecture diagram.
The documents collected from various sources may have different contexts. Some documents may hold details about forests, while others hold mountain details. Likewise, the GIS corpus has various documents covering different contexts. If a user queries about a mountain, the retrieval system would otherwise search all the documents to find those related to the query. It is not desirable to search the documents that hold details about rivers or forests, since doing so consumes more time in finding the relevant documents for the given query. Therefore, to reduce the time required for the retrieval process, the authors cluster the documents that belong to the same context; e.g. the documents that hold details about mountains are grouped under a single cluster. Similarly, various clusters are framed depending on the context of the documents. To cluster the documents, the authors use a technique named the Context-based Query Weighting (CQW) approach, which uses the context of the document for the clustering process.
Algorithm 1: Context-Based Analysis
It is essential for the CQW method to determine the relationship between a verb and its associated arguments present in the same sentence. The extracted information is valuable for analyzing the terms within a sentence. In addition, the authors also use the information about who is doing what to whom to detect and clarify the contribution of each term of a sentence. The CQW approach uses the semantic structure of each term present within the sentence as well as in the document. The context can be either a word or a phrase, computed at the document and sentence levels. If any document newly arrives in the corpus, its concepts are extracted and matched against the documents that are already processed. Figure 3 represents the overall flow of the CQW method, which takes a raw document as its input.
Figure 3: Overall flow of CQW approach
The raw document given as input to the CQW approach is preprocessed in order to make it suitable for the clustering process. The following steps are taken to preprocess the documents.
Document’s individual sentences are separated
HTML tags are removed from the web documents
Stop words are detected
POS tagging is the process in which the part of speech of each word in the given text is detected. The conjugation in a sentence is referred to as the object, which is used to represent the sentence semantics.
Computation of CTF and TF for clustering
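The preprocessing steps above can be sketched as follows. The stop word list is a small illustrative subset, and the regular expressions are simplifying assumptions about the input format.

```python
import re

# A minimal sketch of the preprocessing pipeline: strip HTML tags,
# separate sentences, and remove stop words. POS tagging is omitted here.

STOP_WORDS = {"the", "a", "an", "is", "of", "in", "and", "to", "which"}

def preprocess(raw_html):
    text = re.sub(r"<[^>]+>", " ", raw_html)              # remove HTML tags
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())  # separate sentences
    cleaned = []
    for s in sentences:
        words = re.findall(r"[A-Za-z']+", s.lower())
        cleaned.append([w for w in words if w not in STOP_WORDS])
    return cleaned

doc = "<p>The Himalayas extend in a massive arc. Beauty of the mountains is famous.</p>"
print(preprocess(doc))
# [['himalayas', 'extend', 'massive', 'arc'], ['beauty', 'mountains', 'famous']]
```

Each inner list holds the content words of one sentence, ready for the CTF and TF computation described next.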
The authors analyze the context at both the sentence and the document level. This supports accurate clustering of documents.
Contextual Term Frequency (CTF)
Term Frequency (TF)
Document Frequency (DF)
The CTF is computed to analyze the context at the sentence and at the document level. The CTF of a context c in a sentence s denotes the number of occurrences of c in the verb argument structures of s. A context that appears frequently in different verb argument structures of the same sentence s has a major role in contributing to the meaning of that particular sentence. At this stage, it is a local measure at the sentence level. Similarly, the authors compute the CTF value at the document level. The CTF value for a document d can be obtained through equation (1).
In equation (1), the denominator is the total number of sentences that contain the context c in d. Likewise, the term frequency TF measures the number of occurrences of c in d. In addition to CTF and TF, DF refers to the number of documents in the corpus that contain the context c. This is a local measure at the document level. Algorithm 1 gives the steps to analyze a document based on its context, as well as the process of computing CTF, TF, and DF.
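The CTF, TF, and DF computation can be sketched as follows. The input layout (each document as a list of sentences, each sentence as a list of verb argument structures) and the reading of equation (1) as an average over the sentences containing the context are assumptions for illustration.

```python
# A sketch of the Algorithm 1 quantities for one context:
#   sentence-level CTF: occurrences of the context in the verb argument
#   structures of a sentence; document-level CTF (equation 1, assumed):
#   average of the sentence-level CTFs over the sentences containing the
#   context; TF: occurrences in the document; DF: documents containing it.

def ctf_tf_df(corpus, context):
    tf, df, ctf_doc = {}, 0, {}
    for name, sentences in corpus.items():
        occurrences = 0
        sentence_ctfs = []
        for structures in sentences:
            c = sum(s.count(context) for s in structures)
            if c:
                sentence_ctfs.append(c)
            occurrences += c
        tf[name] = occurrences
        ctf_doc[name] = (sum(sentence_ctfs) / len(sentence_ctfs)
                         if sentence_ctfs else 0.0)
        if occurrences:
            df += 1
    return ctf_doc, tf, df

corpus = {
    "d1": [[["mountain", "rises"], ["mountain", "snow"]],  # one sentence, two structures
           [["river", "flows"]]],
    "d2": [[["forest", "covers"]]],
}
ctf, tf, df = ctf_tf_df(corpus, "mountain")
print(ctf["d1"], tf["d1"], df)  # 2.0 2 1
```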
With the computed values of CTF, TF, and DF, the authors distinguish the existing documents. These values express the availability of a context in a document, which helps in measuring similarity. This measure depends on matching concepts at the sentence and document levels. The importance of a concept with respect to the sentence is identified by CTF and TF.
The following factors are considered for measuring the similarity between the documents.
m: the number of matching contexts, measured for each document
n: the total number of sentences that contain the matching context Ci, computed for all documents
The CTF is computed for each context Ci in S for each document di. Similarly, for each document di, the similarity depends on DFi, the frequency over the documents. The CTF is the pre-judging factor for evaluating the similarity between documents; the key lies in how frequently the context appears in the verb argument structures. If the frequency rate is higher, then the documents are also more similar. The similarity of two documents is estimated by equation (2).
In equation (4), the value denotes the weight of context i in document d.
The TFW (term frequency weight) corresponds to the context's contribution at the document level.
CTFW represents the weight of context i in document d, expressing how far that context is semantically related. The summation of TFW and CTFW represents the effective contribution of the context in providing semantic meaning to the document.
Equation (2) shows that, for each concept in the verb argument structure of each document d, its length is evaluated and denoted as ci. Ci is evaluated by considering each verb argument structure enclosing a matched concept, and N denotes the total number of documents in the context.
In equation (3), the value denotes the weight of concept i based on the extent of context occurrence. The summation gives a precise measure of each context with respect to its semantic contribution over the entire context.
The above steps are carried out for all documents; finally, a similarity matrix is acquired for all documents. After computing the similarity between each pair of documents, clustering is done using the HAC (hierarchical agglomerative clustering) technique.
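The final clustering step can be sketched as follows. Single-link merging with a fixed similarity threshold is an illustrative simplification of HAC; the similarity matrix values are hypothetical.

```python
# A sketch of agglomerative clustering over a pairwise similarity matrix:
# repeatedly merge the first pair of clusters whose single-link similarity
# (maximum pairwise similarity) reaches the threshold.

def hac(sim, threshold):
    clusters = [{i} for i in range(len(sim))]
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = max(sim[i][j] for i in clusters[a] for j in clusters[b])
                if link >= threshold:
                    clusters[a] |= clusters[b]   # merge b into a
                    del clusters[b]
                    merged = True
                    break
            if merged:
                break
    return clusters

# Hypothetical similarity matrix for four documents: d0/d1 alike, d2/d3 alike.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
print(hac(sim, 0.5))  # [{0, 1}, {2, 3}]
```

A full HAC implementation would instead cut a merge dendrogram at the desired level; the threshold variant keeps the sketch short.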
Indexing the clustered documents
Through the above approach the documents are clustered; further, to reduce the time consumption of the retrieval process, the authors index the documents. Each document inside a cluster is indexed through the multilevel hashing technique, which helps to increase the retrieval speed. The indexing process has two stages: in the first stage a database is built, and in the second stage the query words are indexed in the hash table and the sorted table is accessed for ranking the web pages. The clusters and their associated documents present in the corpus are read thoroughly to detect the keywords. The detected keywords relate to various documents, and each document in a cluster corresponds to a URL. These URLs, along with the frequencies of the keywords, are stored in the hash table. The frequency of a keyword can be detected using the formula presented in equation (6).
Equation (6) expresses the number of times the search term appears in the document. The hashing structure for storing the URL along with the frequency of a specific keyword is shown in Figure 4.
Figure 4: Hashing of URL and frequency of the keyword
The constructed hash table shown in Figure 4 is not suitable for retrieving the most relevant document quickly, since it is organized by URL. Therefore, in order to improve performance, the hash table is rebuilt as given in Figure 5.
Figure 5: Hash table sorted by frequency
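The two-stage indexing described above can be sketched as follows. The document contents and URLs are hypothetical, and a plain dictionary stands in for the multilevel hash structure.

```python
from collections import defaultdict

# Stage 1: build a hash table mapping each keyword to (URL, frequency)
# entries for the documents containing it (Figure 4). Stage 2: rebuild
# the table with entries sorted by descending frequency (Figure 5), so
# the highest-frequency documents are found first.

docs = {
    "http://example.org/d1": "mountain snow mountain peak",
    "http://example.org/d2": "mountain river",
    "http://example.org/d3": "river forest river",
}

index = defaultdict(list)
for url, text in docs.items():
    words = text.split()
    for kw in set(words):
        index[kw].append((url, words.count(kw)))   # keyword frequency per doc

sorted_index = {kw: sorted(entries, key=lambda e: -e[1])
                for kw, entries in index.items()}

print(sorted_index["mountain"])
# [('http://example.org/d1', 2), ('http://example.org/d2', 1)]
```

Sorting once at index-build time is what lets the retrieval stage skip per-query sorting, which is the saving reported later in Tables 2 and 3.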
For the information retrieval technique, the corpus and the user query act as input. The query from the user is received through the user interface component. On receiving the query, the interface component sends it to the retrieval component. The retrieval system accesses the corresponding cluster in the corpus for the given user query to retrieve the most related documents. The Feature Probability and Density (FPD) technique is used for the retrieval process. The overall flow of the FPD technique is portrayed in Figure 6.
Figure 6: FPD technique
From Figure 6 it is evident that FPD consists of three steps, which are explained in the following subsections.
This component collects the features for all the keywords present in the indexing process. These features are trained in advance, and they are defaults for a particular keyword. This is the most important step in the information retrieval process, since based on it the most related and accurate documents are retrieved. Consider the keyword "Himalayan mountains"; for this keyword, the following features are selected to frame a feature set: snow abode, world’s highest mountain chain, awe-inspiring power, magnificent mountain, massive mountain, world highest, world’s highest peaks, permanent ice, extreme cold, fold mountain, Indo-Australian plate, fresh water, large perennial rivers, Indus Basin, Ganges-Brahmaputra Basin. These features, along with Himalayas, are considered the search terms (ST).
The density of the search terms is used to determine whether a document is relevant to the user query. The density of these search terms is computed and mathematically expressed through equation (7).
The terms in equation (7) denote the density of the search term, the frequency of the search term, the total number of words in the document, the total number of documents in the cluster, and the number of documents containing the search terms, respectively. This value gives the percentage of the number of times a keyword appears in the document relative to the total number of words in the document.
Computation of the probability of search term occurrence
In addition to the density, the probability of search term occurrence is computed from equation (8).
The document weight is computed using equations (7) and (8). It is computed for each document in the cluster. Equation (9) is used to determine the document weight.
Here, n and ST denote the total number of documents in the cluster and the number of search terms in the feature set that corresponds to the given query, respectively. The documents that obtain a document score are considered related to the given query, and they are retrieved for the ranking process, whereas the documents with the highest weight are considered the most related documents. This value is also used for the ranking process.
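Since equations (7)-(9) are not reproduced in full here, the following sketch shows one plausible reading of the FPD weight: density as a percentage of the document's words, probability as the term's relative frequency, and the weight as their sum over all search terms. Every detail of this combination is an assumption for illustration, not the article's exact formula.

```python
# A sketch of a density-plus-probability document weight, under assumed
# readings of equations (7)-(9).

def density(freq, total_words):
    # assumed equation (7): percentage of document words that are the term
    return 100.0 * freq / total_words

def probability(freq, total_words):
    # assumed equation (8): relative frequency of the term in the document
    return freq / total_words

def document_weight(doc_words, search_terms):
    # assumed equation (9): sum of density and probability over all terms
    total = len(doc_words)
    w = 0.0
    for term in search_terms:
        f = doc_words.count(term)
        w += density(f, total) + probability(f, total)
    return w

doc = "himalayan peaks hold permanent ice and extreme cold".split()
terms = ["permanent", "ice", "glacier"]
print(round(document_weight(doc, terms), 3))  # 25.25
```

A document matching more search terms, or matching them more often relative to its length, receives a higher weight and therefore ranks higher.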
The documents that have a value for the document weight are considered relevant to the user query. The ranking process is undertaken to display the most related documents ahead of other documents, and it is carried out through the semantic density (SD) technique. Semantic density computes five different values for the documents retrieved in the above step. Initially, a score is calculated to determine the query occurrence in the documents through the following steps.
The occurrence of the entire user query in the document is computed as S1
The occurrence of each word of the user query in the document is computed as S2
The synonyms of the query are examined and the corresponding score is taken as S3
The cumulative weight of the query is computed as the summation of the above three values, i.e. S1 + S2 + S3.
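The three scoring steps above can be sketched as follows. The synonym list is an illustrative assumption; in the article it would come from synonym determination.

```python
# A sketch of the query-occurrence scores: S1 counts whole-query matches,
# S2 counts per-word matches, S3 counts synonym matches, and the
# cumulative weight is their sum.

def query_scores(doc_text, query, synonyms):
    words = doc_text.lower().split()
    text = " ".join(words)
    s1 = text.count(query.lower())                           # whole query
    s2 = sum(words.count(q) for q in query.lower().split())  # each word
    s3 = sum(words.count(s) for s in synonyms)               # synonyms
    return s1, s2, s3, s1 + s2 + s3

doc = "The Himalayan mountain chain is the highest mountain chain on earth"
s1, s2, s3, total = query_scores(doc, "mountain chain", ["peak", "range"])
print(s1, s2, s3, total)  # 2 4 0 6
```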
In addition to the query occurrence, the lexical similarity is computed using equation (10). Using these values, the document score is computed.
The term in equation (10) represents the weight of the document and the query, n represents the total number of documents retrieved in the above step, and the function can be determined through equation (11).
Here, the weight of the word is represented, and LCS is the least common superconcept of the compared words. Similarly, the occurrences of the hypernym set h1, hyponym set h2, holonym set h3, and meronym set h4 in the document are computed, respectively. Therefore, the final document score can be determined using equation (12).
The documents retrieved in the above step are ordered by this score through the SD technique. The document with the highest score is considered the most relevant and is displayed first in the web page.
To test the effectiveness of the proposed SFAIR framework, several experiments are conducted for each individual component as well as for the entire framework. The experimental setup consists of a corpus containing 50 text documents. All the collected documents contain geographic information, mainly details about forests, rivers, and mountains. The following subsections present the results for each component.
The CQW approach collects the raw documents and preprocesses them before clustering. The preprocessing consists of four stages, as explained in section 3.1.1. Consider the following content of a document present in the corpus.
Himalaya is a Sanskrit word which literally means "Abode of Snow" - from hima, "snow," and alaya, "abode" - a term coined by the ancient pilgrims of India who travelled in these mountains. For Tibetans, Indians, Nepalese, and many of the other inhabitants of the Himalayas, the mountains continue to be the predominant factor in their lives. The beauty of the Himalayas has lured visitors to this region since olden times. And being the world's highest mountain chain, it constitutes the greatest attraction to climbers and trekkers throughout the world. But more than anything else, the Himalayas represent the awe-inspiring power, beauty, and grandeur of Nature. The Himalayas extend from west to east in a massive arc for about 2500 kilometers (1550 miles). Covering an astounding area of 612,021 sq. km, the vast mountain chain passes through the Indian States of Jammu and Kashmir, Himachal Pradesh, Uttar Pradesh, Sikkim and the Himalayan kingdoms of Nepal and Bhutan. The Tibetan Plateau - the roof of the world - forms the northern boundary of this magnificent mountain system while lower extensions of the Himalayas branch off from eastern and western frontiers of these mountains. The Himalayas can be classified in a variety of ways. From south to north, the mountains can be grouped into four parallel, longitudinal mountain belts, each with its unique features and distinctive geological history.
After preprocessing, the above document looks as below.
Himalaya Sanskrit word literally means "abode snow" - hima snow alaya abode" - term coined ancient pilgrims India travelled these mountains. tibetans, indians, nepalese, many other inhabitants himalayas, mountains continue predominant factor their lives. Beauty Himalayas lured visitors region since olden times. Being world highest mountain chain, constitutes greatest attraction climbers trekkers throughout world. More than anything else, Himalayas represent -inspiring power, beauty, grandeur nature. Himalayas extend west east massive 2500 kilometers (1550 miles covering astounding area , sq. km, vast mountain chain passes through Indian states Jammu Kashmir, himachal pradesh, uttar pradesh, sikkim Himalayan kingdoms Nepal bhutan tibetan plateau roof world forms northern boundary magnificent mountain system while lower extensions Himalayas branch eastern western frontiers these mountains. Himalayas classified variety ways. South north, mountains grouped into four parallel, longitudinal mountain belts, each unique features distinctive geological history.
These preprocessed documents are then clustered based on equation (2). Depending on the values obtained from equation (2), the documents are clustered. As a result of this process, three clusters are formed for the corpus with the above specification. The first cluster contains 20 documents with details about mountains, whereas the second and third clusters have 15 documents each, holding information about forests and rivers respectively. Table 1 provides information about the clusters formed and the number of documents present in each cluster for the given corpus.
Table 1: Clustering details
Number of Documents
The performance of the CQW technique is measured based on processing time and accuracy. In total, five experiments are conducted with various corpus sizes. In the first experiment the corpus contains 10 documents, and the size is gradually increased to 20, 30, 40, and 50 for the second, third, fourth, and fifth tests. The documents for the first four tests are selected randomly from the fifty documents. The results are also compared with a standard clustering technique, the SVM weighted technique. Figure 7 shows the processing time and Figure 8 the accuracy of both the existing and proposed methods.
From Figures 7 and 8 it is clear that the time required by CQW to cluster the documents is much lower than that of the existing SVM weighted technique, and its prediction accuracy is also higher. Therefore, the proposed CQW clusters the documents in the corpus in less time and with higher accuracy.
Figure 7: Processing Time for CQW vs. SVM
Figure 8: Accuracy Rate for CQW vs. SVM
Hashing and Information Retrieval
On providing the query "Himalayan Mountain", the information retrieval component redirects the query to cluster C1, which contains 20 documents. These documents are indexed using multilevel hashing along with the frequency of the query's search terms. The search terms are detected by the FPD technique; for "Himalayan Mountain" they are: snow abode, world's highest mountain chain, awe-inspiring power, magnificent mountain, massive mountain, world highest, world's highest peaks, permanent ice, extreme cold, fold mountain, Indo-Australian plate, fresh water, large perennial rivers, Indus Basin, and Ganges-Brahmaputra Basin. These search terms are looked up in the hash table and their associated documents are retrieved.

The time taken to build the hash table and the time consumed in accessing the documents without and with indexing are given in Tables 2 and 3, respectively. The performance with and without indexing is compared by varying the size of the corpus as in the previous section. The time values in Tables 2 and 3 are measured in milliseconds; both tables show that the average search time and sorting time are lower for the indexed documents.

To retrieve the related documents, weights are computed for the indexed documents in the corresponding cluster. For the above query, FPD retrieves 8 documents for these search terms when the corpus contains 50 documents. These documents are retrieved through weight computation over the index built for the generated search terms listed above. The computed weights for the retrieved documents are shown in Table 4. The weight is computed from equation (9), where the search term's weight and occurrences are computed from equations (7) and (8), respectively.
Table 4: Document score
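The lookup-and-score step described above can be sketched as follows. The paper's multilevel hashing scheme and equations (7)–(9) are not reproduced in this section, so this is a minimal single-level sketch: a Python dict serves as the hash table mapping each search term to the documents containing it, and a document's score is the weighted sum of its matched search-term occurrences (a stand-in for equation (9), with the assumed term weights playing the role of equations (7) and (8)). All document IDs and weights are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Hash-table index: search term -> {doc_id: occurrences}.
    A single-level stand-in for the paper's multilevel hashing."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def retrieve(index, search_terms, term_weight):
    """Score each indexed document as the weighted sum of matched
    search-term occurrences, then return documents best-first."""
    scores = defaultdict(float)
    for term in search_terms:
        for doc_id, occ in index.get(term, {}).items():
            scores[doc_id] += term_weight.get(term, 1.0) * occ
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical documents represented by their extracted search terms.
docs = {
    "d1": ["snow abode", "fold mountain", "snow abode"],
    "d2": ["fresh water", "indus basin"],
    "d3": ["fold mountain", "extreme cold"],
}
index = build_index(docs)
print(retrieve(index, ["snow abode", "fold mountain"], {"snow abode": 2.0}))
# → [('d1', 5.0), ('d3', 1.0)]
```

Only documents containing at least one search term receive a score, which matches the behavior described above where the hash table directly yields the associated documents per term.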
The performance of the retrieval technique FPD is analyzed for 5 different queries and measured in terms of accuracy rate, retrieval time, precision, and recall. FPD is compared with a Latent Semantic Analysis based approach. Precision expresses the fraction of retrieved documents that are relevant to the query, and can be computed from equation (13). Figure 9 compares the precision of the proposed and existing techniques; the graph shows that the proposed technique achieves a higher precision value than the existing technique.
Figure 9: Precision comparison for Proposed vs. Existing
Likewise, recall, the fraction of relevant documents that are actually retrieved, is determined from equation (14); it is another factor for measuring the efficiency of an information retrieval technique. Figure 10 shows the comparison between the proposed and existing techniques; the recall value of the proposed technique is clearly higher, as can be observed in Figure 10.
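Under the standard definitions just given, equations (13) and (14) can be computed directly; the document IDs below are illustrative.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant (equation (13))."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved (equation (14))."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = ["d1", "d2", "d3", "d4"]  # hypothetical retrieval result
relevant = ["d1", "d3", "d5"]         # hypothetical ground truth
print(precision(retrieved, relevant))  # → 0.5
print(recall(retrieved, relevant))     # → 0.666...
```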
Table 2: Performance of retrieval without indexing (times in milliseconds)

Number of Documents in the corpus | Time taken to build hash table | Minimum Access Time | Average Search Time | Maximum Access Time | Total Access Time
Table 3: Performance of retrieval with indexing (times in milliseconds)

Number of Documents in the corpus | Time taken to build hash table | Minimum Access Time | Average Search Time | Maximum Access Time | Total Access Time
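The gap measured in Tables 2 and 3 can be reproduced in miniature: without an index, every document must be scanned per query, while with a hash-table index each search term costs a single probe. The corpus below is synthetic and purely illustrative.

```python
import time

def linear_scan(docs, term):
    """Without indexing: scan every document for the term."""
    return [doc_id for doc_id, terms in docs.items() if term in terms]

def indexed_lookup(index, term):
    """With indexing: one hash-table probe."""
    return index.get(term, [])

# Synthetic corpus: 50 documents, each with one unique and one shared term.
docs = {f"d{i}": {f"term{i}", "shared"} for i in range(50)}

# Build the index, timing it as an analogue of Tables 2/3's
# "Time taken to build hash table" column.
t0 = time.perf_counter()
index = {}
for doc_id, terms in docs.items():
    for term in terms:
        index.setdefault(term, []).append(doc_id)
build_time = time.perf_counter() - t0

# Both strategies return the same documents; only the cost differs.
assert set(linear_scan(docs, "shared")) == set(indexed_lookup(index, "shared"))
print(f"index built in {build_time * 1000:.3f} ms")
```

The one-time build cost is amortized over all subsequent queries, which is why the indexed access times in Table 3 stay lower as the corpus grows.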
Figure 10: Recall comparison for FPD vs. Latent Semantic
Similarly, the accuracy rate and time consumption are also measured and compared for both FPD and the existing technique. Figure 11 shows that the proposed FPD technique retrieves more of the related documents from the corpus for a given user query than the latent-semantic-based approach. In addition, the proposed technique consumes less time to retrieve the pertinent documents, as can be observed from the graph in Figure 12.
Figure 11: Accuracy rate for Proposed vs. Existing
The documents retrieved in the above section are ranked through the SD technique using equation (12), which computes a score for each retrieved document. The SD technique is compared with the SIM approach on computation time and accuracy. Figure 13 shows the time consumed by both the proposed and existing techniques to rank the documents; the SD technique consumes less time than the existing SIM approach. Time is measured in seconds.
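Assuming the document scores have already been computed (equation (12) itself is not reproduced here), the ranking step reduces to sorting by score; the sketch below uses hypothetical scores.

```python
def rank(doc_scores):
    """Rank retrieved documents by their computed score, highest first
    (the scores stand in for the output of equation (12))."""
    return sorted(doc_scores, key=doc_scores.get, reverse=True)

scores = {"d1": 0.82, "d2": 0.17, "d3": 0.64}  # hypothetical document scores
print(rank(scores))  # → ['d1', 'd3', 'd2']
```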
Figure 12: Time consumption for FPD vs. Latent Semantic
Figure 13: Time requirement for SD vs. SIM approach
Similar to Figure 13, Figure 14 describes the accuracy of the two ranking techniques. The graph in Figure 14 shows that the proposed SD technique ranks the documents more accurately than the existing technique.
Figure 14: Accuracy rate for SD vs. SIM approach
So far, each component of SFAIR has been evaluated individually against its existing counterpart. It remains to check the overall performance of the SFAIR technique. The following figures show the accuracy of SFAIR, measured as a percentage, and the contribution of each component to retrieving related documents. Figure 15(a) shows the overall performance of SFAIR together with the contribution of the CQW technique. Similarly, Figures 15(b), 15(c), and 15(d) show that without indexing, FPD, and SD respectively, the overall performance of the SFAIR architecture degrades considerably. This implies that each component contributes to the overall SFAIR process.
Figure 15: Accuracy
In addition to accuracy, the time required to retrieve the documents is also measured; the contributions of the CQW, indexing, FPD, and SD techniques are shown in Figures 16(a), 16(b), 16(c), and 16(d).
Figure 16: Time requirement
Geospatial information retrieval builds on two well-established areas, namely Geographic Information Systems and Information Retrieval. This article presented a framework termed SFAIR for efficient retrieval of relevant documents from a corpus. The SFAIR framework consists of four components that support both searching and indexing of information. The CQW technique is proposed to cluster the documents, and the clustered documents are indexed using multilevel hashing for fast searching and access. In addition, the FPD technique is proposed to retrieve the documents pertinent to the user query, and the retrieved documents are ranked through the SD approach.
The components of the SFAIR framework are evaluated using a corpus of fifty documents and compared with their corresponding existing techniques, with accuracy and time as the major factors. The clustering component CQW is compared with the SVM weighted technique, and the results show that CQW outperforms it. Similarly, the proposed IR technique FPD and the ranking approach SD are compared with the Latent Semantic technique and the SIM approach, respectively; the result graphs show that FPD retrieves the most related documents and that SD ranks the documents better than SIM. Both FPD and SD also consume less time than the existing techniques. Moreover, the contribution of each component to the performance and efficiency of the SFAIR framework is analyzed individually.
In future work, it is planned to make the SFAIR framework adaptable to all types of documents, not only geospatial documents.