Top-k query is an important operation to return a set of interesting points in a potentially huge data space. It is analyzed in this paper that the existing algorithms cannot process top-k query on massive data efficiently. This paper proposes a novel table-scan-based T2S algorithm to efficiently compute top-k results on massive data. T2S first constructs the presorted table, whose tuples are arranged in the order of the round-robin retrieval on the sorted lists. T2S maintains only fixed number of tuples to compute results. The early termination checking for T2S is presented in this paper, along with the analysis of scan depth. The selective retrieval is devised to skip the tuples in the presorted table which are not top-k results. The theoretical analysis proves that selective retrieval can reduce the number of the retrieved tuples significantly. The construction and incremental-update/batch-processing methods for the used structures are proposed. Efficient Top-k Retrieval on Massive Data
To its practical importance, top-k query has attracted extensive attention. The existing top-k algorithms can be classified into three types: indexbased methods view-based methods and sorted-list-based methods . Index-based methods (or view-based methods) make use of the pre-constructed indexes or views to process top-k query.
A concrete index or view is constructed on a specific subset of attributes, the indexes or views of exponential order with respect to attribute number have to be built to cover the actual queries, which is prohibitively expensive. The used indexes or views can only be built on a small and selective set of attribute combinations.
Sorted-list-based methods retrieve the sorted lists in a round-robin fashion, maintain the retrieved tuples, update their lower-bound and upper-bound scores. When the kth largest lower-bound score is not less than the upper-bound scores of other candidates, the k candidates with the largest lower-bound scores are top-k results.
Sorted-list-based methods compute topk results by retrieving the involved sorted lists and naturally can support the actual queries. However, it is analyzed in this paper that the numbers of tuples retrieved and maintained in these methods increase exponentially with attribute number, increase polynomially with tuple number and result size.
Ranking is a central part of many information retrieval problems, such as document retrieval, collaborative filtering, sentiment analysis, computational advertising (online ad placement).
Training data consists of queries and documents matching them together with relevance degree of each match. It may be prepared manually by human assessors (or raters, as Google calls them), who check results for some queries and determine relevance of each result. It is not feasible to check relevance of all documents, and so typically a technique called pooling is used only the top few documents, retrieved by some existing ranking models are checked.
Typically, users expect a search query to complete in a short time (such as a few hundred milliseconds for web search), which makes it impossible to evaluate a complex ranking model on each document in the corpus, and so a two-phase scheme is used.