The XML becoming a ubiquitous language for data interoperability purposes in various domains, efficiently querying XML data is a critical issue. This has lead to the design of algebraic frameworks based on tree-shaped patterns akin to the tree-structured data model of XML. Tree patterns are graphic representations of queries over data trees. They are actually matched against an input data tree to answer a query.
Research effort has been focusing on tree pattern models and matching optimization (a primordial issue). This paper is a comprehensive survey of these topics, in which we outline and compare the various features of tree patterns. We also review and discuss the two main families of approaches for optimizing tree pattern matching, namely pattern tree minimization and holistic matching. We finally present actual tree pattern-based developments, to provide a global overview of this significant research topic. A Survey of XML Tree Patterns
Efficiently evaluating path expressions in a tree-structured data model such as XML’s is crucial for the overall performance of any query engine. Initial efforts that mapped XML documents into relational databases queried with SQL induced costly table joins. Thus, algebraic approaches based on tree-shaped patterns became popular for evaluating XML processing natively instead. Tree algebras indeed provide a formal framework for query expression and optimization, in a way similar to relational algebra with respect to the SQL language. In this context, a tree pattern (TP), also called pattern tree or tree pattern query (TPQ) in the literature, models a user query over a data tree. Simply put, a tree pattern is a graphic representation that provides an easy and intuitive way of specifying the interesting parts from an input data tree that must appear in query output. More formally, a TP is matched against a tree-structured database to answer a query. The upper left-hand side part of the figure features a simple XML document (a book catalog), and the lower left-hand side a sample XQuery that may be run against this document (and that returns the title and author of each book). The tree representations of the XML document and the associated query are featured on the upper and lower right-hand sides respectively. At the tree level, answering the query translates in matching the TP against the data tree. This process can be optimized and outputs a data tree that is eventually translated back as an XML document.
The input data tree when performing actual matching operations. The initial binary join-based approach for matching the tremendous number of holistic matching algorithms proposed in the literature, it is quite impossible to review them all. Hence, we aim in the following sections at presenting the most influential. Many labeling schemes have been proposed in the literature.
We focus in this section on the region encoding (or containment) and the Dewey ID (or prefix) labeling schemes that are used in holistic approaches. However, other approaches do exist, based on a tree-traversal order prime numbers or a combination of structural index and inverted lists [58], for instance. Various holistic algorithms actually achieve TP matching, but they all exploit a data list that, for each node, contains all labels of nodes of the same type.
We first review the approaches based on the region encoding scheme, which were first proposed, and then the approaches based on the Dewey ID scheme. These approaches aim at avoiding repeated access to the input data tree. Thus, they exploit structural summaries similar to the Data Guide proposed for semi structured documents.