org.sindice.siren.search.node (siren-aggregator 1.0 API)

Class Summary
Class	Description
LuceneProxyNodeQuery	Class that acts as a bridge between the SIREn query API and the Lucene query API.
MultiNodeTermQuery	An abstract `NodePrimitiveQuery` that matches documents containing a subset of terms provided by a `FilteredTermEnum` enumeration.
MultiNodeTermQuery.RewriteMethod	Abstract class that defines how the query is rewritten.
NodeAutomatonQuery	A `MultiNodeTermQuery` that will match terms against a finite-state machine.
NodeBooleanClause	A clause in a `NodeBooleanQuery`.
NodeBooleanQuery	A `NodeQuery` that matches a boolean combination of primitive queries, e.g., `NodeTermQuery`s, `NodePhraseQuery`s, `NodeBooleanQuery`s, ...
NodeConstantScoreQuery	A query that wraps another query or a filter and simply returns a constant score equal to the query boost for every node that matches the filter or query.
NodeFuzzyQuery	Implements the fuzzy search query.
NodeNumericRangeQuery<T extends Number>	A `NodePrimitiveQuery` that matches numeric values within a specified range.
NodePhraseQuery	A `NodePrimitiveQuery` that matches nodes containing a particular sequence of terms.
NodePrefixQuery	A `NodePrimitiveQuery` that matches documents containing terms with a specified prefix.
NodePrimitiveQuery	Abstraction over all the primitive `NodeQuery`s such as `NodeTermQuery`.
NodeQuery	Abstract class for SIREn's node queries.
NodeRegexpQuery	A `NodePrimitiveQuery` that provides a fast regular expression matching based on the `org.apache.lucene.util.automaton` package.
NodeScorer	The abstract `Scorer` class that defines the interface for iterating over an ordered list of nodes matching a query.
NodeTermQuery	A `NodePrimitiveQuery` that matches nodes containing a term.
NodeTermRangeQuery	A `NodePrimitiveQuery` that matches documents within a range of terms.
NodeWildcardQuery	A `NodePrimitiveQuery` that implements the wildcard search query.
TupleQuery	A `NodeQuery` that matches tuples, i.e., a boolean combination of `NodeQuery` having the same parent node.
TwigQuery	A `NodeQuery` that matches a boolean combination of Ancestor-Descendant and Parent-Child node queries.
TwigQuery.EmptyRootQuery	An empty root query is used to create twig query in which the root query is not specified.

Enum Summary
Enum	Description
NodeBooleanClause.Occur	Specifies how clauses are to occur in matching documents.

Exception Summary
Exception	Description
NodeBooleanQuery.TooManyClauses	Thrown when an attempt is made to add more than `NodeBooleanQuery.getMaxClauseCount()` clauses.

Package org.sindice.siren.search.node Description

Programmatic API to search node-based inverted indexes.

Introduction

This package contains the API for building queries to search JSON data over node-based inverted indexes. For an introduction to Lucene's search API, see the org.apache.lucene.search package documentation.

Search Basics

In contrast to Lucene's Query API, which provides complex querying capabilities to search for documents, SIREn provides a NodeQuery API with complex querying capabilities to search for both nodes and documents. The information retrieved consists not only of the matching documents but also of the matching nodes within these documents.

SIREn offers a wide variety of NodeQuery implementations. Most of them are similar to the ones provided by Lucene's Query API. For example, while Lucene provides a TermQuery implementation to search documents that contain a specific term, SIREn provides a NodeTermQuery implementation to search nodes and documents that contain a specific term.

Level and Range Constraints

The NodeQuery provides methods to set constraints on the nodes matched by the query. There are two types of constraints:

Level constraint: filters out all nodes that do not belong to the specified level of the tree.
Interval constraint: filters out all nodes for which the last integer of the Dewey code vector is not contained in the specified interval.
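
The two constraints can be illustrated with a small self-contained model. This is only a sketch and does not use the SIREn classes: `DeweyConstraints` and its methods are hypothetical names. It also assumes that the level of a node equals the length of its Dewey code vector and that the interval is inclusive on both ends, neither of which is stated above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the level and interval constraints; not SIREn code. */
final class DeweyConstraints {

  /** Assumption: the level of a node is the length of its Dewey code vector. */
  static boolean matchesLevel(int[] dewey, int level) {
    return dewey.length == level;
  }

  /** Assumption: the interval [lower, upper] is inclusive on both ends. */
  static boolean matchesInterval(int[] dewey, int lower, int upper) {
    if (dewey.length == 0) {
      return false;
    }
    int last = dewey[dewey.length - 1];
    return last >= lower && last <= upper;
  }

  /** Keeps only the nodes that satisfy both constraints. */
  static List<int[]> filter(List<int[]> nodes, int level, int lower, int upper) {
    List<int[]> kept = new ArrayList<>();
    for (int[] node : nodes) {
      if (matchesLevel(node, level) && matchesInterval(node, lower, upper)) {
        kept.add(node);
      }
    }
    return kept;
  }
}
```

With the candidate nodes `[0]`, `[0, 1]`, `[0, 5]` and `[0, 1, 2]`, a level of 2 and the interval 0 to 3, only `[0, 1]` passes both filters under these assumptions.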

Query Classes

`NodeTermQuery`

A NodeTermQuery matches all the nodes that contain the specified Term, which is a word that occurs in a certain Field containing JSON data.

Constructing a NodeTermQuery is as simple as:

      NodeTermQuery tq = new NodeTermQuery(new Term("json-field", "term"));

In this example, the NodeQuery identifies all Documents that have the Field named "json-field" where a node contains the word "term".

`NodePhraseQuery`

A NodePhraseQuery matches all the nodes containing the specified phrase. A phrase is defined as a sequence of Terms.
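
As a rough illustration of what "a sequence of terms" means, the sketch below checks whether the tokens of a node contain a phrase as a contiguous run. `PhraseSketch` is a hypothetical helper, not part of SIREn, and it assumes exact adjacency with no positional slack.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical model of phrase matching over the tokens of one node; not SIREn code. */
final class PhraseSketch {

  /** Returns true if the phrase terms occur in order and next to each other. */
  static boolean containsPhrase(List<String> nodeTokens, List<String> phrase) {
    if (phrase.isEmpty()) {
      return false; // assumption: an empty phrase matches nothing
    }
    return Collections.indexOfSubList(nodeTokens, phrase) >= 0;
  }
}
```

Under this model, the tokens `semantic`, `information`, `retrieval` contain the phrase `information retrieval` but not `retrieval information`.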

`NodeBooleanQuery`

A NodeBooleanQuery matches all the nodes containing the specified boolean combination of queries. A NodeBooleanQuery contains multiple NodeBooleanClauses, where each clause contains a sub-query (a NodePrimitiveQuery instance) and an operator (from NodeBooleanClause.Occur) describing how that sub-query is combined with the other clauses. The semantics of NodeBooleanClause.Occur are identical to those of Lucene's BooleanClause.Occur.
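
To make the operator semantics concrete, here is a minimal self-contained model of how MUST, SHOULD and MUST_NOT clauses combine, following the usual Lucene behaviour. It is a sketch only: `OccurSketch` is a hypothetical class, and sub-queries are modelled as plain predicates over the text of a node rather than as NodePrimitiveQuery instances.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical model of MUST / SHOULD / MUST_NOT combination; not SIREn code. */
final class OccurSketch {

  enum Occur { MUST, SHOULD, MUST_NOT }

  /** A clause pairs a sub-query (here a predicate over node text) with an operator. */
  static final class Clause {
    final Predicate<String> query;
    final Occur occur;

    Clause(Predicate<String> query, Occur occur) {
      this.query = query;
      this.occur = occur;
    }
  }

  private final List<Clause> clauses = new ArrayList<>();

  OccurSketch add(Predicate<String> query, Occur occur) {
    clauses.add(new Clause(query, occur));
    return this;
  }

  /**
   * Every MUST clause has to match and no MUST_NOT clause may match.
   * When there is no MUST clause, at least one SHOULD clause has to match.
   */
  boolean matches(String node) {
    boolean hasMust = false;
    boolean anyShould = false;
    for (Clause c : clauses) {
      boolean hit = c.query.test(node);
      switch (c.occur) {
        case MUST:
          hasMust = true;
          if (!hit) return false;
          break;
        case MUST_NOT:
          if (hit) return false;
          break;
        case SHOULD:
          anyShould |= hit;
          break;
      }
    }
    return hasMust || anyShould;
  }
}
```

In this model a query made only of MUST_NOT clauses matches nothing, which mirrors how purely negative boolean queries behave in Lucene.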

`NodeTermRangeQuery`

A NodeTermRangeQuery matches all nodes containing a term that occurs in the inclusive or exclusive range of a lower Term and an upper Term according to TermsEnum.getComparator(). It is not intended for numerical ranges; use NodeNumericRangeQuery instead.
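
The reason term ranges are unsuitable for numbers can be seen in a short sketch. It uses `String.compareTo` as a stand-in for the term comparator, which is an assumption made for illustration; `RangeOrderSketch` is a hypothetical class, not part of SIREn.

```java
/** Hypothetical comparison of term-order and numeric-order range checks; not SIREn code. */
final class RangeOrderSketch {

  /** Inclusive range check in term order (String.compareTo stands in for the term comparator). */
  static boolean inTermRange(String term, String lower, String upper) {
    return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
  }

  /** Inclusive range check in numeric order. */
  static boolean inNumericRange(long value, long lower, long upper) {
    return value >= lower && value <= upper;
  }

  public static void main(String[] args) {
    // As a term, "5" sorts after "10", so it is outside the term range ["1", "10"].
    System.out.println(inTermRange("5", "1", "10"));
    // As a number, 5 lies between 1 and 10.
    System.out.println(inNumericRange(5, 1, 10));
  }
}
```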

`NodeNumericRangeQuery`

A NodeNumericRangeQuery matches all nodes containing a value that occurs in a numeric range. For NodeNumericRangeQuery to work, the values must be indexed with datatypes that are configured with the appropriate numeric analyzers (NumericAnalyzer).

`NodePrefixQuery`, `NodeWildcardQuery`, `NodeRegexpQuery`

A NodePrefixQuery matches all nodes containing terms that begin with the specified string. A NodeWildcardQuery generalizes this by allowing the use of the + (matches 1 or more characters), * (matches 0 or more characters) and ? (matches exactly one character) wildcards. Note that a NodeWildcardQuery can be quite slow. In particular, its pattern should not start with +, * or ?, as such queries are extremely slow. Some QueryParsers may not allow leading wildcards by default, but provide a setAllowLeadingWildcard method to remove that protection. The NodeRegexpQuery is even more general than NodeWildcardQuery, matching all nodes with terms that match a regular expression pattern.
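
The wildcard syntax described above can be sketched by translating a pattern into a `java.util.regex.Pattern`. This only illustrates the matching rules for +, * and ?. It is not how SIREn evaluates the query, and `WildcardSketch` is a hypothetical class.

```java
import java.util.regex.Pattern;

/** Hypothetical translation of the wildcard syntax into java.util.regex; not SIREn code. */
final class WildcardSketch {

  static Pattern toPattern(String wildcard) {
    StringBuilder regex = new StringBuilder();
    StringBuilder literal = new StringBuilder();
    for (int i = 0; i < wildcard.length(); i++) {
      char c = wildcard.charAt(i);
      if (c == '*' || c == '+' || c == '?') {
        flushLiteral(literal, regex);
        if (c == '*') {
          regex.append(".*");   // 0 or more characters
        } else if (c == '+') {
          regex.append(".+");   // 1 or more characters
        } else {
          regex.append('.');    // exactly one character
        }
      } else {
        literal.append(c);
      }
    }
    flushLiteral(literal, regex);
    return Pattern.compile(regex.toString(), Pattern.DOTALL);
  }

  /** Quotes pending literal text so that regex metacharacters are matched verbatim. */
  private static void flushLiteral(StringBuilder literal, StringBuilder regex) {
    if (literal.length() > 0) {
      regex.append(Pattern.quote(literal.toString()));
      literal.setLength(0);
    }
  }

  static boolean matches(String wildcard, String term) {
    return toPattern(wildcard).matcher(term).matches();
  }

  /** A prefix match corresponds to the wildcard pattern "prefix*". */
  static boolean hasPrefix(String prefix, String term) {
    return term.startsWith(prefix);
  }
}
```

A pattern with a leading wildcard, such as `*ing`, cannot narrow the search by a fixed prefix, which gives some intuition for why such patterns are discouraged.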

`NodeFuzzyQuery`

A NodeFuzzyQuery matches nodes that contain terms similar to the specified term. Similarity is determined using Levenshtein (edit) distance.
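
For readers unfamiliar with the measure, the following sketch computes the plain Levenshtein distance between two terms with the standard dynamic-programming recurrence. `EditDistanceSketch`, `isSimilar` and the `maxEdits` threshold are hypothetical names used for illustration. The actual query implementation may compute similarity differently, for example with automata.

```java
/** Hypothetical illustration of Levenshtein (edit) distance; not SIREn code. */
final class EditDistanceSketch {

  /** Minimum number of single-character insertions, deletions and substitutions. */
  static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j; // turning the empty prefix of a into b[0..j) takes j insertions
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i; // turning a[0..i) into the empty string takes i deletions
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
      }
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }

  /** Assumption: two terms are "similar" when they are within maxEdits edits. */
  static boolean isSimilar(String term, String candidate, int maxEdits) {
    return levenshtein(term, candidate) <= maxEdits;
  }
}
```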

`TwigQuery`

A TwigQuery combines NodeQuery instances through Parent-Child or Ancestor-Descendant relations. It is the basic building block for tree-shaped queries.

A TwigQuery is composed of a root and of one or more children or descendants:

The root is a NodeQuery instance. An empty root is considered as a wildcard node query and will match all nodes. We call "root nodes" the set of nodes that are retrieved by the root query.
A descendant is a NodeQuery associated with an operator (from NodeBooleanClause.Occur). A descendant query will match all the nodes for which there exists a path to a root node. A descendant is associated with a node level, which corresponds to the relative distance (in terms of levels) from the root.
A child is a descendant whose level is exactly one greater than the root level.

A twig query is always associated with a level. If no level is specified, the level defaults to 1. When a twig query is used as a child or descendant of another twig query, its level is automatically updated according to the level of the parent twig query. For example, given the following instructions:

      TwigQuery tw1 = new TwigQuery();
      TwigQuery tw2 = new TwigQuery();
      tw1.addChild(tw2, Occur.MUST);

In this example, the first twig query tw1 is defined at the default level 1. The second twig query tw2, after the call to TwigQuery.addChild(NodeQuery, org.sindice.siren.search.node.NodeBooleanClause.Occur), will have its level updated to 2, since it is now a child of a twig query at level 1.
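
The level bookkeeping can be modelled independently of SIREn. The sketch below is hypothetical: `TwigLevelSketch` is not the TwigQuery class, its methods take no Occur operator, and it assumes that a level change is pushed down recursively to already attached twigs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of twig level propagation; this is not the SIREn TwigQuery class. */
final class TwigLevelSketch {

  private int level = 1; // default level when none is specified
  private final List<TwigLevelSketch> nested = new ArrayList<>();
  private final List<Integer> distances = new ArrayList<>();

  int getLevel() {
    return level;
  }

  /** Attaches a twig at the given relative distance (in levels) from this root. */
  void addDescendant(int distance, TwigLevelSketch descendant) {
    if (distance < 1) {
      throw new IllegalArgumentException("distance must be at least 1");
    }
    nested.add(descendant);
    distances.add(distance);
    descendant.setLevel(level + distance);
  }

  /** A child is a descendant at distance 1. */
  void addChild(TwigLevelSketch child) {
    addDescendant(1, child);
  }

  /** Assumption: updating a level also updates every twig nested below it. */
  private void setLevel(int newLevel) {
    level = newLevel;
    for (int i = 0; i < nested.size(); i++) {
      nested.get(i).setLevel(newLevel + distances.get(i));
    }
  }
}
```

In this model, attaching tw3 as a child of tw2 and then tw2 as a child of tw1 leaves tw1 at level 1, tw2 at level 2 and tw3 at level 3.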

The Scorer Class

The NodeScorer abstract class provides common scoring functionality for all the node scorer implementations which are the heart of the SIREn scoring process.

The implementation of the query processing framework follows a node-at-a-time approach, where the query operators (i.e., NodeScorer) process one node at a time. The query processing framework has been designed for high efficiency processing:

All the query operators leverage as much as possible the lazy-loading feature of the Siren10PostingsReader. For example, the NodeScorer interface has no concept of next matching document (i.e., DocIdSetIterator.nextDoc()), but instead a concept of next candidate document (i.e., NodeScorer.nextCandidateDocument()). This enables NodeConjunctionScorer to efficiently iterate over the document identifiers without having to decode the node labels until a potential candidate is found.
The node label array (i.e., IntsRef) being processed is the same in all the query operators, which means that the same array is reused across operators and no new arrays are created during query processing.
The node label array is itself a slice of the array of the uncompressed node block. The node label array is created by sliding a window (i.e., IntsRef) over the array of the uncompressed node block.
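
The sliding-window idea in the last two points can be sketched as follows. The code is a hypothetical model, not SIREn's reader: `IntWindow` only mirrors the `ints`, `offset` and `length` fields of Lucene's IntsRef, and the layout of the uncompressed block (node labels stored back to back, with a separate array of label lengths) is an assumption made for illustration.

```java
/** Hypothetical model of a reusable window over a decoded node block; not SIREn code. */
final class NodeWindowSketch {

  /** Mirrors the shape of Lucene's IntsRef: a backing array plus offset and length. */
  static final class IntWindow {
    int[] ints;
    int offset;
    int length;
  }

  private final int[] lengths;                      // assumed: length of each node label
  private final IntWindow window = new IntWindow(); // the single window, reused for every node
  private int nextLabel = 0;                        // index of the next label to expose
  private int position = 0;                         // start of the next label in the block

  NodeWindowSketch(int[] block, int[] lengths) {
    this.lengths = lengths;
    this.window.ints = block; // the window points into the block, nothing is copied
  }

  /** Slides the window to the next node label; returns false when the block is exhausted. */
  boolean nextNode() {
    if (nextLabel >= lengths.length) {
      return false;
    }
    window.offset = position;
    window.length = lengths[nextLabel];
    position += lengths[nextLabel];
    nextLabel++;
    return true;
  }

  /** Always returns the same window object; only its offset and length change. */
  IntWindow node() {
    return window;
  }
}
```

Because every operator would read the current label through the same window object, advancing to the next node changes two integers and allocates nothing.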
