Package org.sindice.siren.search.node

Programmatic API to search node-based inverted indexes.

Introduction

This package contains the API for building queries that search JSON data in node-based inverted indexes. For an introduction to Lucene's search API, see the org.apache.lucene.search package documentation.

Search Basics

Lucene's Query API provides complex querying capabilities to search for documents. SIREn's NodeQuery API extends these capabilities to search for nodes as well as documents: the information retrieved consists not only of the matching documents, but also of the matching nodes within these documents.

SIREn offers a wide variety of NodeQuery implementations. Most of them are similar to the ones provided by Lucene's Query API. For example, while Lucene provides a TermQuery implementation to search for documents that contain a specific term, SIREn provides a NodeTermQuery implementation to search for nodes and documents that contain a specific term.

Level and Range Constraints

The NodeQuery class provides methods to set constraints on the nodes matched by the query. There are two types of constraints: level constraints and range constraints.

Query Classes

NodeTermQuery

A NodeTermQuery matches all the nodes that contain the specified Term, which is a word that occurs in a certain Field containing JSON data.

Constructing a NodeTermQuery is as simple as:

      NodeTermQuery tq = new NodeTermQuery(new Term("json-field", "term"));
 
In this example, the NodeQuery identifies all Documents that have the Field named "json-field" where a node contains the word "term".

NodePhraseQuery

A NodePhraseQuery matches all the nodes containing the specified phrase. A phrase is defined as a sequence of Terms.

NodeBooleanQuery

A NodeBooleanQuery matches all the nodes that satisfy the specified boolean combination of queries. A NodeBooleanQuery contains multiple NodeBooleanClauses, where each clause contains a sub-query (a NodePrimitiveQuery instance) and an operator (from NodeBooleanClause.Occur) describing how that sub-query is combined with the other clauses. The semantics of NodeBooleanClause.Occur are identical to those of Lucene's BooleanClause.Occur.
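The clause semantics can be illustrated with a small, library-free sketch. This is not SIREn code: the OccurSketch class, its Occur enum and the matches method are hypothetical names, each sub-query is reduced to a single term, and a node is modelled as the set of terms it contains. The sketch assumes the usual Lucene rules: every MUST clause has to match, no MUST_NOT clause may match, and when there is no MUST clause at least one SHOULD clause has to match.

```java
import java.util.List;
import java.util.Set;

// Hypothetical toy model of boolean clause semantics; not the SIREn API.
final class OccurSketch {
    enum Occur { MUST, SHOULD, MUST_NOT }

    // A clause pairs a term (standing in for a sub-query) with an operator.
    static final class Clause {
        final String term;
        final Occur occur;
        Clause(String term, Occur occur) { this.term = term; this.occur = occur; }
    }

    // Returns true if the node, modelled as a set of terms, satisfies the clauses.
    static boolean matches(Set<String> nodeTerms, List<Clause> clauses) {
        boolean hasMust = false;
        boolean anyShould = false;
        for (Clause c : clauses) {
            boolean present = nodeTerms.contains(c.term);
            switch (c.occur) {
                case MUST:
                    hasMust = true;
                    if (!present) return false;   // a required clause is missing
                    break;
                case MUST_NOT:
                    if (present) return false;    // a prohibited clause is present
                    break;
                case SHOULD:
                    anyShould |= present;
                    break;
            }
        }
        // Without any MUST clause, at least one SHOULD clause has to match.
        return hasMust || anyShould;
    }

    public static void main(String[] args) {
        Set<String> node = Set.of("siren", "lucene");
        List<Clause> query = List.of(
            new Clause("siren", Occur.MUST),
            new Clause("solr", Occur.MUST_NOT));
        System.out.println(matches(node, query)); // prints "true"
    }
}
```

Note that, under these rules, a combination made only of MUST_NOT clauses matches nothing, because there is no positive clause to select candidate nodes.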

NodeTermRangeQuery

A NodeTermRangeQuery matches all nodes containing a term that falls within the range, inclusive or exclusive, between a lower Term and an upper Term. Terms are ordered according to TermsEnum.getComparator(). It is not intended for numerical ranges; use NodeNumericRangeQuery instead.
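The following sketch shows why a term range is unsuitable for numbers. It is an illustration only, not SIREn code: TermRangeSketch and inRange are hypothetical names, and String.compareTo is used as a stand-in for the term comparator of the index, which is an assumption that holds for plain ASCII terms.

```java
// Hypothetical illustration of a lexicographic term range; not the SIREn API.
final class TermRangeSketch {
    // Tests whether term lies between lower and upper, using String.compareTo
    // as a stand-in for the term comparator of the index (an assumption).
    static boolean inRange(String term, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        int lo = term.compareTo(lower);
        int hi = term.compareTo(upper);
        boolean aboveLower = includeLower ? lo >= 0 : lo > 0;
        boolean belowUpper = includeUpper ? hi <= 0 : hi < 0;
        return aboveLower && belowUpper;
    }

    public static void main(String[] args) {
        // Words behave as expected.
        System.out.println(inRange("banana", "apple", "cherry", true, true)); // true
        // Numbers written as text do not: "10" sorts before "2".
        System.out.println(inRange("10", "2", "20", true, true));             // false
        // Conversely, "3" sorts between "10" and "50".
        System.out.println(inRange("3", "10", "50", true, true));             // true
    }
}
```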

NodeNumericRangeQuery

A NodeNumericRangeQuery matches all nodes containing a value that falls within a numeric range. For NodeNumericRangeQuery to work, the values must be indexed with datatypes that are configured with the appropriate numeric analyzers (NumericAnalyzer).

NodePrefixQuery, NodeWildcardQuery, NodeRegexpQuery

A NodePrefixQuery matches all nodes containing terms that begin with the specified string. A NodeWildcardQuery generalizes this by allowing the use of the + (matches one or more characters), * (matches zero or more characters) and ? (matches exactly one character) wildcards. Note that a NodeWildcardQuery can be quite slow. In particular, a pattern should not start with +, * or ?, as leading wildcards are extremely slow. Some QueryParsers may not allow leading wildcards by default, but provide a setAllowLeadingWildcard method to remove that protection. The NodeRegexpQuery is even more general than NodeWildcardQuery: it matches all nodes with terms that match a regular expression pattern.
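The meaning of the three wildcards can be sketched by translating a wildcard pattern into a java.util.regex pattern. This is an illustration of the matching rules only, under the assumption that every other character is taken literally; WildcardSketch, toRegex and matches are hypothetical names, and the sketch says nothing about how SIREn evaluates such queries against the index.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the wildcard rules; not the SIREn implementation.
final class WildcardSketch {
    // Translates '+', '*' and '?' to regex operators and quotes everything else.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '+' || c == '*' || c == '?') {
                if (literal.length() > 0) {
                    // Flush the pending literal run so that it is matched verbatim.
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '+' ? ".+" : c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // Returns true if the whole term matches the wildcard pattern.
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("te*", "te"));   // true: * accepts zero characters
        System.out.println(matches("te+", "te"));   // false: + needs at least one
        System.out.println(matches("t?rm", "term")); // true: ? accepts exactly one
    }
}
```

In this model a prefix query for "te" corresponds to the wildcard pattern "te*".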

NodeFuzzyQuery

A NodeFuzzyQuery matches nodes that contain terms similar to the specified term. Similarity is determined using Levenshtein (edit) distance.
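As a reminder of what the Levenshtein distance measures, here is a minimal sketch of the classic dynamic-programming computation. It only illustrates the similarity measure: LevenshteinSketch and distance are hypothetical names, and the sketch does not reflect how NodeFuzzyQuery finds similar terms in the index.

```java
// Hypothetical illustration of the Levenshtein (edit) distance; not SIREn code.
final class LevenshteinSketch {
    // Returns the minimum number of single-character insertions, deletions
    // and substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating new ones
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("term", "team"));      // 1
        System.out.println(distance("kitten", "sitting")); // 3
    }
}
```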

TwigQuery

A TwigQuery combines NodeQuery instances through a parent-child or ancestor-descendant relation. It is the basic building block for tree-shaped queries.

A TwigQuery is composed of a root and of one or more children or descendants.

A twig query is always associated with a level. If no level is specified, the level defaults to 1. When a twig query is used as a child or descendant of another twig query, its level is automatically updated according to the level of the parent twig query. For example, given the following instructions:

      TwigQuery tw1 = new TwigQuery();
      TwigQuery tw2 = new TwigQuery();
      tw1.addChild(tw2, Occur.MUST);
 
In this example, the first twig query tw1 is defined at the default level 1. The second twig query tw2, after the call to TwigQuery.addChild(NodeQuery, org.sindice.siren.search.node.NodeBooleanClause.Occur), has its level updated to 2, since it is now a child of a twig query at level 1.
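The level update can be pictured with a toy model. The class below is hypothetical and is not the SIREn TwigQuery: TwigLevelSketch, getLevel, addChild and setLevel are names chosen for illustration, and the recursive update of an already attached subtree is an assumption about how such propagation could work.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of level propagation; not the SIREn TwigQuery class.
final class TwigLevelSketch {
    private int level = 1; // default level when none is specified
    private final List<TwigLevelSketch> children = new ArrayList<>();

    int getLevel() {
        return level;
    }

    // Attaches a child and places it, with its subtree, one level below this twig.
    void addChild(TwigLevelSketch child) {
        children.add(child);
        child.setLevel(level + 1);
    }

    // Sets the level of this twig and pushes the change down to its children.
    private void setLevel(int newLevel) {
        level = newLevel;
        for (TwigLevelSketch c : children) {
            c.setLevel(newLevel + 1);
        }
    }

    public static void main(String[] args) {
        TwigLevelSketch tw1 = new TwigLevelSketch();
        TwigLevelSketch tw2 = new TwigLevelSketch();
        tw1.addChild(tw2);
        System.out.println(tw1.getLevel()); // 1
        System.out.println(tw2.getLevel()); // 2
    }
}
```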

The Scorer Class

The NodeScorer abstract class provides common scoring functionality for all the node scorer implementations, which are the heart of the SIREn scoring process.

The implementation of the query processing framework follows a node-at-a-time approach, where the query operators (i.e., NodeScorer) process one node at a time. The query processing framework has been designed for high efficiency processing:

  1. All the query operators leverage as much as possible the lazy-loading feature of the Siren10PostingsReader. For example, the NodeScorer interface has no concept of next matching document (i.e., DocIdSetIterator.nextDoc()), but instead a concept of next candidate document (i.e., NodeScorer.nextCandidateDocument()). This enables NodeConjunctionScorer to efficiently iterate over the document identifiers without having to decode the node labels until a potential candidate is found.
  2. The node label array (i.e., IntsRef) being processed is the same in all the query operators, which means that the same array is reused across operators and no new arrays are created during query processing.
  3. The node label array is itself a slice of the array of the uncompressed node block. The node label array is created by sliding a window (i.e., IntsRef) over the array of the uncompressed node block.
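The first point above can be sketched with a simplified conjunction over two posting lists. Everything in this sketch is hypothetical: CandidateSketch, decodeNodes and conjunction are illustrative names, node labels are reduced to single integers, and the "decoding" step merely counts how often node data is touched. It shows the idea of advancing on document identifiers alone and decoding nodes only for candidate documents; it is not the NodeConjunctionScorer implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of candidate-document iteration with lazy node decoding.
final class CandidateSketch {
    static int decodeCount = 0; // number of times node data was decoded

    // Stand-in for decoding the node labels of one document from a compressed block.
    static int[] decodeNodes(int[][] nodesPerDoc, int pos) {
        decodeCount++;
        return nodesPerDoc[pos];
    }

    // Returns {doc, node} pairs present in both lists. docsA and docsB are sorted
    // document ids; nodesA[i] and nodesB[i] hold the sorted node ids of the i-th document.
    static List<int[]> conjunction(int[] docsA, int[][] nodesA,
                                   int[] docsB, int[][] nodesB) {
        List<int[]> hits = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < docsA.length && j < docsB.length) {
            if (docsA[i] < docsB[j]) {
                i++; // not a candidate: skipped without decoding any node
            } else if (docsA[i] > docsB[j]) {
                j++; // not a candidate: skipped without decoding any node
            } else {
                // Candidate document found: only now are the node labels decoded.
                int[] a = decodeNodes(nodesA, i);
                int[] b = decodeNodes(nodesB, j);
                int x = 0;
                int y = 0;
                while (x < a.length && y < b.length) {
                    if (a[x] < b[y]) {
                        x++;
                    } else if (a[x] > b[y]) {
                        y++;
                    } else {
                        hits.add(new int[] {docsA[i], a[x]});
                        x++;
                        y++;
                    }
                }
                i++;
                j++;
            }
        }
        return hits;
    }
}
```

With document lists {1, 3, 5, 8} and {3, 4, 8}, only documents 3 and 8 are candidates, so node data is decoded four times rather than once per posting.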

Copyright © 2014. All rights reserved.