org.sindice.siren.analysis (siren-aggregator 1.0 API)

Class Summary
Class	Description
AnyURIAnalyzer	Analyzer designed to deal with any kind of URIs and perform some post-processing on URIs.
DoubleNumericAnalyzer	An implementation of the `NumericAnalyzer` for double values.
FloatNumericAnalyzer	An implementation of the `NumericAnalyzer` for float values.
IntNumericAnalyzer	An implementation of the `NumericAnalyzer` for integer values.
JsonAnalyzer	The JsonAnalyzer is especially designed to process JSON data.
JsonTokenizer	A tokenizer for data following the JSON syntax.
JsonTokenizerImpl	Json document scanner
LongNumericAnalyzer	An implementation of the `NumericAnalyzer` for long values.
NumericAnalyzer	Abstraction over the analyzer for numeric datatype.
NumericTokenizer	This class provides a TokenStream for indexing numeric values that is used in `NumericAnalyzer`.
TupleAnalyzer	Deprecated. Use `JsonAnalyzer` instead.
TupleTokenizer	Deprecated. Use `JsonTokenizer` instead.

Enum Summary
Enum	Description
AnyURIAnalyzer.URINormalisation	Types of URI normalisation

Package org.sindice.siren.analysis Description

Analyzer for indexing JSON content.

Introduction

This package extends Lucene's analysis API to support parsing and indexing JSON content. For an introduction to Lucene's analysis API, see the org.apache.lucene.analysis package documentation.

Overview of the API

This package contains concrete components (Attributes, Tokenizers and TokenFilters) for analyzing JSON content.

It provides a pre-built JSON analyzer, JsonAnalyzer, that you can use to get started quickly.

It also contains a number of NumericAnalyzers that support numeric datatypes.

SIREn's analysis API is divided into several packages:

org.sindice.siren.analysis.attributes contains a number of Attributes that are used to add metadata to a stream of tokens.
org.sindice.siren.analysis.filter contains a number of TokenFilters that alter incoming tokens.

JSON Analyzer

The JSON tokenizer parses the JSON data and converts it into an abstract tree model. The conversion is performed in streaming mode, while the data is being parsed.

The tokenizer traverses the JSON tree depth-first. During the traversal, it increments the Dewey code (i.e., the node label) whenever it encounters an object, an array, a field or a value. It attaches the current node label to every token it generates using the NodeAttribute.
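
The labelling can be sketched in a few lines of plain Java. This is not SIREn code: the Node class and the label method are hypothetical, the tree is built by hand rather than parsed from a stream, and the sketch assumes that the root object carries no label of its own, so that the first field gets the label 0 and its value the label 0.0. The actual JsonTokenizer may label nested objects and arrays differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; none of these names belong to the SIREn API.
final class DeweyLabelSketch {

  /** Hypothetical tree node: a field name or a value, with ordered children. */
  static final class Node {
    final String text;
    final List<Node> children;

    Node(String text, Node... children) {
      this.text = text;
      this.children = Arrays.asList(children);
    }
  }

  /** Depth-first traversal that returns one "label text" entry per node. */
  static List<String> label(Node root) {
    List<String> out = new ArrayList<>();
    // Assumption: the root object has no label of its own.
    for (int i = 0; i < root.children.size(); i++) {
      visit(root.children.get(i), String.valueOf(i), out);
    }
    return out;
  }

  private static void visit(Node node, String label, List<String> out) {
    out.add(label + " " + node.text);
    // Each child extends its parent's label with its own sibling index.
    for (int i = 0; i < node.children.size(); i++) {
      visit(node.children.get(i), label + "." + i, out);
    }
  }

  public static void main(String[] args) {
    // Models { "a" : "b" }: the field a has one child, the value b.
    Node root = new Node("{}", new Node("a", new Node("b")));
    for (String line : label(root)) {
      System.out.println(line);
    }
  }
}
```

Under these assumptions, running main prints "0 a" followed by "0.0 b".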

The tokenizer also attaches datatype metadata to every token it generates using the DatatypeAttribute. A datatype specifies the type of the data a node contains. By default, the tokenizer differentiates five datatypes in the JSON syntax.

The datatype metadata determines how the content of a node is analyzed. This analysis is performed by the DatatypeAnalyzerFilter. The analysis of each datatype can be configured with the method JsonAnalyzer.registerDatatype(char[], org.apache.lucene.analysis.Analyzer).
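
The idea of per-datatype analysis can be illustrated without any SIREn or Lucene class. In the sketch below, every name is hypothetical: a plain function stands in for an Analyzer, a String stands in for the char[] datatype label, and "text" and "keyword" are made-up datatype labels. Rejecting an unregistered datatype with an exception is a choice made for this sketch, not documented SIREn behaviour.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch only; it mimics per-datatype analysis
// without using any SIREn or Lucene class.
final class DatatypeDispatchSketch {

  // Maps a datatype label to the (hypothetical) analysis applied to node content.
  private final Map<String, Function<String, List<String>>> analyzers = new HashMap<>();

  /** Plays the role of registering an analyzer for a datatype label. */
  void register(String datatype, Function<String, List<String>> analyzer) {
    analyzers.put(datatype, analyzer);
  }

  /** Analyzes the content of a node with the analyzer registered for its datatype. */
  List<String> analyze(String datatype, String content) {
    Function<String, List<String>> analyzer = analyzers.get(datatype);
    if (analyzer == null) {
      // Assumption of this sketch: unknown datatypes are rejected.
      throw new IllegalArgumentException("No analyzer registered for datatype: " + datatype);
    }
    return analyzer.apply(content);
  }

  public static void main(String[] args) {
    DatatypeDispatchSketch sketch = new DatatypeDispatchSketch();
    // Lowercase and split on whitespace for "text"; keep "keyword" content whole.
    sketch.register("text", s -> Arrays.asList(s.toLowerCase(Locale.ROOT).split("\\s+")));
    sketch.register("keyword", s -> Arrays.asList(s));
    System.out.println(sketch.analyze("text", "Hello JSON World"));
    System.out.println(sketch.analyze("keyword", "Hello JSON World"));
  }
}
```

The same content yields three lowercased tokens under "text" and a single unchanged token under "keyword", which is the effect the datatype metadata is meant to achieve.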

Custom datatypes can also be used thanks to a specific datatype JSON object. The schema of the datatype JSON object is the following:

 {
   "_datatype_" : <LABEL>,
   "_value_" : <VALUE>
 }

<LABEL> is a string that names the datatype to apply to the associated value. <VALUE> is the associated value, given as a string. The datatype JSON object is a way of passing custom datatypes to SIREn. It has no influence on the label of the value node. For example, the label (i.e., 0.0) of the value b below:

 {
   "a" : "b"
 }

is the same for the value b with a custom datatype:

 {
   "a" : {
     "_datatype_" : "my datatype",
     "_value_" : "b"
   }
 }
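
The following sketch makes this invariance concrete. It is plain Java rather than SIREn code: the class, the describeValue method and the "default" datatype name are hypothetical, and the field value is modelled as either a plain string or a map holding the two keys of the datatype object.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; the class, the method and the "default" datatype
// name are hypothetical and not part of the SIREn API.
final class DatatypeObjectSketch {

  /**
   * Describes the value node of the field at the given index of the root
   * object as "label value [datatype]".
   */
  static String describeValue(int fieldIndex, Object fieldValue) {
    String datatype = "default"; // placeholder for whatever datatype applies by default
    Object value = fieldValue;
    if (fieldValue instanceof Map) {
      Map<?, ?> object = (Map<?, ?>) fieldValue;
      if (object.size() == 2
          && object.containsKey("_datatype_")
          && object.containsKey("_value_")) {
        // A datatype object only contributes a datatype; it adds no tree level.
        datatype = String.valueOf(object.get("_datatype_"));
        value = object.get("_value_");
      }
    }
    // The label depends only on the position of the field, not on the datatype.
    return fieldIndex + ".0 " + value + " [" + datatype + "]";
  }

  public static void main(String[] args) {
    Map<String, String> typed = new LinkedHashMap<>();
    typed.put("_datatype_", "my datatype");
    typed.put("_value_", "b");
    System.out.println(describeValue(0, "b"));
    System.out.println(describeValue(0, typed));
  }
}
```

Both calls report the label 0.0 for the value b; only the datatype in brackets differs.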

SIREn uses Lucene's payload interface to encode information such as the node label and the position of the token. This payload is then decoded by the index API and encoded back into the node-based inverted index data structure.
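
As a rough illustration of carrying a node label and a token position in a byte payload, the sketch below round-trips both through a byte array. The byte layout (a length prefix followed by fixed 4-byte integers) and all names are assumptions made for this example; SIREn's actual payload format is not described here and is likely more compact.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative sketch only; the byte layout below is an assumption made for
// this example and is not SIREn's actual payload format.
final class PayloadSketch {

  /** Encodes the label length, the label components and the position as 4-byte ints. */
  static byte[] encode(int[] nodeLabel, int position) {
    ByteBuffer buffer = ByteBuffer.allocate(4 * (nodeLabel.length + 2));
    buffer.putInt(nodeLabel.length);
    for (int component : nodeLabel) {
      buffer.putInt(component);
    }
    buffer.putInt(position);
    return buffer.array();
  }

  /** Reads back the node label written by encode. */
  static int[] decodeLabel(byte[] payload) {
    ByteBuffer buffer = ByteBuffer.wrap(payload);
    int[] label = new int[buffer.getInt()];
    for (int i = 0; i < label.length; i++) {
      label[i] = buffer.getInt();
    }
    return label;
  }

  /** Reads back the token position, which is stored right after the label. */
  static int decodePosition(byte[] payload) {
    ByteBuffer buffer = ByteBuffer.wrap(payload);
    int length = buffer.getInt(0);
    return buffer.getInt(4 * (length + 1));
  }

  public static void main(String[] args) {
    // Node label 0.0 and token position 3.
    byte[] payload = encode(new int[] {0, 0}, 3);
    System.out.println(Arrays.toString(decodeLabel(payload)));
    System.out.println(decodePosition(payload));
  }
}
```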