Package org.sindice.siren.analysis

Analyzer for indexing JSON content.

Introduction

This package extends Lucene's analysis API to support parsing and indexing JSON content. For an introduction to Lucene's analysis API, see the org.apache.lucene.analysis package documentation.

Overview of the API

This package contains concrete components (Attributes, Tokenizers and TokenFilters) for analyzing JSON content.

It provides a pre-built JSON analyzer, JsonAnalyzer, that you can use to get started quickly.

It also contains a number of NumericAnalyzers that are used to support datatypes.

SIREn's analysis API is divided into several packages:

JSON Analyzer

The JSON tokenizer parses the JSON data and converts it into an abstract tree model. The conversion is performed in streaming mode during parsing.

The tokenizer traverses the JSON tree depth-first. During the traversal, it increments the Dewey code (i.e., the node label) whenever an object, an array, a field or a value is encountered. It attaches the current node label to every token it generates, using the NodeAttribute.
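
The labeling can be illustrated with a minimal, self-contained sketch. It is not SIREn's tokenizer: the class DeweyLabelSketch and its methods are hypothetical, the JSON tree is modelled with plain Java maps and lists, and the numbering rules (fields numbered under their object, values numbered under their field, a nested object counted as one value node) are simplifying assumptions that may differ from the real implementation in how objects and arrays are counted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified Dewey labeling of a JSON-like tree (illustrative, not SIREn's tokenizer). */
final class DeweyLabelSketch {

  /** Walks the object depth-first and returns "label text" lines in traversal order. */
  static List<String> label(Map<String, Object> root) {
    List<String> out = new ArrayList<>();
    labelObject(root, "", out);
    return out;
  }

  // Assumption: the fields of an object are numbered 0, 1, 2, ... below the object's label.
  private static void labelObject(Map<String, Object> object, String prefix, List<String> out) {
    int child = 0;
    for (Map.Entry<String, Object> field : object.entrySet()) {
      String fieldLabel = extend(prefix, child++);
      out.add(fieldLabel + " " + field.getKey());
      labelValues(field.getValue(), fieldLabel, out);
    }
  }

  // Assumption: array elements are numbered as successive children of the field.
  // Nested arrays are not modelled; they would be printed like scalar values.
  private static void labelValues(Object value, String fieldLabel, List<String> out) {
    List<?> values = (value instanceof List) ? (List<?>) value : Collections.singletonList(value);
    int child = 0;
    for (Object v : values) {
      String valueLabel = extend(fieldLabel, child++);
      if (v instanceof Map) {
        out.add(valueLabel + " {}");
        labelObject(asObject(v), valueLabel, out);
      } else {
        out.add(valueLabel + " " + v);
      }
    }
  }

  private static String extend(String prefix, int index) {
    return prefix.isEmpty() ? Integer.toString(index) : prefix + "." + index;
  }

  @SuppressWarnings("unchecked")
  private static Map<String, Object> asObject(Object value) {
    return (Map<String, Object>) value;
  }

  public static void main(String[] args) {
    // Models { "a" : "b", "c" : { "d" : ["e", "f"] } }
    Map<String, Object> inner = new LinkedHashMap<>();
    inner.put("d", java.util.Arrays.asList("e", "f"));
    Map<String, Object> doc = new LinkedHashMap<>();
    doc.put("a", "b");
    doc.put("c", inner);
    for (String line : label(doc)) {
      System.out.println(line);
    }
  }
}
```

Under these assumptions the field a receives the label 0 and its value b the label 0.0, while the values e and f nested under c receive 1.0.0.0 and 1.0.0.1.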

The tokenizer also attaches datatype metadata to every token it generates, using the DatatypeAttribute. A datatype specifies the type of data a node contains. By default, the tokenizer differentiates five datatypes in the JSON syntax:

The datatype metadata is used to apply the appropriate analysis to the content of a node. This analysis is performed by the DatatypeAnalyzerFilter. Users can freely configure the analysis of each datatype with the method JsonAnalyzer.registerDatatype(char[], org.apache.lucene.analysis.Analyzer).
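
The idea of per-datatype analysis can be sketched without any SIREn or Lucene dependency. The class below is a hypothetical stand-in, not the actual API: analyzers are modelled as plain functions from text to terms, datatypes are identified by made-up string labels ("text", "number"), and the fallback for unregistered datatypes is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Illustrative stand-in for per-datatype analysis; not the SIREn or Lucene API. */
final class DatatypeDispatchSketch {

  // Hypothetical registry: datatype label -> function turning node content into terms.
  private final Map<String, Function<String, List<String>>> analyzers = new HashMap<>();
  private final Function<String, List<String>> fallback;

  DatatypeDispatchSketch(Function<String, List<String>> fallback) {
    this.fallback = fallback;
  }

  /** Mirrors the role of registerDatatype: associates a datatype with its analysis. */
  void registerDatatype(String datatype, Function<String, List<String>> analyzer) {
    analyzers.put(datatype, analyzer);
  }

  /** Analyzes a node's content with the function registered for its datatype. */
  List<String> analyze(String datatype, String content) {
    return analyzers.getOrDefault(datatype, fallback).apply(content);
  }

  public static void main(String[] args) {
    // Fallback (assumption): keep the content as a single, unmodified term.
    DatatypeDispatchSketch sketch =
        new DatatypeDispatchSketch(text -> Collections.singletonList(text));

    // "text": lower-case and split on whitespace.
    sketch.registerDatatype("text",
        text -> Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+")));

    // "number": normalize the textual form of an integer.
    sketch.registerDatatype("number",
        text -> Collections.singletonList(Long.toString(Long.parseLong(text.trim()))));

    System.out.println(sketch.analyze("text", "Hello JSON World"));
    System.out.println(sketch.analyze("number", "0042"));
    System.out.println(sketch.analyze("unregistered", "Left As Is"));
  }
}
```

In the real package the registered object is a Lucene Analyzer and the dispatch happens inside the token stream, but the principle is the same: the datatype carried by a token selects the analysis applied to it.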

Custom datatypes can also be used through a dedicated datatype JSON object, which has the following schema:

 {
   "_datatype_" : <LABEL>,
   "_value_" : <VALUE>
 }
 
<LABEL> is a string naming the datatype to apply to the associated value. <VALUE> is the associated value, also a string. The datatype JSON object is a way of passing custom datatypes to SIREn. It has no influence on the label of the value node. For example, the label (i.e., 0.0) of the value b below:
 {
   "a" : "b"
 }
 
is the same for the value b with a custom datatype:
 {
   "a" : {
     "_datatype_" : "my datatype",
     "_value_" : "b"
   }
 }
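
The two forms above can be handled uniformly by a small helper that recognizes the datatype JSON object and extracts its label and value. The sketch below is hypothetical and not part of SIREn: the class DatatypeObjectSketch, the default datatype passed by the caller, and the rule that a datatype object has exactly the two keys of the schema are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative helper for the datatype JSON object; not part of SIREn. */
final class DatatypeObjectSketch {

  static final String DATATYPE_KEY = "_datatype_";
  static final String VALUE_KEY = "_value_";

  /** Builds a datatype JSON object, modelled as an insertion-ordered map. */
  static Map<String, Object> wrap(String datatype, String value) {
    Map<String, Object> object = new LinkedHashMap<>();
    object.put(DATATYPE_KEY, datatype);
    object.put(VALUE_KEY, value);
    return object;
  }

  /** Assumption: a datatype object has exactly the two schema keys, both strings. */
  static boolean isDatatypeObject(Object parsed) {
    if (!(parsed instanceof Map)) {
      return false;
    }
    Map<?, ?> map = (Map<?, ?>) parsed;
    return map.size() == 2
        && map.get(DATATYPE_KEY) instanceof String
        && map.get(VALUE_KEY) instanceof String;
  }

  /** Returns {datatype, value}; plain values get the caller-supplied default datatype. */
  static String[] unwrap(Object parsed, String defaultDatatype) {
    if (isDatatypeObject(parsed)) {
      Map<?, ?> map = (Map<?, ?>) parsed;
      return new String[] {(String) map.get(DATATYPE_KEY), (String) map.get(VALUE_KEY)};
    }
    return new String[] {defaultDatatype, String.valueOf(parsed)};
  }

  public static void main(String[] args) {
    // Plain form: "a" : "b"
    System.out.println(Arrays.toString(unwrap("b", "default datatype")));
    // Custom form: "a" : { "_datatype_" : "my datatype", "_value_" : "b" }
    System.out.println(Arrays.toString(unwrap(wrap("my datatype", "b"), "default datatype")));
  }
}
```

Both calls yield the same value b and differ only in the datatype, which matches the statement that the wrapper object does not introduce an extra node in the tree.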
 

SIREn uses Lucene's payload interface to encode information such as the node label and the position of the token. This payload is then decoded by the index API and encoded back into the node-based inverted index data structure.
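
To make the round trip concrete, here is a minimal sketch of packing a node label and a token position into a byte array and reading them back. The byte layout (label length, label components, then position, each as a variable-length integer) and the class NodePayloadSketch are assumptions chosen for illustration; they are not SIREn's actual payload codec, and the sketch does not use Lucene's payload classes.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Illustrative payload layout for a node label and a position; not SIREn's codec. */
final class NodePayloadSketch {

  /** Assumed layout: [label length][label components...][position], all variable-length. */
  static byte[] encode(int[] nodeLabel, int position) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeVInt(out, nodeLabel.length);
    for (int component : nodeLabel) {
      writeVInt(out, component);
    }
    writeVInt(out, position);
    return out.toByteArray();
  }

  /** Returns the label components followed by the position as the last element. */
  static int[] decode(byte[] payload) {
    int[] offset = {0};
    int length = readVInt(payload, offset);
    int[] result = new int[length + 1];
    for (int i = 0; i <= length; i++) {
      result[i] = readVInt(payload, offset);
    }
    return result;
  }

  // Writes 7 bits per byte, low-order bits first; the high bit marks "more bytes follow".
  // Values are assumed to be non-negative.
  private static void writeVInt(ByteArrayOutputStream out, int value) {
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
  }

  private static int readVInt(byte[] bytes, int[] offset) {
    int value = 0;
    int shift = 0;
    byte b;
    do {
      b = bytes[offset[0]++];
      value |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }

  public static void main(String[] args) {
    // Node label 0.0 (the value "b" in the example above) at token position 3.
    byte[] payload = encode(new int[] {0, 0}, 3);
    System.out.println(payload.length + " bytes: " + Arrays.toString(decode(payload)));
  }
}
```

A variable-length encoding keeps small label components and positions to one byte each, which is the usual motivation for this kind of layout in an inverted index.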

Copyright © 2014. All rights reserved.