Class | Description |
---|---|
AnyURIAnalyzer | Analyzer designed to deal with any kind of URI and to perform some post-processing on URIs. |
DoubleNumericAnalyzer | An implementation of the NumericAnalyzer for double values. |
FloatNumericAnalyzer | An implementation of the NumericAnalyzer for float values. |
IntNumericAnalyzer | An implementation of the NumericAnalyzer for integer values. |
JsonAnalyzer | The JsonAnalyzer is especially designed to process JSON data. |
JsonTokenizer | A tokenizer for data following the JSON syntax. |
JsonTokenizerImpl | JSON document scanner. |
LongNumericAnalyzer | An implementation of the NumericAnalyzer for long values. |
NumericAnalyzer | Abstraction over the analyzer for numeric datatypes. |
NumericTokenizer | This class provides a TokenStream for indexing numeric values that is used in NumericAnalyzer. |
TupleAnalyzer | Deprecated. Use JsonAnalyzer instead. |
TupleTokenizer | Deprecated. Use JsonTokenizer instead. |
Enum | Description |
---|---|
AnyURIAnalyzer.URINormalisation | Types of URI normalisation. |
For an introduction to Lucene's analysis API, see the org.apache.lucene.analysis package documentation.
This package contains components (Attributes, Tokenizers and TokenFilters) for analyzing JSON content.
It also provides a pre-built JSON analyzer, JsonAnalyzer, that you can use to get started quickly.
It also contains a number of NumericAnalyzers that are used for supporting datatypes.
SIREn's analysis API is divided into several packages:
- org.sindice.siren.analysis.attributes contains a number of Attributes that are used to add metadata to a stream of tokens.
- org.sindice.siren.analysis.filter contains a number of TokenFilters that alter incoming tokens.
The JSON tokenizer parses the JSON data and converts it into an abstract tree model. The conversion
is performed in streaming mode during parsing.
The tokenizer traverses the JSON tree depth-first.
During the traversal, the tokenizer increments the Dewey code
(i.e., node label) whenever an object, an array, a field or a value
is encountered. The tokenizer attaches the current node label to every token
it generates using the NodeAttribute.
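The labelling described above can be sketched with a small self-contained Java class that walks a JSON-like tree built from java.util maps and lists. It does not use SIREn's JsonTokenizer or NodeAttribute. The class name DeweyLabelSketch, the output format, and the numbering rules for arrays and nested objects are assumptions made for illustration; only the labels produced for { "a" : "b" } are taken from this documentation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a hypothetical helper, not part of SIREn. */
final class DeweyLabelSketch {

  /** Returns "label kind:token" entries in depth-first order. */
  static List<String> label(Map<String, Object> root) {
    List<String> out = new ArrayList<>();
    visitObject(root, "", out);
    return out;
  }

  // Fields of an object are numbered 0, 1, 2, ... under the object's label.
  // The root object itself carries no label in this sketch (assumption).
  private static void visitObject(Map<?, ?> object, String prefix, List<String> out) {
    int index = 0;
    for (Map.Entry<?, ?> field : object.entrySet()) {
      String fieldLabel = prefix.isEmpty() ? String.valueOf(index) : prefix + "." + index;
      index++;
      out.add(fieldLabel + " field:" + field.getKey());

      // Assumption: the elements of an array are direct children of the field.
      Object value = field.getValue();
      List<?> children;
      if (value instanceof List) {
        children = (List<?>) value;
      } else {
        children = Collections.singletonList(value);
      }
      visitChildren(children, fieldLabel, out);
    }
  }

  // Children are numbered 0, 1, 2, ... under the label of their parent.
  private static void visitChildren(List<?> children, String prefix, List<String> out) {
    int index = 0;
    for (Object child : children) {
      String childLabel = prefix + "." + index++;
      if (child instanceof Map) {
        visitObject((Map<?, ?>) child, childLabel, out);
      } else if (child instanceof List) {
        visitChildren((List<?>) child, childLabel, out);
      } else {
        out.add(childLabel + " value:" + child);
      }
    }
  }

  public static void main(String[] args) {
    Map<String, Object> doc = new LinkedHashMap<>();
    doc.put("a", "b");
    // prints:
    // 0 field:a
    // 0.0 value:b
    label(doc).forEach(System.out::println);
  }
}
```

A LinkedHashMap is used so that fields are visited in insertion order, which keeps the labels deterministic.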
The tokenizer also attaches datatype metadata to every token it generates using
the DatatypeAttribute.
A datatype specifies the type of the data a node contains. By default, the
tokenizer differentiates five datatypes in the JSON syntax:
- XSDDatatype.XSD_STRING
- XSDDatatype.XSD_LONG
- XSDDatatype.XSD_DOUBLE
- XSDDatatype.XSD_BOOLEAN
- JSONDatatype.JSON_FIELD
Datatype-specific analysis is applied by the DatatypeAnalyzerFilter. The
analysis of each datatype can be configured freely by the user using the
method JsonAnalyzer.registerDatatype(char[], org.apache.lucene.analysis.Analyzer).
Custom datatypes can also be used through a dedicated datatype JSON object. The schema of the datatype JSON object is as follows:
{ "_datatype_" : <LABEL>, "_value_" : <VALUE> }
- <LABEL> is a string that names the datatype to be applied to the associated value.
- <VALUE> is the associated value, given as a string.
The datatype JSON object is a way of passing custom datatypes to SIREn.
It does not influence the label of the value node.
For example, the label (i.e., 0.0) of the value b in
{ "a" : "b" }
is the same as the label of the value b with a custom datatype:
{ "a" : { "_datatype_" : "my datatype", "_value_" : "b" } }
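As a minimal sketch, a datatype JSON object following the schema above could be assembled in plain Java as shown here. The class DatatypeObjectSketch and its methods are hypothetical helpers, not part of SIREn; in practice a JSON library would normally handle the string escaping.

```java
/** Illustrative only: a hypothetical helper, not part of SIREn. */
final class DatatypeObjectSketch {

  /** Builds { "_datatype_" : LABEL, "_value_" : VALUE } as a JSON string. */
  static String datatypeObject(String label, String value) {
    return "{ \"_datatype_\" : " + quote(label) + ", \"_value_\" : " + quote(value) + " }";
  }

  /** Wraps a string in double quotes, escaping quotes, backslashes and control characters. */
  private static String quote(String s) {
    StringBuilder sb = new StringBuilder("\"");
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '"' || c == '\\') {
        sb.append('\\').append(c);
      } else if (c < 0x20) {
        sb.append(String.format("\\u%04x", (int) c));
      } else {
        sb.append(c);
      }
    }
    return sb.append('"').toString();
  }

  public static void main(String[] args) {
    String document = "{ \"a\" : " + datatypeObject("my datatype", "b") + " }";
    // prints: { "a" : { "_datatype_" : "my datatype", "_value_" : "b" } }
    System.out.println(document);
  }
}
```

Both the label and the value are emitted as JSON strings, matching the schema's requirement that the value is a string.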
SIREn uses Lucene's payload interface to encode information such as the node label and
the position of the token. This payload is then decoded by the index API
and encoded back into the node-based inverted index data structure.
Copyright © 2014. All rights reserved.