Package org.sindice.siren.index.codecs.siren10

Implementation of the encoding and decoding of the block-based postings format for SIREn 1.0.

See: Description

Package org.sindice.siren.index.codecs.siren10 Description

Implementation of the encoding and decoding of the block-based postings format for SIREn 1.0.

Introduction

This package contains the implementation of the encoding and decoding of the SIREn 1.0 block-based postings format. This postings format is compatible with the Lucene 4.0 codec. For an introduction to Lucene 4.0 codec API, see the org.apache.lucene.codecs.lucene40 package documentation.

SIREn 1.0 Postings Format

The SIREn 1.0 postings format is organised around four files: The SIREn 1.0 postings format is divided into multiple blocks. The default block size is defined by Siren10PostingsFormat.DEFAULT_POSTINGS_BLOCK_SIZE.

Documents and Node Frequencies

The .doc file contains a list of document identifiers for each term along with the node frequency of the term in that document. The document identifiers are ordered by increasing number.

The file contains one postings lists for each term. The term is implicit and the position of the start of the postings list for a term is provided by the term dictionary.

One postings list is organised by block. Each block has a maximum size which determines the maximum number of document identifiers it can contain. It is possible that one block contain less document identifiers than its maximum size. For example, if a postings list contain only one document identifier, the size of the block will be one. The block format follows the schema:

   Block  = Header, CompressedDoc, CompressedNodeFreq
   Header = BlockSize,
            CompressedDocSize, CompressedNodeFreqSize,
            FirstDocId, LastDocId,
            NodeBlockPointer, PosBlockPointer
   CompressedDoc      = [DeltaDocId]
   CompressedNodeFreq = [NodeFreq]
 
BlockSize records the size of the block, i.e., the number of document identifiers.

CompressedDocLength records the size (in bytes) of the compressed byte array CompressedDoc.

CompressedNodeFreqLength records the size (in bytes) of the compressed byte array CompressedNodeFreq.

FirstDocId and LastDocId records the first and last document identifiers of the block. This information is used by the skip list algorithm. The LastDocId is encoded as delta between the FirstDocId.

NodeBlockPointer records the pointer of the node block in the .nod file that is associated to this block.

PosBlockPointer records the pointer of the position block in the .pos file that is associated to this block.

CompressedDoc is the compressed list of document identifiers. This list is compressed using the AFOR algorithm. The document identifiers are encoded as delta. The first document of this list is always 0 as it is encoded as delta with FirstDocId.

CompressedNodeFreq is the compressed list of node frequencies. This list is compressed using the AFOR algorithm. There is one node frequency per document identifier. The node frequency is encoded with a decrement of 1 to optimise AFOR compression: it gives a higher chance to get smaller bit frames.

Node Labels and Term Frequencies

The .nod file contains a list of node labels for each document along with the length of each node labels, and the frequency of the term for each node. The node labels are ordered by increasing dewey codes.

The file is organised by block. Each block is synchronised with a .doc block, i.e., a single block contains all the node labels for the complete set of document identifiers contained in the .doc block. A block has a variable size which is determined by the number of node labels associated with the documents. Synchronising blocks across files simplifies encoding and decoding instructions and improves the performance.

The block format follows the schema:

   Block  = Header, CompressedNodeLength, CompressedNode, CompressedTermFreq
   Header = NodeLengthBlockSize, NodeBlockSize, TermFreqBlockSize,
            CompressedNodeLengthSize, CompressedNodeSize, CompressedTermFreqSize
   CompressedNodeLength = [NodeLength]
   CompressedNode       = [DeltaNode]
   CompressedTermFreq   = [TermFreq]
 
NodeLengthBlockSize records the size of the block of the node lengths, i.e., the number of node lengths.

NodeBlockSize records the size of the node block, i.e., the number of integers composing the node labels.

TermFreqBlockSize records the size of the term frequency block, i.e., the number of term frequencies.

CompressedNodeLengthSize records the size (in bytes) of the compressed byte array CompressedNodeLength.

CompressedNodeSize records the size (in bytes) of the compressed byte array CompressedNode.

CompressedTermFreqSize records the size (in bytes) of the compressed byte array CompressedTermFreq.

CompressedNodeLength is the compressed list of node lengths. Since each node label can have a different length, the node length records the number of integers that composes a node label. This list is compressed using the AFOR algorithm. The node frequency is encoded with a decrement of 1 to optimise AFOR compression: it gives a higher chance to get smaller bit frames.

CompressedNode is the compressed list of node labels. This list is compressed using the AFOR algorithm. The node labels relative to a document are encoded as delta.

CompressedTermFreq is the compressed list of term frequencies. This list is compressed using the AFOR algorithm. There is one term frequency per node label. The node frequency is encoded with a decrement of 1 to optimise AFOR compression: it gives a higher chance to get smaller bit frames.

Term Positions

The .pos file contains the list of term positions within nodes. The term positions are ordered by increasing number.

The file is organised by block. Each block is synchronised with a .nod block, i.e., a single block contains all the term positions for the complete set of node labels contained in the .nod block. A block has a variable size which is determined by the number of term positions associated with the nodes. Synchronising blocks across files simplifies encoding and decoding instructions and improves the performance.

The block format follows the schema:

   Block  = Header, CompressedTermPos
   Header = TermPosBlockSize,
            CompressedTermPosSize
   CompressedTermPos  = [DeltaTermPos]
 
TermPosBlockSize records the size of the term position block, i.e., the number of term positions.

CompressedTermPosSize records the size (in bytes) of the compressed byte array CompressedTermPos.

CompressedTermPos is the compressed list of term positions. This list is compressed using the AFOR algorithm. There is one or more term positions per node. The number of term positions per node is provided by the .nod block with the term frequency information. The term positions relative to a node are encoded as delta.

Skip Lists

The .skp file contains the skip data. The structure of the skip table is quite similar to Lucene40PostingsFormat. However, the skip data is defined around the concept of block instead of document. The skip interval defines the number of blocks between each skip data. The default skip interval is defined by Siren10PostingsFormat.DEFAULT_POSTINGS_BLOCK_SIZE. Each skip entry points to the beginning of one block.

In contrast to the Lucene skip lists, part of the skip data is inlined within the .doc file. The pointers to the .nod block and .pos block associated to the .doc block are encoded in the header of the .doc file by NodeBlockPointer and PosBlockPointer.

For more information about our skip table approach over blocks, please refer to the publication SkipBlock: Self-indexing for Block-Based Inverted List.

Interaction with the Postings List

The reading of the SIREn 1.0 postings format relies on the lazy-loading approach. Any information, e.g., term positions, term frequencies, node labels and node frequencies, that are not requested explicitly are (1) never decoded, and (2) not read from disk whenever possible.

Copyright © 2014. All rights reserved.