See: Description
Class | Description |
---|---|
DocsFreqBlockIndexInput |
Implementation of the
BlockIndexInput for the .doc file of the SIREn
postings format. |
DocsFreqBlockIndexOutput |
Implementation of the
BlockIndexOutput for the .doc file of the SIREn
postings format. |
NodBlockIndexInput |
Implementation of the
BlockIndexInput for the .nod file of the SIREn
postings format. |
NodBlockIndexOutput |
Implementation of the
BlockIndexOutput for the .nod file of the SIREn
postings format. |
PosBlockIndexInput |
Implementation of the
BlockIndexInput for the .pos file of the SIREn
postings format. |
PosBlockIndexOutput |
Implementation of the
BlockIndexOutput for the .pos file of the SIREn
postings format. |
Siren10AForPostingsFormat |
Implementation of the
Siren10PostingsFormat based on AFOR. |
Siren10BlockStreamFactory |
This class creates
BlockIndexOutput and BlockIndexInput
for the SIREn 1.0 postings format. |
Siren10PostingsFormat |
Abstraction over the SIREn 1.0 postings format.
|
Siren10PostingsReader |
Reads the document identifiers, node frequencies, node labels, term
frequencies, term positions and block skip data.
|
Siren10PostingsWriter |
Writes the document identifiers, node frequencies, node labels, term
frequencies, term positions and block skip data.
|
Siren10SkipListReader |
Implementation of the
MultiLevelSkipListReader for the default
block-based posting list format of SIREn 1.0. |
Siren10SkipListWriter |
Implementation of the
MultiLevelSkipListWriter for the default
block-based posting list format of SIREn 1.0. |
Siren10VIntPostingsFormat |
Implementation of the
Siren10PostingsFormat based on VInt. |
org.apache.lucene.codecs.lucene40
package documentation.
Siren10PostingsFormat.DEFAULT_POSTINGS_BLOCK_SIZE
.
The file contains one postings lists for each term. The term is implicit and the position of the start of the postings list for a term is provided by the term dictionary.
One postings list is organised by block. Each block has a maximum size which determines the maximum number of document identifiers it can contain. It is possible that one block contain less document identifiers than its maximum size. For example, if a postings list contain only one document identifier, the size of the block will be one. The block format follows the schema:
Block = Header, CompressedDoc, CompressedNodeFreq Header = BlockSize, CompressedDocSize, CompressedNodeFreqSize, FirstDocId, LastDocId, NodeBlockPointer, PosBlockPointer CompressedDoc = [DeltaDocId] CompressedNodeFreq = [NodeFreq]BlockSize records the size of the block, i.e., the number of document identifiers.
CompressedDocLength records the size (in bytes) of the compressed byte array CompressedDoc.
CompressedNodeFreqLength records the size (in bytes) of the compressed byte array CompressedNodeFreq.
FirstDocId and LastDocId records the first and last document identifiers of the block. This information is used by the skip list algorithm. The LastDocId is encoded as delta between the FirstDocId.
NodeBlockPointer records the pointer of the node block in the .nod file that is associated to this block.
PosBlockPointer records the pointer of the position block in the .pos file that is associated to this block.
CompressedDoc is the compressed list of document identifiers. This list is compressed using the AFOR algorithm. The document identifiers are encoded as delta. The first document of this list is always 0 as it is encoded as delta with FirstDocId.
CompressedNodeFreq is the compressed list of node frequencies. This list is compressed using the AFOR algorithm. There is one node frequency per document identifier. The node frequency is encoded with a decrement of 1 to optimise AFOR compression: it gives a higher chance to get smaller bit frames.
The file is organised by block. Each block is synchronised with a .doc block, i.e., a single block contains all the node labels for the complete set of document identifiers contained in the .doc block. A block has a variable size which is determined by the number of node labels associated with the documents. Synchronising blocks across files simplifies encoding and decoding instructions and improves the performance.
The block format follows the schema:
Block = Header, CompressedNodeLength, CompressedNode, CompressedTermFreq Header = NodeLengthBlockSize, NodeBlockSize, TermFreqBlockSize, CompressedNodeLengthSize, CompressedNodeSize, CompressedTermFreqSize CompressedNodeLength = [NodeLength] CompressedNode = [DeltaNode] CompressedTermFreq = [TermFreq]NodeLengthBlockSize records the size of the block of the node lengths, i.e., the number of node lengths.
NodeBlockSize records the size of the node block, i.e., the number of integers composing the node labels.
TermFreqBlockSize records the size of the term frequency block, i.e., the number of term frequencies.
CompressedNodeLengthSize records the size (in bytes) of the compressed byte array CompressedNodeLength.
CompressedNodeSize records the size (in bytes) of the compressed byte array CompressedNode.
CompressedTermFreqSize records the size (in bytes) of the compressed byte array CompressedTermFreq.
CompressedNodeLength is the compressed list of node lengths. Since each node label can have a different length, the node length records the number of integers that composes a node label. This list is compressed using the AFOR algorithm. The node frequency is encoded with a decrement of 1 to optimise AFOR compression: it gives a higher chance to get smaller bit frames.
CompressedNode is the compressed list of node labels. This list is compressed using the AFOR algorithm. The node labels relative to a document are encoded as delta.
CompressedTermFreq is the compressed list of term frequencies. This list is compressed using the AFOR algorithm. There is one term frequency per node label. The node frequency is encoded with a decrement of 1 to optimise AFOR compression: it gives a higher chance to get smaller bit frames.
The file is organised by block. Each block is synchronised with a .nod block, i.e., a single block contains all the term positions for the complete set of node labels contained in the .nod block. A block has a variable size which is determined by the number of term positions associated with the nodes. Synchronising blocks across files simplifies encoding and decoding instructions and improves the performance.
The block format follows the schema:
Block = Header, CompressedTermPos Header = TermPosBlockSize, CompressedTermPosSize CompressedTermPos = [DeltaTermPos]TermPosBlockSize records the size of the term position block, i.e., the number of term positions.
CompressedTermPosSize records the size (in bytes) of the compressed byte array CompressedTermPos.
CompressedTermPos is the compressed list of term positions. This list is compressed using the AFOR algorithm. There is one or more term positions per node. The number of term positions per node is provided by the .nod block with the term frequency information. The term positions relative to a node are encoded as delta.
Siren10PostingsFormat.DEFAULT_POSTINGS_BLOCK_SIZE
.
Each skip entry points to the beginning of one block.
In contrast to the Lucene skip lists, part of the skip data is inlined within the .doc file. The pointers to the .nod block and .pos block associated to the .doc block are encoded in the header of the .doc file by NodeBlockPointer and PosBlockPointer.
For more information about our skip table approach over blocks, please refer to the publication SkipBlock: Self-indexing for Block-Based Inverted List.
Copyright © 2014. All rights reserved.