SIREn

Semi-Structured Information Retrieval Engine

This academic project is not maintained anymore. For updates and commercial support, please refer to SIREn Solutions.


Overview

SIREn is a Lucene/Solr extension for efficient schemaless semi-structured full-text search. SIREn is not a complete application by itself, but rather a code library and API that can easily be used to create a full-featured semi-structured search engine.

Efficient, large-scale handling of semi-structured data is an increasingly important issue in information search scenarios, both on the web and in the enterprise.

While Lucene has long offered such capabilities, they are not intended for collections of schemaless semi-structured documents, e.g., collections where the schema varies across documents, or collections with a complex schema and a deeply nested structure. For this reason, we developed SIREn, a Lucene/Solr plugin that overcomes these shortcomings and efficiently indexes and queries complex JSON documents with an arbitrary schema.
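
To make "the schema varies across documents" concrete, here is a minimal Python sketch. It does not use SIREn's API and does not show how SIREn indexes data. It walks two hypothetical JSON documents and lists the field paths each one contains. The `flatten` helper, the sample documents, and the dotted path notation are all assumptions made for illustration.

```python
import json


def flatten(node, path=()):
    """Yield (dotted_path, value) pairs for every leaf of a parsed JSON value."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, path + (key,))
    elif isinstance(node, list):
        for child in node:
            # Array elements share the parent path; positions are ignored here.
            yield from flatten(child, path)
    else:
        yield ".".join(path), node


# Two hypothetical documents with different, nested schemas.
doc_a = json.loads(
    '{"name": "Ada", "address": {"city": "Galway", "country": "IE"}}'
)
doc_b = json.loads(
    '{"title": "Report", "authors": [{"name": "Bob"}, {"name": "Eve"}], "year": 2013}'
)

pairs_a = list(flatten(doc_a))
pairs_b = list(flatten(doc_b))

# The sets of field paths seen in each document.
fields_a = {path for path, _ in pairs_a}
fields_b = {path for path, _ in pairs_b}

print(sorted(fields_a))
print(sorted(fields_b))
print(sorted(fields_a & fields_b))
```

In this sketch the two documents share no field path at all, so an index built around a fixed, predeclared list of fields would have to anticipate every path in advance. That is the situation the paragraph above describes.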

In terms of features, SIREn sits halfway between Solr (it offers all of Solr's search features) and MongoDB (it can index arbitrary JSON documents).

Learn more about SIREn

You can find a detailed description of SIREn's architecture and API in the project's Java documentation. You can also watch a talk about SIREn given at the Lucene Revolution 2013 conference.

Finally, you can read one of our scientific publications for details about the data model and algorithms behind SIREn.

Reference

If you are using SIREn for your scientific work, please cite the following article:

Renaud Delbru, Stephane Campinas, Giovanni Tummarello, Searching web data: An entity retrieval and high-performance indexing model, In Web Semantics: Science, Services and Agents on the World Wide Web, ISSN 1570-8268, 10.1016/j.websem.2011.04.004.

Community

Mailing List

SIREn-User is a mailing list that you can join to seek help, discuss possible improvements, etc.

License

SIREn is open source, released under the Apache License, Version 2.0.

Issues

You can report issues on the GitHub issue tracker.

Acknowledgements

The SIREn project is based upon works supported by: