Project - SemanticInformationRetrieval (SIR)
Notes:
- This project was proposed following a discussion held at MWStake's August 2017 monthly meeting (see line 113 on https://etherpad.wikimedia.org/p/mwstake-2017-08).
- It is the subject of a workshop at the upcoming SMWCon Fall 2017: SMWCon_Fall_2017/Taking the first step towards a Semantic Information Retrieval System (SIRS).
- This is a working page representing consolidated current views. It is therefore subject to frequent changes.
Preamble
This endeavor shall heed the following maxims:
Take the time to go fast. Good architecture maximizes the number of decisions not made. Write the code you wish you had.
Intention/Goal/Purpose/Fundamentals (Current Thoughts)
SemanticInformationRetrieval (SIR) aims to provide a single Google-like search box (followed by interactive faceted search) that covers all of an organization's structured and unstructured knowledge resources, so that they can be accessed with minimal mental overhead.
Based on this mission declaration, SIR shall consider the following idiosyncrasies and features:
- SIR provides its DSL one abstraction level above MediaWiki's API because, from an organization's perspective, SMW is only one component of its comprehensive knowledge management system (although it may be the central one).
- SIR provides versatile polymorphic search interfaces for full-text search as well as (subsequent) interactive faceted search along subject-affinity paths through semanticized information.
- SIR is not a single MediaWiki extension; it consists of two independent MediaWiki extensions:
- Extension:SIRIndexer, providing a DSL for mapping and indexing both a page's factorized information in template instances and its free text.
- Extension:SIRSearchInterface, providing an API for search interfaces on special pages and through parser functions.
- Truly relevant full-text search cannot be achieved without tuning the analysis process at both index time (defining features) and query time (extracting signals). That's why SIR provides DSL methods that
- e.g. in the case of Elasticsearch, take raw JSON as an input, and
- are sufficiently high-level/concrete that improving search relevance for a customer's SMW becomes straightforward without low-level coding, e.g. by adding and composing document fields (SIR DSL) and defining their analysis (e.g. Elasticsearch JSON input).
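To make the last point more concrete, the following Python sketch illustrates how high-level field definitions might be combined with raw Elasticsearch analysis JSON at index time, and how a full-text query with facets might be composed at query time. It is not the SIR DSL: the function names (`build_index_body`, `build_search_body`), the field names (`free_text`, `template_category`) and the analyzer name (`sir_english`) are hypothetical, and the mapping layout assumes the typeless mappings of Elasticsearch 7 or later. The sketch only builds request bodies; it does not talk to a server.

```python
import json

# Raw Elasticsearch analysis JSON, passed through unchanged (index-time tuning).
# The analyzer name "sir_english" is a made-up example.
RAW_ANALYSIS = json.loads("""
{
  "analyzer": {
    "sir_english": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["lowercase", "english_stop", "english_stemmer"]
    }
  },
  "filter": {
    "english_stop": {"type": "stop", "stopwords": "_english_"},
    "english_stemmer": {"type": "stemmer", "language": "english"}
  }
}
""")


def build_index_body(fields, analysis):
    """Combine hypothetical high-level field definitions with raw analysis JSON.

    For simplicity, this sketch only accepts analyzers that are defined
    in the raw JSON (built-in analyzers would be rejected as well).
    """
    known_analyzers = set(analysis.get("analyzer", {}))
    properties = {}
    for name, spec in fields.items():
        prop = {"type": spec["type"]}
        analyzer = spec.get("analyzer")
        if analyzer is not None:
            if analyzer not in known_analyzers:
                raise ValueError(
                    f"field {name!r} uses undefined analyzer {analyzer!r}"
                )
            prop["analyzer"] = analyzer
        properties[name] = prop
    return {
        "settings": {"analysis": analysis},
        "mappings": {"properties": properties},
    }


def build_search_body(text, text_field, facet_fields, buckets=10):
    """Full-text match query plus one terms aggregation (facet) per field."""
    return {
        "query": {"match": {text_field: text}},
        "aggs": {
            field: {"terms": {"field": field, "size": buckets}}
            for field in facet_fields
        },
    }


# Hypothetical document fields: a page's free text and one template-derived value.
FIELDS = {
    "free_text": {"type": "text", "analyzer": "sir_english"},
    "template_category": {"type": "keyword"},
}

index_body = build_index_body(FIELDS, RAW_ANALYSIS)
search_body = build_search_body("retrieval", "free_text", ["template_category"])
```

In this sketch, `index_body` could serve as the body of an index-creation request and `search_body` as the body of a search request. Keeping the analysis section as raw JSON means relevance tuning stays possible without changing the field-composition code.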