Project - SemanticInformationRetrieval (SIR)

From MWStake
Revision as of 01:39, 9 August 2017 by Lex Sulzer (talk | contribs) (Added link to https://github.com/MWStake/SemanticInformationRetrieval)

Notes:

Preamble

This endeavor shall heed the following maxims:

Take the time to go fast.

Good architecture maximizes the number of decisions not made.

Write the code you wish you had.

Intention/Goal/Purpose/Fundamentals (Current Thoughts)

SemanticInformationRetrieval (SIR) aims to provide a single Google-like search text box
(with subsequent faceted interactive search) covering all of an organization's structured
and unstructured knowledge resources, so that they can be accessed with the least amount of mental overhead.

Based on this mission declaration, SIR shall consider the following idiosyncrasies and features:

  1. SIR provides its DSL one abstraction level above MediaWiki's API because, from an organization's perspective, SMW is only one component of its comprehensive knowledge management system (albeit possibly the central one).
  2. SIR provides versatile polymorphic search interfaces for full-text search as well as (subsequent) faceted interactive search along subject-affinity paths through semanticized information.
  3. SIR is not a single MediaWiki extension; it provides two independent MW extensions:
    • Extension:SIRIndexer, providing a DSL for mapping and indexing both a page's factorized information in template instances and its free text.
    • Extension:SIRSearchInterface, providing an API for search interfaces on special pages and via parser functions.
      • Such search interfaces shall be easy to configure so that they cover an arbitrary facet of any complexity from the outset, i.e. like "searching in a namespace" but extended to "searching in a facet".
  4. Truly relevant full-text search cannot be achieved without tuning the analysis process at both index time (defining features) and query time (extracting signals). That's why SIR provides DSL methods that
    • take backend-native input, e.g. raw JSON in the case of Elasticsearch, and
    • are sufficiently high-level and concrete that improving search relevance for a customer's SMW becomes straightforward without low-level coding, e.g. adding and composing document fields (SIR DSL) and their analysis (e.g. Elasticsearch JSON input).
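
As a rough illustration of point 4, the following Python sketch shows how raw Elasticsearch analysis JSON might be combined with a small field-composition layer, and how a faceted query could be derived from the same definition. The class IndexDefinition, its methods, the analyzer name sir_text, and the field names title, free_text and category are all hypothetical; they are not part of SIR or of any MediaWiki extension. Only the shape of the generated dictionaries follows Elasticsearch's index-creation and search-request JSON (typeless mappings are assumed, as in Elasticsearch 7 and later). No Elasticsearch client is used; the sketch only builds request bodies.

```python
import json

# Raw Elasticsearch analysis JSON, passed through unchanged (index-time tuning).
RAW_ANALYSIS = {
    "analyzer": {
        "sir_text": {  # hypothetical analyzer name
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "asciifolding"],
        }
    }
}


class IndexDefinition:
    """Hypothetical stand-in for a SIR DSL object; not a real SIR API."""

    def __init__(self, analysis):
        self.analysis = analysis
        self.fields = {}

    def add_text_field(self, name, analyzer):
        # Full-text field, analyzed with one of the analyzers from the raw JSON.
        if analyzer not in self.analysis.get("analyzer", {}):
            raise ValueError(f"unknown analyzer: {analyzer}")
        self.fields[name] = {"type": "text", "analyzer": analyzer}
        return self

    def add_facet_field(self, name):
        # Facet field, stored as an exact (non-analyzed) keyword.
        self.fields[name] = {"type": "keyword"}
        return self

    def to_index_body(self):
        # Body for creating an index: analysis settings plus field mappings.
        return {
            "settings": {"analysis": self.analysis},
            "mappings": {"properties": self.fields},
        }

    def search_body(self, text, facets=None):
        # Full-text query over all text fields, narrowed by the selected
        # facet values, with one terms aggregation per facet field.
        facets = facets or {}
        text_fields = [n for n, f in self.fields.items() if f["type"] == "text"]
        facet_fields = [n for n, f in self.fields.items() if f["type"] == "keyword"]
        for name in facets:
            if name not in facet_fields:
                raise ValueError(f"not a facet field: {name}")
        return {
            "query": {
                "bool": {
                    "must": [{"multi_match": {"query": text, "fields": text_fields}}],
                    "filter": [{"term": {k: v}} for k, v in facets.items()],
                }
            },
            "aggs": {n: {"terms": {"field": n}} for n in facet_fields},
        }


index = (
    IndexDefinition(RAW_ANALYSIS)
    .add_text_field("title", analyzer="sir_text")
    .add_text_field("free_text", analyzer="sir_text")
    .add_facet_field("category")
)

print(json.dumps(index.to_index_body(), indent=2))
print(json.dumps(index.search_body("semantic search", {"category": "Project"}), indent=2))
```

In this sketch the analyzer definition stays in the backend's own JSON, while fields are added and composed through a few high-level calls. The same definition then yields both the index body and a "searching in a facet" query, which is one possible reading of the split described above.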