Project - SemanticInformationRetrieval (SIR)

From MWStake

Notes:

Preamble

This endeavor shall heed the following maxims:

Take the time to go fast.

Good architecture maximizes the number of decisions not made.

Write the code you wish you had.

Example

Intention/Goal/Purpose/Fundamentals (Current Thoughts)

Relevant information retrieval represents the primary human-computer interface towards our customers' end users. It is the ultimate end of our guild's service provision loop between knowledge workers and knowledge consumers.

SemanticInformationRetrieval (SIR) aims to provide a single Google-like search text box
(with subsequent faceted interactive search) covering all of an organization's structured
and unstructured knowledge resources, so that they can be accessed with the least mental overhead.

Based on this mission declaration, SIR shall consider the following idiosyncrasies and features:

  1. SIR provides its DSL one abstraction level above MediaWiki's API because, from an organization's perspective, SMW is only one component of its comprehensive knowledge management system (albeit possibly the central one).
  2. SIR provides versatile polymorphic search interfaces for full-text search as well as (subsequent) faceted interactive search along subject-affinity paths through semanticized information (EPPO-style knowledge topics).
  3. SIR comprises several dedicated MediaWiki extensions:
    • Extension:SIRIndexer, providing a DSL for mapping and indexing both a page's factorized information in template instances and its free text.
    • Extension:SIRUserInterface, providing an API for search interfaces on special pages and via parser functions.
      • Such search interfaces shall be easily configurable to cover an arbitrary facet of any complexity from the outset, i.e. like "searching in a namespace" but extended to "searching in a facet".
    • Extension:SIRBackend, providing fast "ES cached" delivery of a page's wikitext rendered as HTML.
  4. Truly relevant full-text search cannot be achieved without tuning the analysis process at both index time (defining features) and query time (extracting signals). That's why SIR provides DSL methods that
    • accept raw backend input, e.g. raw JSON in the case of Elasticsearch, and
    • are sufficiently high-level and concrete that improving search relevance for a customer's SMW becomes straightforward without low-level coding, e.g. adding and composing document fields (SIR DSL) and configuring their analysis (e.g. Elasticsearch JSON input).
  5. SIR can be morphed into (S)MW's single source of truth, i.e. replacing its SQL backend.
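
As a rough illustration of items 3 and 4, the following Python sketch shows how a DSL of this kind might compose document fields while passing raw Elasticsearch analysis JSON through untouched, and how "searching in a facet" might translate into a query body. The names `Field`, `build_index_body` and `facet_query`, the field names, and the `sir_text` analyzer are assumptions made for illustration only; they are not part of any existing SIR code. The request bodies use the standard Elasticsearch `settings`, `mappings`, `query` and `aggs` keys.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Field:
    """One document field declared through the (hypothetical) SIR DSL."""
    name: str
    type: str                       # Elasticsearch field type, e.g. "text" or "keyword"
    analyzer: Optional[str] = None  # only meaningful for "text" fields


def build_index_body(fields: list[Field], analysis: dict[str, Any]) -> dict[str, Any]:
    """Combine DSL-declared fields with raw Elasticsearch analysis JSON."""
    properties: dict[str, Any] = {}
    for f in fields:
        prop: dict[str, Any] = {"type": f.type}
        if f.analyzer is not None:
            prop["analyzer"] = f.analyzer
        properties[f.name] = prop
    return {
        "settings": {"analysis": analysis},  # passed through unchanged
        "mappings": {"properties": properties},
    }


def facet_query(text: str, facet: dict[str, str], count_by: str) -> dict[str, Any]:
    """Full-text query restricted to a facet, with counts for further drill-down."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"free_text": text}}],
                "filter": [{"term": {k: v}} for k, v in facet.items()],
            }
        },
        "aggs": {count_by: {"terms": {"field": count_by}}},
    }


# Raw Elasticsearch JSON, as someone tuning relevance would supply it.
ANALYSIS = {
    "analyzer": {
        "sir_text": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "asciifolding"],
        }
    }
}

FIELDS = [
    Field("free_text", "text", analyzer="sir_text"),
    Field("topic_type", "keyword"),  # e.g. filled from a template parameter
    Field("department", "keyword"),
]

index_body = build_index_body(FIELDS, ANALYSIS)
query_body = facet_query("vacation policy", {"topic_type": "Guideline"}, "department")
```

Here `index_body` could serve as the body of an index-creation request and `query_body` as the body of a search request. Keeping the analysis JSON opaque to the DSL is one way to let relevance tuning happen without touching extension code.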

Architectural Ideas (work in progress)

[Figure: SDMS Architecture.png]

  • When delivering a page, the web server assembles the response using normal MW functionality (including its caching) but leaves a placeholder where the page's wikitext HTML would normally go. (See Article::view in https://www.mediawiki.org/wiki/Manual:Article.php. Open question: would https://www.mediawiki.org/wiki/Manual:Hooks/ArticleFromTitle be an appropriate hook?)
    • The page's HTML is obtained from the Elasticsearch server.
    • Extension:SDMS has a configuration setting to override normal MW page delivery functionality accordingly.
  • When delivering a page in edit mode, normal MW functionality is left unchanged.
    • When receiving an edited page:
      1. Execute the normal MW page store procedure.
      2. Update the corresponding Elasticsearch page document with:
        • Non-analyzed raw HTML, obtained by parsing the page's wikitext into HTML
        • Full-text analyzed HTML
        • Structured page data, obtained by analyzing template calls (direct annotations are ignored as an anti-best practice).
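
The save and delivery flow outlined above could be sketched as follows, in Python for brevity. This is a minimal illustration under stated assumptions, not the extension's implementation: `InMemoryIndex` stands in for the Elasticsearch server, `render_html` for the MediaWiki parser, the `PLACEHOLDER` marker is hypothetical, and the template regex only handles flat calls of the form `{{Name|key=value}}`.

```python
import html
import re
from typing import Any

PLACEHOLDER = "<!-- SIR:PAGE_HTML -->"  # hypothetical marker left in the MW response
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)\|([^{}]*)\}\}")


class InMemoryIndex:
    """Stand-in for the Elasticsearch server: one document per page title."""

    def __init__(self) -> None:
        self.docs: dict[str, dict[str, Any]] = {}

    def put(self, title: str, doc: dict[str, Any]) -> None:
        self.docs[title] = doc

    def get(self, title: str) -> dict[str, Any]:
        return self.docs[title]


def render_html(wikitext: str) -> str:
    """Placeholder for the real MediaWiki parser: drops templates, escapes the rest."""
    body = TEMPLATE_RE.sub("", wikitext).strip()
    return f"<p>{html.escape(body)}</p>"


def extract_template_data(wikitext: str) -> dict[str, dict[str, str]]:
    """Collect key=value parameters of flat (non-nested) template calls."""
    data: dict[str, dict[str, str]] = {}
    for name, params in TEMPLATE_RE.findall(wikitext):
        pairs = (p.split("=", 1) for p in params.split("|") if "=" in p)
        data[name.strip()] = {k.strip(): v.strip() for k, v in pairs}
    return data


def on_page_saved(index: InMemoryIndex, title: str, wikitext: str) -> None:
    """Step 2 of the save procedure: refresh the page document after MW stored it."""
    raw_html = render_html(wikitext)
    index.put(title, {
        "raw_html": raw_html,       # would be mapped as non-analyzed
        "analyzed_html": raw_html,  # would be mapped as analyzed full text
        "templates": extract_template_data(wikitext),
    })


def deliver_page(index: InMemoryIndex, title: str, response_html: str) -> str:
    """Fill the placeholder in the MW-assembled response with the cached HTML."""
    return response_html.replace(PLACEHOLDER, index.get(title)["raw_html"])


index = InMemoryIndex()
on_page_saved(index, "Vacation policy",
              "{{Guideline|department=HR|status=approved}}\nStaff get 30 days & more.")
page = deliver_page(index, "Vacation policy", f"<main>{PLACEHOLDER}</main>")
```

In a real deployment the two HTML fields would differ only in their Elasticsearch mapping (stored-only versus analyzed), which is why the sketch writes the same value to both.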