Full-Text Indexing and Search


    Search engine

    The KUSoftas full-text indexing and search engine (referred to in this document as the Search engine) is an enterprise search platform designed to run on shared hosting. It is a pure PHP application and does not require a standalone Java-based full-text search server such as Apache Solr or Elasticsearch (both are built on Apache Lucene). The Search engine is tightly integrated into the CMS and makes it easy to create search indexes that fit customer needs. Because it runs on inexpensive shared hosting, the cost of ownership is low, which matters to small businesses and to larger enterprises alike.

    The Search engine brings enterprise-grade full-text indexing and search to low-cost systems, where running Apache Solr or Elasticsearch has not been practical.

    Search engine structure:

    • Sites. Search engine can be used for:
      • Site search
      • E-commerce catalog search and filtering
      • Library system catalog search and A-Z browsing
      • Other
    • Tools. See the Search engine tools documentation for more information.
    • Other. An index update may be triggered by calling the RESTful API directly (for example, from a Linux cron job). Third-party applications may use the RESTful API too.
    • RESTful API. All interactions with the Search engine go through this API. See the Search engine RESTful API documentation for more information.
    • Indexing. Site pages, plugin data, and other external items may be indexed.
    • Search. Searches may use a keyword, a phrase, or an expression in Apache Lucene Query Syntax (proximity and regular expression searches are currently not supported). The index may also be browsed by word in A-Z order, as in a library system.
    • Management. This is the back end for the Search engine tools.
    • DB and FSA. All index data are stored in an RDBMS schema. Fuzzy search uses a Levenshtein FSA (finite state automaton).

    Indexing process

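    Both site pages and items from external sources may be indexed by the Search engine. This can be done in two ways, index creation or index update; the analysis and storage stages listed after them apply to both: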

    • Index creation (using the Search engine tools or the RESTful API directly). The index content is fully recreated in this case. Only site pages that are enabled for indexing (using the CMS Pages Tool) are indexed. To respond to an index creation request with external items, a CMS plugin must implement a special class content in the namespace cms\plugin\plugin-name with a method updateIndex($id, $key). Method parameters are:
      • id - language code
      • key - index key; NULL - default index
      The method must return an associative array of external data with the structure described below (Index update). A sample implementation, cms\plugin\news\content::updateIndex, is available in the CMS News plugin source.
    • Index update. In this case indexing is done in two steps, queue and update (using the Search engine tools or the Search engine RESTful API directly). Both site pages (queued when page content is updated) and external items (e.g. from the CMS News plugin) may be queued. External items must contain the following data:
      • seq - item unique id
      • action - add/remove
      • ext - true
      • label - item label
      • title - item title
      • description - item description
      • keywords - item keywords
      • content - item content
      • language - item language
      • url - item URL
      • code - item classification code
      • creator - item creator name
      • created - item creation date
    • Analysis. Item content is extracted by an indexing task. An indexing task is an XSLT stylesheet, extended with PHP functions, that transforms item content into indexing statements. A sample indexing task is the one used for this site. The XSLT stylesheet has access to the following variables:
      • objLabel - item label
      • objTitle - item title
      • objDescription - item description
      • objKeywords - item keywords
      • objLanguage - item language
      • objSEQ - item unique id
      • objPID - item permanent unique ID
      • objCode - item classification code
      • objModel - item model ID
      • objURL - item URL
      • objCreator - item creator
      • objCreateDate - item creation date
      • objEditor - item editor
      • objLastModDate - item last modification date
      • objBoost - item boost number (default = 1.0)
      Indexing statements:
      • IndexDocument. Properties:
        • pid - permanent unique ID
        • seq - internal unique ID
        • boost - boost value
      • IndexField. Properties:
        • IFname - index field code
        • index - UN_TOKENIZED/TOKENIZED (if TOKENIZED extract words)
        • store - YES/NO (if YES - store text into index)
        • termVector - YES/NO (if YES - use in search results score calculation)
      Functions injected by Search engine:
      • php:function('getCMSData', 'objectTopic') - get item top classification codes
      • php:function('getCMSData', 'objectText') - get item body text
      • php:function('getCMSData', 'streamXML', string(@dsid)) - get item digital object datastream XML content by datastream ID
      • php:function('getCMSData', 'catalogXML') - get catalogue (e.g. eShop) XML content
      • php:function('getCMSData', 'relationXML') - get item digital object relation with external system (e.g. eShop) XML content
      • php:function('getCMSData', 'streamText', string(@dsid)) - get item digital object datastream text content by datastream ID
      The transformation applies the XSLT stylesheet to the digital object datastreams (if any), which are received in XML format from the RESTful API call /objects/:pid/datastreams (Get digital object datastreams list). If the index property of an IndexField statement is TOKENIZED, the content is tokenized using the stop words list (stop words are removed) and the boost words list (boost values are assigned to certain words). Samples of stop and boost words files: stop words, boost words.
    • Index DB. Indexed data are stored in an RDBMS.
    • Levenshtein FSA. Tokenized words and untokenized phrases are stored in a Levenshtein finite state automaton (FSA) to implement fuzzy search. The implementation of the Levenshtein FSA is based on the article "Fast String Correction with Levenshtein-Automata" by Klaus U. Schulz (CIS, University of Munich, schulz@cis.uni-muenchen.de) and Stoyan Mihov.
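
As a rough illustration of what a fuzzy lookup returns, the sketch below finds indexed terms within a given Levenshtein distance of a query word. It uses a plain dynamic-programming edit distance rather than a Levenshtein automaton, so it shows the result of a fuzzy lookup, not how the Search engine computes it. The function names and the sample term list are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance (insert, delete, substitute) between a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def fuzzy_terms(word: str, indexed_terms: list[str], max_edits: int = 1) -> list[str]:
    """Return the indexed terms within max_edits of word (hypothetical helper)."""
    if not 0 <= max_edits <= 2:
        raise ValueError("max_edits must be between 0 and 2")
    return [term for term in indexed_terms if levenshtein(word, term) <= max_edits]


# Hypothetical list of terms taken from an index.
terms = ["object", "objects", "subject", "page"]
print(fuzzy_terms("objec", terms, max_edits=1))  # ['object']
print(fuzzy_terms("objec", terms, max_edits=2))  # ['object', 'objects']
```

A linear scan like this compares the query word with every indexed term; an automaton-based approach avoids that, which is presumably why the Search engine uses one.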


    Search process

    • Search query. A search query may contain a keyword, a phrase, or an expression. The Search engine supports Apache Lucene Query Syntax:
      • Terms. A query consists of terms and operators. There are two types of terms: words (e.g. object) and phrases. A phrase is a group of words surrounded by double quotes, such as "digital object". Untokenized terms may be passed surrounded by square brackets and double quotes, as in ["digital object"]. Multiple terms can be combined with Boolean operators.
      • Fields. When performing a search you can either specify a field or use the default field. The field and the word are separated by a colon, e.g. doc.text:object.
      • Wildcard searches. To perform a single-character wildcard search use the "?" symbol, e.g. objec?. To perform a multiple-character wildcard search use the "*" symbol, e.g. obje*.
      • Fuzzy searches. The Search engine supports fuzzy searches based on Levenshtein distance. To do a fuzzy search, add the tilde symbol, "~", at the end of a word. An optional parameter after the tilde specifies the maximum number of edits allowed; the value is between 0 and 2 (default is 1). Examples: objec~, objec~1, objec~2.
      • Range searches. Range searches match items whose field values lie between the lower and upper bounds specified by the range query. Ranges can be inclusive or exclusive of the bounds. Examples: [2010-01-01 TO 2015-01-01] (inclusive), {2010-01-01 TO 2015-01-01} (exclusive).
      • Word boosting. To boost a word, add the caret symbol, "^", with a boost factor (a number) at the end of the term. The higher the boost factor, the more weight the word carries in search results scoring. Example: object^5.
      • Boolean operators. Boolean operators combine terms logically: AND, "+", OR, NOT and "-" (the word operators must be written in ALL CAPS):
        • OR. The OR operator is the default conjunction operator: if there is no Boolean operator between two terms, OR is used. The OR operator links two terms and finds a matching item if either of the terms exists in an item. The symbol || can be used in place of the word OR. Examples: digital object, digital OR object.
        • AND. The AND operator matches items where both terms exist anywhere in the text of a single item. The symbol && can be used in place of the word AND. Example: digital AND object.
        • +. The "+" (required) operator requires that the term after the "+" symbol exists somewhere in the field of a single item. Example: digital +object.
        • NOT. The NOT operator excludes items that contain the term after NOT. The symbol ! can be used in place of the word NOT. Example: digital NOT object.
        • -. The "-" (prohibit) operator excludes items that contain the term after the "-" symbol. Example: digital -object.
      • Grouping. The Search engine supports using parentheses to group clauses into subqueries. Example: (digital AND object) OR page.
      The Search engine currently does not support proximity or regular expression searches.
      There is also a special kind of search: term browsing. This is library-system-style A-Z browsing of index terms. To start browsing, enter an empty query, a word, or part of a word to set the start position.
    • Parse. The query parser parses the query expression and removes stop words. If the query contains a fuzzy search, the parser asks the Levenshtein FSA for the words within the requested Levenshtein distance. A cache may be used to speed up this process. The parser's output is an SQL query that performs the actual search in the Index DB.
    • Results. Search and browse results are available in several formats; for details see the Search engine RESTful API. Search and browse options and formats may be tested in the Search engine tool demo. A search with results in table format can be added to any site quickly and simply. Test it at http://www.kusoftas.com/search.
    • Results ordering. The default ordering is defined in the Search engine configuration. Field keys and the predefined word "score" may be used as ordering keys. Append the comma-separated word "reverse" to a key for descending order, and separate ordering keys with ";", e.g. doc.title;score,reverse. In addition, the table format allows results to be ordered by result column.
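
To make the ordering syntax concrete, here is a minimal sketch that parses an ordering string such as doc.title;score,reverse and applies it to a list of results. The function names and the dictionary-based result records are assumptions made for illustration; the Search engine itself translates queries to SQL and its internal handling is not shown here.

```python
def parse_ordering(spec: str) -> list[tuple[str, bool]]:
    """Parse 'key[,reverse];key[,reverse]...' into (key, descending) pairs."""
    ordering = []
    for part in spec.split(";"):
        if not part.strip():
            continue  # ignore empty segments such as a trailing ';'
        key, _, modifier = part.partition(",")
        key, modifier = key.strip(), modifier.strip()
        if not key or modifier not in ("", "reverse"):
            raise ValueError(f"invalid ordering segment: {part!r}")
        ordering.append((key, modifier == "reverse"))
    return ordering


def sort_results(results: list[dict], spec: str) -> list[dict]:
    """Sort result records (hypothetical dicts) by the parsed ordering keys."""
    ordered = list(results)
    # Python's sort is stable, so sorting by the last key first and the
    # first key last yields a correct multi-key ordering.
    for key, descending in reversed(parse_ordering(spec)):
        ordered.sort(key=lambda record: record[key], reverse=descending)
    return ordered


print(parse_ordering("doc.title;score,reverse"))
# [('doc.title', False), ('score', True)]

results = [
    {"doc.title": "B", "score": 0.4},
    {"doc.title": "A", "score": 0.9},
    {"doc.title": "A", "score": 0.2},
]
for record in sort_results(results, "doc.title;score,reverse"):
    print(record["doc.title"], record["score"])
```

With this ordering string, results are sorted by title in ascending order, and items with the same title are sorted by descending score.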

    Faceted search

    Faceted navigation provides multiple filters, one for each aspect of the content. For example, it can be used to create faceted filters in an eShop. The KUSoftas faceted search implementation supports two types of facets:

    • Minmax. Returns the minimum and maximum values of a given field within the search results. Example: the minimum and maximum price of catalogue items in the current search result set.
    • Item. Returns the count of items for each keyword of a given field within the search results. Example: the number of computers with a specific processor in the current search result set. Results can be filtered by:
      • Keyword (e.g. Processor type: Intel Core i7)
      • Value interval (e.g. Display diagonal: 12.5 - 13.7)

    Facets can be ordered by

    • Index - search results ordering
    • Count - facet count value

    Ordering can be ascending / descending (reverse).
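
The two facet types can be sketched over an in-memory result set as follows. This is an illustration only: the record layout, field names, and function names are hypothetical, "index" ordering is assumed to mean the order in which values first appear in the search results, and interval bounds are assumed to be inclusive.

```python
from collections import Counter


def minmax_facet(items: list[dict], field: str):
    """Return (min, max) of a field over the result set, or None if absent."""
    values = [item[field] for item in items if field in item]
    return (min(values), max(values)) if values else None


def item_facet(items: list[dict], field: str, order_by: str = "index",
               reverse: bool = False) -> list[tuple]:
    """Return (keyword, count) pairs for a field over the result set."""
    counts = Counter(item[field] for item in items if field in item)
    pairs = list(counts.items())  # insertion order = first appearance in results
    if order_by == "count":
        pairs.sort(key=lambda pair: pair[1], reverse=reverse)
    elif order_by == "index":
        if reverse:
            pairs.reverse()
    else:
        raise ValueError("order_by must be 'index' or 'count'")
    return pairs


def filter_by_interval(items: list[dict], field: str, low, high) -> list[dict]:
    """Keep items whose field value lies in [low, high] (inclusive, assumed)."""
    return [item for item in items if field in item and low <= item[field] <= high]


# Hypothetical eShop search results.
results = [
    {"title": "Laptop A", "price": 899.0, "processor": "Intel Core i7", "display": 13.3},
    {"title": "Laptop B", "price": 649.0, "processor": "Intel Core i5", "display": 15.6},
    {"title": "Laptop C", "price": 1199.0, "processor": "Intel Core i7", "display": 12.5},
]
print(minmax_facet(results, "price"))
print(item_facet(results, "processor", order_by="count", reverse=True))
print([item["title"] for item in filter_by_interval(results, "display", 12.5, 13.7)])
```

Computing facets over the current result set, rather than the whole index, is what lets the filter values narrow as the user refines a search.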


    Scoring search results

    The Search engine uses the tf-idf (term frequency–inverse document frequency) method to score search results:

    [Image: score formula]

    Where:

    • q - search query
    • d - item found using search query
    • D - set of all items in the whole index
    • t - term of the search query
    • w - word defined in the boost words list
    • f - index field
    • score(q,d) - score number of item found using search query
    • tf(t,d) - term frequency defined as the number of times term t appears in the currently scored item d (measure of how often a term appears in the item). Items that have more occurrences of a given term receive a higher score.
    • idf(t,D) - inverse item frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all items:
      [Image: idf formula]
    • N - total number of items in the index
    • boost(t,w,f,d) - boosting number (default 1.0), which consists of:
      [Image: boost formula]
    • boost(t) - term boosting number defined in the search query, e.g. object^5
    • boost(w) - word boosting number defined in the word boosting list
    • boost(f) - field boosting number defined using Search engine tools (Edit)
    • boost(d) - item boosting number defined using CMS tools (Page)