Topic on Extension talk:CirrusSearch/Flow

Integrate mapper-attachment-plugin to Extension:CirrusSearch?

Andreas Plank (talkcontribs)

(This is also related to Topic:Search inside uploaded documents.)

I’m running MW 1.28.2, Extension:CirrusSearch REL1_28, and Elasticsearch 2.4.5, and I’m experimenting with integrating the mapper-attachments plugin so that all kinds of files with file_media_type OFFICE get indexed. Right now I can find them by querying Elasticsearch directly, but they do not show up in the wiki search. My guess is that it all comes down to the proper mapping, which is tricky to get right.

  • Is it possible to use copy_to somehow, or to plug into an existing search filter like “insource:”, or do I have to write a SearchResult class? (I don't have access to the hook CirrusSearchAddQueryFeatures, which is MW 1.29+.)
  • How can I direct CirrusSearch to also read from my custom file_attachment field or its subfield file_attachment.content?
  • Can anybody point me in the right direction?

Thank you.

So far I have managed to index files with file_media_type OFFICE through the Elasticsearch plugin, using the CirrusSearch hooks below. The data can be found in Elasticsearch, but not through CirrusSearch:

$wgHooks['CirrusSearchMappingConfig'][] = function ( array &$config, $mappingConfigBuilder ) {
  foreach ($config['page']['properties'] as $key => &$PAGE_PROPERTIES) {
    if ($key == 'file_text') {
      /* https://stackoverflow.com/questions/36618549/is-it-possible-to-get-contents-of-copy-to-field-in-elasticsearch */
      $PAGE_PROPERTIES['store'] = true; /* add store=1 to defaults, no effect with copy_to */
    }
  }
  // plug in mapper-attachment
  $config['page']['properties']['file_attachment'] = [
    'type' => 'attachment',
    "fields" => [
       "content" => [
          "type" => "string",
          "copy_to" => ["all", "file_text"], /* no effect with copy_to */
          "analyzer" => "text",
          "search_analyzer" => "text_search",
       ]
    ]
  ];
};
$wgHooks['CirrusSearchBuildDocumentParse'][] = function ( 
  \Elastica\Document $Doc,
  Title $ThisTitle, 
  Content $PageContent, 
  ParserOutput $ParserOutput ) {
    global $wgTmpDirectory;
  
  $log_content= "\nDEBUG \$Doc:\n";
  $ThisLocalFile=wfFindFile($ThisTitle);
  $localFilePath = $ThisLocalFile instanceof File ? $ThisLocalFile->getLocalRefPath() : null;
  if ($Doc->namespace == NS_FILE
   && $Doc->has('file_media_type')
  ) {
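    /* only read the file when a local path was actually resolved (wfFindFile/getLocalRefPath can return false) */
    if (preg_match("@OFFICE@i", $Doc->get('file_media_type')) && is_string($localFilePath)) {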
      $Doc->set('file_attachment', base64_encode( file_get_contents($localFilePath) ) ) ;
      $log_content.= "\nDEBUG did set file_attachment\n";
    } else {
      $log_content.= "\nDEBUG file_media_type: {$Doc->file_media_type}\n";
    }
  }
  if ($Doc->namespace == NS_FILE) {
    $log_content.= "\nDEBUG \$ThisTitle:\n";
    $log_content.= var_export( $ThisTitle, true);
    
    $log_content.= "\nDEBUG \$ThisLocalFile:\n";
    $log_content.= $ThisLocalFile instanceof File ? $ThisLocalFile->getLocalRefPath() : var_export( $ThisLocalFile, true);
    $log_content.= var_export( $Doc, true);
    file_put_contents($wgTmpDirectory . "/CirrusSearchBuildDocumentParse.log", $log_content, FILE_APPEND );
  }
  return true;
};
require_once "$IP/extensions/Elastica/Elastica.php";
require_once "$IP/extensions/CirrusSearch/CirrusSearch.php";
DCausse (WMF) (talkcontribs)

Using copy_to to file_text and all sounds like a good solution to me, or at least one that should involve fewer modifications. The only thing I see missing is the highlighting config, which would need to include your new field file_attachment.content. In short, you'll be able to search for docs but you won't see any text snippets.

At a glance I don't see why it fails. Do you run updateSearchIndexConfig every time you change the hook, so that the mapping gets updated?

Note that you can append &cirrusDumpQuery to the URL of a search results page to see the JSON query that will be sent to Elasticsearch, which can help with debugging.

I don't see why you force store to true? It should not be needed.

Glad to see someone working on this, good luck!

Andreas Plank (talkcontribs)

Yes, I did run updateSearchIndexConfig every time I changed the mappings, but I also saw that some changes did not appear in curl -XGET 'http://localhost:9200/_all/_mapping'. I often deleted the whole index and mapping to see if it was working.

And … well …, guess what: copy_to functionality was removed for sub fields of type attachment in version 2.4 (see Mapping changes - Elasticsearch Reference 2.4), which is the version I have to use. In earlier versions this would work. See also the discussion on https://github.com/elastic/elasticsearch/issues/14946.

So, the question now is:

  • How can I direct extension:CirrusSearch to search also my file_attachment.content ?
  • Or is there another way to hook in for REL1_28?
DCausse (WMF) (talkcontribs)

Damn, it's a shame that copy_to is broken, the doc states the opposite...

Without copy_to I'm afraid you'll have to make more profound changes to CirrusSearch and hooks won't be sufficient.

I'd suggest patching Cirrus instead:

1. Config:

- Add a new wgCirrusSearchUseAttachmentPlugin config var to indicate that the plugin is installed

2. Mapping:

- Tweak getDefaultFields() in includes/Maintenance/MappingConfigBuilder.php to add your mapping (guarded by your new wgCirrusSearchUseAttachmentPlugin var)

- If you can, try to add a subfield named plain to the content subfield, with the analyzers plain and plain_search
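
A rough sketch of what the mapping from step 2 could look like, written as a plain PHP array so it can be adapted for getDefaultFields(). The function name buildAttachmentMapping is hypothetical and not part of CirrusSearch. The analyzer names text, text_search, plain and plain_search are the ones mentioned in this thread. Whether the attachment type on Elasticsearch 2.4 accepts a nested plain subfield under content is an assumption, so check the resulting mapping after running updateSearchIndexConfig.

```php
<?php
// Hypothetical helper (not part of CirrusSearch): returns the mapping for the
// file_attachment field, or an empty array when the plugin flag is off.
function buildAttachmentMapping( bool $useAttachmentPlugin ): array {
    if ( !$useAttachmentPlugin ) {
        return [];
    }
    return [
        'file_attachment' => [
            'type' => 'attachment',
            'fields' => [
                'content' => [
                    'type' => 'string',
                    'analyzer' => 'text',
                    'search_analyzer' => 'text_search',
                    // Assumption: a "plain" multi-field is accepted here.
                    'fields' => [
                        'plain' => [
                            'type' => 'string',
                            'analyzer' => 'plain',
                            'search_analyzer' => 'plain_search',
                        ],
                    ],
                ],
            ],
        ],
    ];
}
```

The returned array could then be merged into the page properties only when wgCirrusSearchUseAttachmentPlugin is true, so wikis without the plugin keep their current mapping.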

3. Indexing:

- Update buildDocumentsForPages in includes/Updater.php to add the code you currently have in the CirrusSearchBuildDocumentParse hook. Same here: guard your new code with the wgCirrusSearchUseAttachmentPlugin config var and namespace == NS_FILE.

4. Search:

- Tweak buildFullTextSearchFields in includes/Query/FullTextQueryStringQueryBuilder.php :

- change

 return [ "all${fieldSuffix}^${weight}" ];

to

if ($context->getConfig()->get( 'CirrusSearchUseAttachmentPlugin' ) && (!$namespaces || in_array( NS_FILE, $namespaces ))) {
   return [ "all${fieldSuffix}^${weight}", "file_attachment.content${fieldSuffix}" ];
} else {
   return [ "all${fieldSuffix}^${weight}" ];
}

But also add your new field in case the all field is not in use (in the same function): Change


                if ( !$namespaces || in_array( NS_FILE, $namespaces ) ) {
                        $fileTextWeight = $weight * $searchWeights[ 'file_text' ];
                        $fields[] = "file_text${fieldSuffix}^${fileTextWeight}";
                }

to


                if ( !$namespaces || in_array( NS_FILE, $namespaces ) ) {
                        $fileTextWeight = $weight * $searchWeights[ 'file_text' ];
                        $fields[] = "file_text${fieldSuffix}^${fileTextWeight}";
                        if ($context->getConfig()->get( 'CirrusSearchUseAttachmentPlugin' )) {
                           $fields[] = "file_attachment.content${fieldSuffix}^${fileTextWeight}";
                        }
                }

If you want, you can also adapt FullTextSimpleMatchQueryBuilder (not enabled by default).
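
To make the logic of the "all field" branch in step 4 easier to test outside MediaWiki, here is a minimal standalone sketch. The function name attachmentAwareFields and its parameters are hypothetical; in the real patch the flag would come from $context->getConfig() and the code would live in buildFullTextSearchFields. Giving the attachment field the same weight as the all field is an assumption made for the sketch.

```php
<?php
// NS_FILE is 6 in MediaWiki; defined here only so the sketch runs standalone.
if ( !defined( 'NS_FILE' ) ) {
    define( 'NS_FILE', 6 );
}

// Hypothetical standalone version of the "all field" branch from step 4.
// $namespaces: list of namespace ids, or null/empty for "all namespaces".
function attachmentAwareFields(
    bool $useAttachmentPlugin,
    ?array $namespaces,
    string $fieldSuffix,
    float $weight
): array {
    $fields = [ "all{$fieldSuffix}^{$weight}" ];
    // Only add the attachment field when the File namespace is searched.
    $searchesFiles = !$namespaces || in_array( NS_FILE, $namespaces );
    if ( $useAttachmentPlugin && $searchesFiles ) {
        // Assumption: same weight as the all field.
        $fields[] = "file_attachment.content{$fieldSuffix}^{$weight}";
    }
    return $fields;
}

// Example call: search restricted to the File namespace, plain subfields.
$exampleFields = attachmentAwareFields( true, [ NS_FILE ], '.plain', 0.5 );
```

With the flag off, or when the search does not cover NS_FILE, the function returns only the all field, so the query stays unchanged for everyone not using the plugin.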

5. Highlighting

Update FullTextResultsType#getHighlightingConfiguration in includes/Search/ResultsType.php and add a line for your new field, exactly the same way file_text is added.


Sorry, this is not obvious, but without copy_to the hooks cannot be used... And please feel free to upload a patch to Gerrit, I'd be happy to review it.
