20 July, 2021

Coveo for Sitecore Upgrade - HtmlContentInBodyWithRequestsProcessor to FetchPageContentProcessor - Length cannot be less than zero

After upgrading from Coveo for Sitecore 4 to version 5, we found that a new processor, FetchPageContentProcessor, replaces the old HtmlContentInBodyWithRequestsProcessor. Like the previous processor, it executes an HTTP request, gets the page response, and then sends the data to the Coveo cloud. Enabling this processor slows down indexing.

In Coveo 4 for Sitecore

<configuration xmlns:x="http://www.sitecore.net/xmlconfig/" 
  xmlns:patch="http://www.sitecore.net/xmlconfig/" xmlns:coveo="http://www.sitecore.net/xmlconfig/coveo/">
  <sitecore coveo:require="!disabled">
    <pipelines>
      <coveoPostItemProcessingPipeline>
        <processor type="Coveo.SearchProvider.Processors.HtmlContentInBodyWithRequestsProcessor, Coveo.SearchProviderBase">
          <StartCommentText>BEGIN NOINDEX</StartCommentText>
          <EndCommentText>END NOINDEX</EndCommentText>
        </processor>
      </coveoPostItemProcessingPipeline>
    </pipelines>
  </sitecore>
</configuration>

In Coveo 5 for Sitecore (recommended by Coveo)

<coveoPostItemProcessingPipeline>
  <processor type="Coveo.SearchProvider.Processors.ExecuteGetBinaryDataPipeline, Coveo.SearchProviderBase" />
</coveoPostItemProcessingPipeline>
<coveoGetBinaryData>
  <processor type="Coveo.SearchProvider.Processors.FetchPageContentProcessor, Coveo.SearchProviderBase">
    <inboundFilter hint="list:AddInboundFilter">
      <itemsWithLayout type="Coveo.SearchProvider.Processors.FetchPageContent.Filters.ItemsWithLayout, Coveo.SearchProviderBase" />
    </inboundFilter>
    <preAuthentication hint="list:AddPreAuthenticator" />
    <postProcessing hint="list:AddPostProcessing">
      <processor type="Coveo.SearchProvider.Processors.FetchPageContent.PostProcessing.CleanHtml, Coveo.SearchProviderBase">
        <startComment>BEGIN NOINDEX</startComment>
        <endComment>END NOINDEX</endComment>
      </processor>
    </postProcessing>
  </processor>
</coveoGetBinaryData>

In Coveo 5, a post-processing processor called CleanHtml runs on the content fetched by the HTTP request. It lets you tell Coveo to index only certain sections of your web page.

For example, if you want to keep the header, footer, and navigation out of the indexed document, you can mark those sections with the start comment and end comment. In this configuration, these are <!-- BEGIN NOINDEX --> and <!-- END NOINDEX -->.

In a Sitecore instance with Coveo 4, we had nested comments as shown below. The HtmlContentInBodyWithRequestsProcessor processor was able to handle the nested comments, remove the marked sections, and send the remaining HTML content to Coveo.

<!-- BEGIN NOINDEX -->
    <!-- BEGIN NOINDEX -->
    	Content 1
    <!-- END NOINDEX -->
    <!-- BEGIN NOINDEX -->
    	Content 2
    <!-- END NOINDEX -->
<!-- END NOINDEX -->

In Coveo 5, the new CleanHtml processor throws the exception below if there are nested comments. I decompiled both processors and tested them against HTML with nested comments, and the CleanHtml processor throws the exception while removing content.

ManagedPoolThread #19 02:19:52 ERROR An error occurred while trying to clean the HTML, no cleaning will be done.
Exception: System.ArgumentOutOfRangeException
Message: Length cannot be less than zero.
Parameter name: length
Source: mscorlib
   at System.String.Substring(Int32 startIndex, Int32 length)
   at Coveo.SearchProvider.Utils.HtmlCleaner.CleanHtmlContent(String p_HtmlContent, String p_StartCommentText, String p_EndCommentText)
   at Coveo.SearchProvider.Processors.HtmlContentInBodyWithRequestsProcessor.CollectHttpWebResponsesForAllClickableUris(List`1 p_CoveoIndexableItems, Dictionary`2 p_CleanedBinaryDataByUri)

We do not see a way to prevent this error when the page has nested comments. Overriding the CleanHtml processor with the old processor's HTML-cleaning method works, but I do not think patching old code into the new processor is a good approach. I raised a Coveo ticket to see if there is any workaround.
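To make the nesting problem concrete, here is a minimal sketch of a cleaner that tracks how many NOINDEX sections are open instead of pairing the first start comment with the first end comment. This is not Coveo's code and does not reproduce either decompiled processor; the function name clean_html and its parameters are hypothetical, and Python is used only for brevity (a real Sitecore processor would be written in C#).

```python
import re


def clean_html(html, start_text="BEGIN NOINDEX", end_text="END NOINDEX"):
    """Drop everything between matching start/end comments, honoring nesting.

    Hypothetical helper for illustration only; not part of any Coveo API.
    """
    marker = re.compile(
        r"<!--\s*(?:(?P<start>%s)|(?P<end>%s))\s*-->"
        % (re.escape(start_text), re.escape(end_text))
    )
    kept = []   # pieces of HTML that stay in the indexed document
    depth = 0   # number of NOINDEX sections currently open
    pos = 0     # index just past the previous marker
    for match in marker.finditer(html):
        if depth == 0:
            # Text before this marker lies outside every NOINDEX section.
            kept.append(html[pos:match.start()])
        if match.group("start"):
            depth += 1
        elif depth > 0:
            depth -= 1  # a stray end comment at depth 0 is simply dropped
        pos = match.end()
    if depth == 0:
        kept.append(html[pos:])  # an unclosed section drops the trailing text
    return "".join(kept)
```

Because content is kept only while the depth is zero, the nested structure shown earlier is removed as a single block and no negative substring length can arise.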

Update: According to Coveo, there is no workaround on the Sitecore side; the nested tags need to be removed.

  1. Jeff (@jflh) suggested that we could use a custom processor that cleans the HTML based on a CSS selector instead of the CleanHtml processor.
  2. He also suggested a Chrome extension that helps you decide which HTML elements the processor should clean. Very useful.
  3. Coveo has an Indexing Pipeline Extension (IPE) that works in the same fashion as the custom processor mentioned in the first point. This requires us to remove all BEGIN and END NOINDEX tags. Instead, we need to add a custom class, coveo-no-index, and the IPE will select those sections and remove them.
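The class-based approach from points 1 and 3 can be sketched as below. This is only an illustration of the idea using Python's standard html.parser module: it does not use the Coveo IPE API, the names NoIndexStripper and strip_no_index are hypothetical, and it assumes reasonably well-formed markup where every non-void element inside a removed section has a closing tag.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NoIndexStripper(HTMLParser):
    """Re-emit HTML while skipping elements that carry a given CSS class."""

    def __init__(self, css_class="coveo-no-index"):
        super().__init__(convert_charrefs=False)
        self.css_class = css_class
        self.kept = []        # markup that stays in the indexed document
        self.skip_depth = 0   # > 0 while inside a removed element

    def _is_marked(self, attrs):
        return self.css_class in (dict(attrs).get("class") or "").split()

    def handle_starttag(self, tag, attrs):
        if self.skip_depth or self._is_marked(attrs):
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        self.kept.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never open a nested level.
        if not self.skip_depth and not self._is_marked(attrs):
            self.kept.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
            return
        self.kept.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.kept.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.kept.append("&%s;" % name)

    def handle_charref(self, name):
        if not self.skip_depth:
            self.kept.append("&#%s;" % name)

    def handle_comment(self, data):
        if not self.skip_depth:
            self.kept.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        if not self.skip_depth:
            self.kept.append("<!%s>" % decl)


def strip_no_index(html, css_class="coveo-no-index"):
    parser = NoIndexStripper(css_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.kept)
```

Since removal follows the element tree rather than paired comments, marked elements nested inside other marked elements are dropped together with their parent.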

Sitecore Slack Chat: https://sitecorechat.slack.com/archives/C0CF16R9C/p1626979876116500

Reference

  1. Index Page Content With the FetchPageContentProcessor
  2. Nested NOINDEX tags preventing HTML content from being fetched during indexing
