Open Web Crawler cannot extract Swiftype-style meta tags via extraction_rulesets (meta class filtering limitation) #413

@lcrane777

Problem Description

We are migrating content from a legacy Swiftype/App Search–based implementation to the Elastic Open Web Crawler.
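Our existing content relies on Swiftype-style meta tags embedded in HTML: <meta class="swiftype"> elements that carry the field name in the name attribute and the value in the content attribute (for example, boost and published_at).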

During testing, we confirmed:

  • These meta tags exist in the HTML and are valid.
  • The Open Web Crawler successfully indexes standard fields such as title, body_content, links, headings, description, keywords, and site_name (via og:site_name or ingest fallbacks).
  • However, Swiftype-style meta tags are not extractable via extraction_rulesets.

Root cause identified through code review and testing:

  • extraction_rulesets only extract node text content.
  • <meta> tags do not contain inner text, only attributes (e.g., content=).
  • extract_by_css_selector and extract_by_xpath_selector return text nodes only.
  • Attribute values are therefore unreachable by design.
  • raw_html is not exposed in a way that reliably enables ingest pipeline–based extraction at crawl scale.
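
To make the root cause concrete, here is a minimal standalone sketch (plain Python standard library, not crawler code) showing that the values live only in attributes. The sample HTML, its values, and the names SAMPLE_HTML, SwiftypeMetaParser, and extract_swiftype_meta are hypothetical and used for illustration only.

```python
from html.parser import HTMLParser

# Hypothetical sample page. The tag shape follows the Swiftype convention
# (class="swiftype", name=..., content=...); the values are made up.
SAMPLE_HTML = """
<html><head>
  <title>Example page</title>
  <meta class="swiftype" name="boost" data-type="integer" content="5" />
  <meta class="swiftype" name="published_at" data-type="date" content="2020-01-15" />
</head><body><p>Hello</p></body></html>
"""


class SwiftypeMetaParser(HTMLParser):
    """Collects name -> content pairs from <meta class="swiftype"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # <meta> is a void element: it has no text node, so the only place
        # a value can be read from is the attribute list.
        if tag != "meta":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "swiftype" in classes and attr.get("name") and attr.get("content") is not None:
            self.fields[attr["name"]] = attr["content"]


def extract_swiftype_meta(html):
    """Return a dict of Swiftype-style meta fields found in the HTML string."""
    parser = SwiftypeMetaParser()
    parser.feed(html)
    parser.close()
    return parser.fields


if __name__ == "__main__":
    print(extract_swiftype_meta(SAMPLE_HTML))
```

A selector that returns only node text has nothing to return for these elements, which matches the behavior we observed.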

As a result, fields such as boost and published_at cannot be extracted at crawl time, even though the data is present and accessible externally.

This creates a migration gap for users moving from Swiftype/App Search metadata conventions to the Open Web Crawler.


Proposed Solution

Support attribute-level extraction in extraction_rulesets, for example:

Option A:
Allow extraction rules to specify an attribute:
selector: meta.swiftype[name=boost]
attribute: content

Option B:
Provide a generic meta-tag extraction helper similar to existing helpers (meta_keywords, meta_description, meta_tags_elastic), but configurable:

  • Support arbitrary meta tag classes (e.g., swiftype)
  • Support name/content pairs dynamically

Option C:
Optionally expose raw_html (or selected meta blocks) in a supported, documented way for ingest pipelines, with clear guidance on performance and storage implications.

Any one of these would allow users to extract structured metadata currently stored in meta attributes.


Alternatives

  • Manual post-processing using external scripts (e.g., Python crawlers) to re-fetch pages and update documents after indexing.
  • Re-implementing boost and published_at semantics using Elastic-native scoring, ranking, and date fields instead of Swiftype metadata.
  • Hard-coding additional meta helpers in the crawler for specific legacy conventions.

These alternatives are workable but add operational complexity and make migration more difficult for large sites.
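
As a rough illustration of the first alternative, the sketch below builds a partial-update request body in the Elasticsearch _bulk NDJSON format from already-extracted meta fields. It is a sketch under assumptions, not a recommended pattern: the function name build_bulk_update, the index name, and the document ID are hypothetical, and how the crawler assigns document IDs is not established in this issue, so the caller would have to look the ID up.

```python
import json


def build_bulk_update(index, doc_id, meta_fields):
    """Return an NDJSON body containing one partial update for the _bulk API.

    index and doc_id are supplied by the caller (hypothetical values here).
    meta_fields is a dict such as {"boost": "5", "published_at": "2020-01-15"}.
    """
    doc = {}
    if "boost" in meta_fields:
        # Swiftype declared boost as an integer; cast it so it is indexed as a number.
        doc["boost"] = int(meta_fields["boost"])
    if "published_at" in meta_fields:
        # Passed through unchanged; the target field mapping decides how the date is parsed.
        doc["published_at"] = meta_fields["published_at"]

    action = {"update": {"_index": index, "_id": doc_id}}
    # The bulk format is one JSON object per line, with a trailing newline.
    return json.dumps(action) + "\n" + json.dumps({"doc": doc}) + "\n"


if __name__ == "__main__":
    body = build_bulk_update(
        "my-crawler-index",   # hypothetical index name
        "example-doc-id",     # hypothetical document ID
        {"boost": "5", "published_at": "2020-01-15"},
    )
    print(body, end="")
```

Even in this reduced form, the approach requires a second fetch and parse of every page plus an ID lookup, which is the operational overhead described above.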


Additional Context

  • This limitation appears to be intentional given the crawler’s current design, but it is not obvious from the documentation that attribute extraction is unsupported.
  • The issue primarily affects customers migrating from Swiftype/App Search who have years of metadata encoded in meta tags.
  • We can provide reproducible test URLs, crawler.yml configuration, and extracted HTML examples if needed.
  • This request is intended as an enhancement, not a bug report.

We are seeking confirmation on whether:

  • Attribute extraction support is planned.
  • This is an intentional long-term limitation.
  • There is a recommended migration pattern for Swiftype-style metadata.

@anish Mathur

Labels: enhancement (New feature or request)
