Open Web Crawler cannot extract Swiftype-style meta tags via extraction_rulesets (meta class filtering limitation) #413

### Problem Description

We are migrating content from a legacy Swiftype/App Search–based implementation to the Elastic Open Web Crawler.  
Our existing content relies on Swiftype-style meta tags embedded in HTML, for example:

<meta class="swiftype" name="boost" content="4">
<meta class="swiftype" name="published_at" content="2026-01-22T11:57:53.510-06:00">

During testing, we confirmed:

- These meta tags exist in the HTML and are valid.
- The Open Web Crawler successfully indexes standard fields such as title, body_content, links, headings, description, keywords, and site_name (via og:site_name or ingest fallbacks).
- However, Swiftype-style meta tags are not extractable via extraction_rulesets.

Root cause identified through code review and testing:
- extraction_rulesets only extract node text content.
- <meta> tags do not contain inner text, only attributes (e.g., content=).
- extract_by_css_selector and extract_by_xpath_selector return text nodes only.
- Attribute values are therefore unreachable by design.
- raw_html is not exposed in a way that reliably enables ingest pipeline–based extraction at crawl scale.
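
The distinction between text content and attribute values can be seen outside the crawler. The following is a minimal sketch using only Python's standard-library `html.parser`; it is not the crawler's code, and the `MetaInspector` class is a hypothetical name introduced for illustration. It shows that a `<meta>` tag yields no text nodes, while its value sits in the `content` attribute.

```python
from html.parser import HTMLParser


class MetaInspector(HTMLParser):
    """Records text nodes and <meta> attributes separately (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.text_chunks = []  # non-blank text nodes seen anywhere in the document
        self.meta_attrs = []   # one dict of attributes per <meta> tag

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.meta_attrs.append(dict(attrs))

    def handle_data(self, data):
        if data.strip():
            self.text_chunks.append(data.strip())


html = '<head><meta class="swiftype" name="boost" content="4"></head>'
inspector = MetaInspector()
inspector.feed(html)
inspector.close()

# Text-only extraction finds nothing; the value is only in the attribute.
text_content = "".join(inspector.text_chunks)
boost_value = inspector.meta_attrs[0]["content"]
```

Here `text_content` is empty, which matches what a text-only selector would have to work with, while `boost_value` holds the string `"4"`.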

As a result, fields such as boost and published_at cannot be extracted at crawl time, even though the values are present in the page HTML and readable by external tools.

This creates a migration gap for users moving from Swiftype/App Search metadata conventions to the Open Web Crawler.

---

### Proposed Solution

Make attribute values reachable from extraction_rulesets or ingest pipelines through one of the following options:

Option A:
Allow extraction rules to specify an attribute:
selector: meta.swiftype[name=boost]
attribute: content

Option B:
Provide a generic meta-tag extraction helper similar to existing helpers (meta_keywords, meta_description, meta_tags_elastic), but configurable:
- Support arbitrary meta tag classes (e.g., swiftype)
- Support name/content pairs dynamically

Option C:
Optionally expose raw_html (or selected meta blocks) in a supported, documented way for ingest pipelines, with clear guidance on performance and storage implications.

Any one of these would allow users to extract structured metadata currently stored in meta attributes.
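
To make the intended semantics of Options A and B concrete, here is a rough Python sketch using only the standard library. It is not a proposal for the crawler's implementation, and the names `extract_attribute` and `extract_pairs` are hypothetical. `extract_attribute` mirrors Option A (a selector plus an attribute name), and `extract_pairs` mirrors Option B (all name/content pairs for a given meta class).

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects the attributes of every <meta> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            # Valueless attributes arrive as None; normalize them to "".
            self.metas.append({k: (v or "") for k, v in attrs})


def _metas_with_class(html, meta_class):
    """Return attribute dicts of <meta> tags whose class list includes meta_class."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return [m for m in collector.metas if meta_class in m.get("class", "").split()]


def extract_attribute(html, meta_class, name, attribute="content"):
    """Option A semantics: one attribute of the first matching tag, else None."""
    for meta in _metas_with_class(html, meta_class):
        if meta.get("name") == name:
            return meta.get(attribute)
    return None


def extract_pairs(html, meta_class):
    """Option B semantics: every name/content pair for the given meta class."""
    return {
        m["name"]: m["content"]
        for m in _metas_with_class(html, meta_class)
        if "name" in m and "content" in m
    }


page = """
<head>
  <meta name="description" content="not a swiftype tag">
  <meta class="swiftype" name="boost" content="4">
  <meta class="swiftype" name="published_at" content="2026-01-22T11:57:53.510-06:00">
</head>
"""

boost = extract_attribute(page, "swiftype", "boost")
pairs = extract_pairs(page, "swiftype")
```

In this sketch the plain `description` tag is ignored because it lacks the `swiftype` class, and all values come back as strings; any type conversion (integer boost, date parsing) would be a separate step.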

---

### Alternatives

- Manual post-processing using external scripts (e.g., Python crawlers) to re-fetch pages and update documents after indexing.
- Re-implementing boost and published_at semantics using Elastic-native scoring, ranking, and date fields instead of Swiftype metadata.
- Hard-coding additional meta helpers in the crawler for specific legacy conventions.

These alternatives are workable but add operational complexity and make migration more difficult for large sites.
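
For reference, the first alternative (external post-processing) might look roughly like the sketch below. It assumes the page HTML has already been re-fetched and that the caller knows each document's `_id`; how the crawler assigns IDs is not covered here. The index name `crawler-index`, the ID `doc-1`, and the function names are hypothetical. The output is an NDJSON body in the shape expected by the Elasticsearch `_bulk` API with `update` actions; sending it is left out.

```python
import json
from datetime import datetime
from html.parser import HTMLParser


class SwiftypeMetaParser(HTMLParser):
    """Collects name/content pairs from <meta class="swiftype"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "meta" and "swiftype" in classes and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("content") or ""


def typed_fields(html):
    """Extract Swiftype fields and coerce the two fields named in this issue."""
    parser = SwiftypeMetaParser()
    parser.feed(html)
    parser.close()
    doc = dict(parser.fields)
    if "boost" in doc:
        doc["boost"] = int(doc["boost"])
    if "published_at" in doc:
        # Validate only: raises ValueError if the timestamp cannot be parsed.
        datetime.fromisoformat(doc["published_at"])
    return doc


def build_bulk_updates(index, pages):
    """Build an NDJSON partial-update body from a {doc_id: html} mapping."""
    lines = []
    for doc_id, html in pages.items():
        doc = typed_fields(html)
        if not doc:
            continue  # nothing to update for this page
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"doc": doc}))
    return "\n".join(lines) + "\n" if lines else ""


page = (
    '<head><meta class="swiftype" name="boost" content="4">'
    '<meta class="swiftype" name="published_at" '
    'content="2026-01-22T11:57:53.510-06:00"></head>'
)
payload = build_bulk_updates("crawler-index", {"doc-1": page})
```

Even in this reduced form, the script needs a second pass over every page plus ID bookkeeping, which is the operational overhead described above.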

---

### Additional Context

- This limitation appears to be intentional given the crawler’s current design, but it is not obvious from the documentation that attribute extraction is unsupported.
- The issue primarily affects customers migrating from Swiftype/App Search who have years of metadata encoded in meta tags.
- We can provide reproducible test URLs, crawler.yml configuration, and extracted HTML examples if needed.
- This request is intended as an enhancement, not a bug report.

We are seeking confirmation on whether:
- Attribute extraction support is planned.
- This is an intentional long-term limitation.
- There is a recommended migration pattern for Swiftype-style metadata.

@Anish Mathur
