EU AI Act RAG

A Full-Featured RAG Pipeline Example Based on Regulation (EU) 2024/1689 — The European AI Act

Pipeline

flowchart TD
    CLI["run.py --workflow workflows/eu-ai-act.yaml"]
    CONFIG["src/config.py<br/>load_config → PipelineConfig"]
    PIPELINE["src/pipeline.py<br/>run_pipeline"]

    CLI -->|"YAML parse"| CONFIG
    CONFIG -->|"PipelineConfig"| PIPELINE

    subgraph PHASE_1 ["Phase 1 — SPARQL"]
        S_LOOP["For each step in<br/>config.sparql.steps"]
        S_RENDER["src/sparql/queries.py<br/>render_step"]
        S_QUERY["src/sparql/client.py<br/>execute_query"]
        S_SCRIPT["src/sparql/processor.py<br/>execute_script"]
        S_CTX["context[step.name] = output"]

        S_LOOP --> S_RENDER
        S_RENDER -->|"SPARQL query string"| S_QUERY
        S_QUERY -->|"list&lt;Binding&gt;"| S_SCRIPT
        S_SCRIPT -->|"output"| S_CTX
        S_CTX -->|"next step"| S_LOOP
    end

    subgraph PHASE_2 ["Phase 2 — Fetch"]
        F_URI["select_uri<br/>uri_select_script from YAML"]
        F_DL["_download<br/>GET + Accept: application/zip"]
        F_ZIP["_extract_zip<br/>ZIP → *.fmx.xml to tmpdir"]

        F_URI -->|"manifestation URL"| F_DL
        F_DL -->|"bytes"| F_ZIP
    end

    subgraph PHASE_3 ["Phase 3 — Parse"]
        P_ACT["*.000101.fmx.xml<br/>Main ACT file"]
        P_ART["parse_articles<br/>DIVISION → ARTICLE → PARAG"]
        P_REC["parse_recitals<br/>PREAMBLE → CONSID → NP"]
        P_ANX["parse_annex<br/>*.01XXXX.fmx.xml × 13"]
        P_DOC["ParsedDocument<br/>articles + recitals + annexes"]

        P_ACT --> P_ART
        P_ACT --> P_REC
        P_ANX --> P_DOC
        P_ART --> P_DOC
        P_REC --> P_DOC
    end

    subgraph PHASE_4 ["Phase 4 — Convert + Postprocess"]
        C_FM["_build_frontmatter<br/>YAML frontmatter from templates"]
        C_BODY["_article_to_markdown<br/>Structured body text"]
        C_NORM["_normalize<br/>NBSP, ZWSP, BOM cleanup"]
        C_WRITE["write_text<br/>corpus/**/*.md"]

        C_FM --> C_NORM
        C_BODY --> C_NORM
        C_NORM --> C_WRITE
    end

    subgraph PHASE_V ["Phase 3.5 — Validate (advisory)"]
        V_SRC["_build_source_text_map<br/>Re-parse XML → raw text"]
        V_DET["_run_deterministic<br/>counts · empty · numbering · structure"]
        V_COV["coverage ratio<br/>parsed_len / source_len"]
        V_RPT["validation-report.json"]

        V_SRC --> V_DET --> V_COV --> V_RPT
    end

    PIPELINE --> PHASE_1
    PHASE_1 -->|"context dict"| PHASE_2
    PHASE_2 -->|"tmpdir: Path"| PHASE_3
    PHASE_3 -->|"ParsedDocument"| PHASE_V
    PHASE_V -->|"always continues"| PHASE_4

    PHASE_4 -->|"PipelineSummary"| SUMMARY["Pipeline Summary<br/>sparql: 4 ok | fetch: 1 ok | validation: 3 ok<br/>articles: 113 | recitals: 180 | annexes: 13"]

    style PHASE_1 fill:#1a1a2e,stroke:#e94560,color:#fff
    style PHASE_2 fill:#1a1a2e,stroke:#0f3460,color:#fff
    style PHASE_3 fill:#1a1a2e,stroke:#16213e,color:#fff
    style PHASE_V fill:#1a1a2e,stroke:#c0392b,color:#fff
    style PHASE_4 fill:#1a1a2e,stroke:#1a8a42,color:#fff

Quick Start

pip install -r workflow-engine/requirements.txt
python run.py --workflow workflows/eu-ai-act.yaml

Output

dist/eu-ai-act-{timestamp}/
├── validation-report.json
└── corpus/
    ├── articles/*.md          (113)
    ├── recitals/*.md          (180)
    └── annexes/*.md           (13)

Validation

The pipeline runs a deterministic validation step between parse and convert (Phase 3.5). It operates in advisory mode — logs warnings and generates a report but never blocks the pipeline.

Two independent text extraction paths are compared for each item:

Path	Method	Scope
Source (validator)	`etree.tostring(method="text")`	All text nodes — exhaustive
Parsed (parser)	Selective tag traversal	Only handled elements

If the parser's structural traversal misses a sub-tree, the exhaustive source path catches it through the coverage ratio (parsed_len / source_len).

Checks:

Check	Type	Threshold
Count validation	articles=113, recitals=180, annexes=13	exact match
Empty content	items with no body text	any = fail
Sequential numbering	gaps in article numbers	any = warn
Structural integrity	missing title or chapter context	any = warn
Coverage ratio	`parsed_len / source_len` per item	< 0.8 = warn

Configuration in workflows/eu-ai-act.yaml under the validation: section.

RAG Configuration

Parameter	Value
Embedding Model	`@cf/qwen/qwen3-embedding-0.6b`
Generation Model	`@cf/meta/llama-3.3-70b-instruct-fp8-fast`
Chunk Size	384 tokens
Chunk Overlap	20%
Vector Store	Cloudflare Vectorize (1024 dimensions)
Object Storage	Cloudflare R2

Cloudflare Documentation

Project Structure

eu-ai-act-rag/
├── run.py                          # Entry point
├── workflows/
│   └── eu-ai-act.yaml             # Workflow definition
├── workflow-engine/
│   ├── requirements.in
│   ├── requirements.txt
│   └── src/
│       ├── config.py
│       ├── converter.py
│       ├── fetcher.py
│       ├── logger.py
│       ├── parser.py
│       ├── pipeline.py
│       ├── result.py
│       ├── validator.py
│       └── sparql/
│           ├── client.py
│           ├── processor.py
│           └── queries.py
├── worker/                         # Cloudflare Worker (AutoRAG API)
│   ├── wrangler.jsonc
│   ├── package.json
│   └── src/
│       ├── index.ts
│       ├── config.ts
│       ├── types.ts
│       ├── translations.ts
│       ├── ai/
│       │   ├── manager.ts
│       │   └── prompts/
│       │       └── system-prompt.md
│       ├── middleware/
│       │   └── ratelimit.ts
│       ├── validation/
│       │   ├── request.ts
│       │   └── schema.ts
│       └── utils/
│           ├── errors.ts
│           └── logging.ts
├── playground/                     # Streamlit on Cloudflare Containers
│   ├── wrangler.jsonc
│   ├── package.json
│   ├── Dockerfile
│   ├── start.sh
│   ├── src/
│   │   └── index.ts               # Hono reverse proxy
│   └── app/
│       ├── app.py
│       ├── translations.py
│       ├── export_utils.py
│       ├── locales/
│       │   ├── en.json
│       │   └── tr.json
│       └── .streamlit/
│           └── config.toml
└── dist/                           # Output (gitignored)

License

See LICENSE and THIRD_PARTY_LICENSES for details.

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
.github		.github
assets		assets
playground		playground
worker		worker
workflow-engine		workflow-engine
workflows		workflows
.gitignore		.gitignore
LICENSE		LICENSE
PRIVACY_NOTICE.md		PRIVACY_NOTICE.md
PRIVACY_NOTICE_TR.md		PRIVACY_NOTICE_TR.md
README.md		README.md
README_TR.md		README_TR.md
THIRD_PARTY_LICENSES		THIRD_PARTY_LICENSES
run.py		run.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

EU AI Act RAG

Pipeline

Quick Start

Output

Validation

RAG Configuration

Cloudflare Documentation

Project Structure

License

About

Uh oh!

Releases 5

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

EU AI Act RAG

Pipeline

Quick Start

Output

Validation

RAG Configuration

Cloudflare Documentation

Project Structure

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 5

Contributors

Uh oh!

Languages