
ainfo


gather structured information from any website - ready for LLMs

Architecture

The project separates concerns into distinct modules (see the sketch after this list):

  • fetching – obtain raw data from a source
  • parsing – transform raw data into a structured form
  • extraction – pull relevant information from the parsed data
  • output – handle presentation of the extracted results
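
A minimal sketch of how these modules line up in practice, using the top-level helpers documented later in this README. The location of to_json is an assumption (json_schema is documented under ainfo.output, so to_json is assumed to live there too), and extract_information is assumed to accept the parsed Document:

from ainfo import fetch_data, parse_data, extract_information
from ainfo.output import to_json  # assumption: to_json sits alongside json_schema

html = fetch_data("https://example.com")            # fetching
doc = parse_data(html, url="https://example.com")   # parsing
details = extract_information(doc)                  # extraction (contact details)
print(to_json(details))                             # output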

Usage

Command line

Install the project and run the CLI against a URL:

pip install ainfo
ainfo run https://example.com

The command fetches the page, parses its content and prints the page text. Specify one or more built-in extractors with --extract to pull extra information. For example, to collect contact details and hyperlinks:

ainfo run https://example.com --extract contacts --extract links

Available extractors include the following (a snippet after this list shows how to enumerate them from Python):

  • contacts – emails, phone numbers, addresses and social profiles
  • links – all hyperlinks on the page
  • headings – text of headings (h1–h6)
  • job_postings – structured job advertisement details like position and location
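
The registered names, including any custom extractors you add later, can be listed from Python:

from ainfo.extractors import AVAILABLE_EXTRACTORS

print(sorted(AVAILABLE_EXTRACTORS))  # e.g. contacts, headings, job_postings, links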

Use --json to emit machine-readable JSON instead of the default human-friendly format. The JSON keys mirror the selected extractors, with text included by default. Pass --no-text when you only need the extraction results. Retrieve the JSON schema for contact details with ainfo.output.json_schema.
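
For example, a quick way to inspect that schema from Python. This is a sketch that assumes ContactDetails lives in ainfo.models alongside Document:

from ainfo.models import ContactDetails  # assumption: ContactDetails is defined here
from ainfo.output import json_schema

print(json_schema(ContactDetails))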

For use within an existing asyncio application, the package exposes an async_fetch_data coroutine:

import asyncio
from ainfo import async_fetch_data

async def main():
    html = await async_fetch_data("https://example.com")
    print(html[:60])

asyncio.run(main())
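
Several pages can be fetched concurrently by combining the coroutine with asyncio.gather; a sketch, not part of the package:

import asyncio
from ainfo import async_fetch_data

async def main():
    urls = ["https://example.com", "https://example.org"]
    # Fetch all pages concurrently; each result is the raw HTML string.
    pages = await asyncio.gather(*(async_fetch_data(u) for u in urls))
    for url, html in zip(urls, pages):
        print(url, len(html))

asyncio.run(main())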

To delegate information extraction or summarisation to an LLM, provide an OpenRouter API key via the OPENROUTER_API_KEY environment variable and pass --use-llm or --summarize:

export OPENROUTER_API_KEY=your_key
ainfo run https://example.com --use-llm --summarize

Summaries are generated in German by default. Override the language with --summary-language <LANG> on the CLI or by setting the AINFO_SUMMARY_LANGUAGE environment variable. Provide your own instructions for the LLM with --summary-prompt "..." or point to a file containing the prompt via --summary-prompt-file path/to/prompt.txt (useful for longer templates). The AINFO_SUMMARY_PROMPT environment variable supplies a default prompt when no CLI override is given.

If the target site relies on client-side JavaScript, enable rendering with a headless browser:

ainfo run https://example.com --render-js

To crawl multiple pages starting from a URL and optionally run extractors on each page:

ainfo crawl https://example.com --depth 2 --extract contacts

The crawler visits pages breadth-first up to the specified depth and prints results for every page encountered. Pass --json to output the aggregated results as JSON instead.

Both commands accept --render-js to execute JavaScript before scraping; rendering relies on Playwright, and installing its browser drivers may require running playwright install once.

The utilities chunk_text and stream_chunks break large pages into manageable pieces before sending content to an LLM; see the workflow examples below.

Programmatic API

Most components can also be used directly from Python. Fetch and parse a page, then run the extractors yourself:

from ainfo import fetch_data, parse_data, extract_information, extract_custom
from ainfo.extractors import AVAILABLE_EXTRACTORS

html = fetch_data("https://example.com")
doc = parse_data(html, url="https://example.com")

# Contact details via built-in extractor
contacts = AVAILABLE_EXTRACTORS["contacts"](doc)

# All links
links = AVAILABLE_EXTRACTORS["links"](doc)

# Any additional data via regular expressions
extra = extract_custom(doc, {"prices": r"\$\d+(?:\.\d{2})?"})
print(contacts.emails, extra["prices"])

Serialise results with to_json or inspect the JSON schema with json_schema(ContactDetails).
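
A short sketch of serialising the extractor output from the example above; it assumes to_json is exposed from ainfo.output alongside json_schema:

from ainfo.output import to_json  # assumption: to_json lives next to json_schema

print(to_json(contacts))  # contact details as a JSON string
print(to_json(links))     # hyperlinks as JSON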

To crawl multiple pages of the same site and aggregate the results in code, use extract_site. Pages are fetched breadth-first, deduplicated using a content hash and restricted to the starting domain by default:

from ainfo import extract_site

pages = extract_site("https://example.com", depth=2, include_text=True)

for url, data in pages.items():
    print(url, data["contacts"].emails)

Custom extractors

Define your own extractor by writing a function that accepts a Document and registering it in ainfo.extractors.AVAILABLE_EXTRACTORS.

# my_extractors.py
from ainfo.models import Document
from ainfo.extraction import extract_custom
from ainfo.extractors import AVAILABLE_EXTRACTORS

def extract_prices(doc: Document) -> list[str]:
    data = extract_custom(doc, {"prices": r"\$\d+(?:\.\d{2})?"})
    return data.get("prices", [])

AVAILABLE_EXTRACTORS["prices"] = extract_prices

After importing my_extractors, your extractor becomes available on the command line:

ainfo run https://example.com --extract prices --no-text
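
The registered extractor can also be invoked directly from Python:

import my_extractors  # importing the module registers the "prices" extractor
from ainfo import fetch_data, parse_data
from ainfo.extractors import AVAILABLE_EXTRACTORS

html = fetch_data("https://example.com")
doc = parse_data(html, url="https://example.com")
print(AVAILABLE_EXTRACTORS["prices"](doc))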

LLM-based extraction

extract_custom can also delegate to a large language model. Supply an LLMService and a prompt describing the desired output:

from ainfo import fetch_data, parse_data
from ainfo.extraction import extract_custom
from ainfo.llm_service import LLMService

html = fetch_data("https://example.com")
doc = parse_data(html, url="https://example.com")

with LLMService() as llm:
    data = extract_custom(
        doc,
        llm=llm,
        prompt="List all products with their prices as JSON under 'products'",
    )
print(data["products"])

Workflow examples

Save contact details to JSON

pip install ainfo
ainfo run https://example.com --extract contacts --json > contacts.json

Summarize a large page with chunk_text

from ainfo import fetch_data, parse_data, chunk_text
from some_llm import summarize  # pseudo-code

html = fetch_data("https://example.com")
doc = parse_data(html, url="https://example.com")

parts = [summarize(chunk) for chunk in chunk_text(doc.text_content(), 1000)]
print(" ".join(parts))

Stream chunks on the fly

Fetch and chunk a page directly by URL or pass in raw text:

from ainfo import stream_chunks

for chunk in stream_chunks("https://example.com", size=1000):
    handle(chunk)  # send to LLM or other processor

Environment configuration

Copy .env.example to .env and fill in OPENROUTER_API_KEY, OPENROUTER_MODEL, and OPENROUTER_BASE_URL to enable LLM-powered features. Optional overrides such as AINFO_SUMMARY_LANGUAGE and AINFO_SUMMARY_PROMPT customise the default summary behaviour.
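
A minimal .env sketch; the values below are placeholders, and the base URL shown is the public OpenRouter endpoint (adjust it if you use a proxy):

# .env – placeholder values, adjust to your setup
OPENROUTER_API_KEY=your_key
OPENROUTER_MODEL=your_preferred_model
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
# optional overrides
AINFO_SUMMARY_LANGUAGE=English
AINFO_SUMMARY_PROMPT=Summarise the page in two sentences.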

Development & Releases

For automated version bumping and releases, see RELEASE.md for documentation on using the release.sh script.

n8n integration

A minimal FastAPI wrapper and accompanying Dockerfile live in the integration/ directory. Build the container and run the service:

docker build -f integration/Dockerfile -t ainfo-api .
docker run -p 8877:8877 -e OPENROUTER_API_KEY=your_key -e AINFO_API_KEY=choose_a_secret ainfo-api
# or use an env file
docker run -p 8877:8877 --env-file .env ainfo-api

integration/api.py now calls the Python APIs directly rather than shelling out to the CLI. Two routes are available:

  • GET /run – legacy behaviour for quick single-page lookups (still renders with JavaScript, uses the contacts extractor and returns a summary)
  • POST /run – fully configurable crawling endpoint that accepts a JSON body

Example request using the new POST /run endpoint:

curl -X POST \
  -H 'X-API-Key: your_api_key' \
  -H 'Content-Type: application/json' \
  -d '{
        "url": "https://example.com",
        "depth": 1,
        "use_llm": true,
        "summarize": true,
        "summary_language": "English",
        "summary_prompt": "Summarise the company positioning and recent news.",
        "extract": ["contacts", "links"],
        "include_text": false
      }' \
  http://localhost:8877/run

Because the prompt is part of the JSON payload, it can be as long as needed without worrying about query-string limits. Responses contain one entry per visited page, keyed by URL.
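
The same request from Python, as a sketch using the third-party requests library (install it separately):

import requests

payload = {
    "url": "https://example.com",
    "depth": 1,
    "use_llm": True,
    "summarize": True,
    "summary_language": "English",
    "extract": ["contacts", "links"],
    "include_text": False,
}
response = requests.post(
    "http://localhost:8877/run",
    json=payload,
    headers={"X-API-Key": "your_api_key"},
    timeout=120,
)
response.raise_for_status()
for url, page in response.json().items():  # one entry per visited page, keyed by URL
    print(url, page)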

integration/api.py uses python-dotenv to load a .env file, so sensitive values such as OPENROUTER_API_KEY can be supplied via environment variables. Protect the endpoint by setting AINFO_API_KEY and include an X-API-Key header with that value on every request. This makes it easy to call ainfo from workflow tools like n8n.

Limitations

  • The built-in extract_information targets contact and social media details. Use extract_custom for other patterns or implement your own domain-specific extractors.
