
ainfo


gather structured information from any website - ready for LLMs

Architecture

The project separates concerns into distinct modules (see the sketch after this list):

  • fetching – obtain raw data from a source
  • parsing – transform raw data into a structured form
  • extraction – pull relevant information from the parsed data
  • output – handle presentation of the extracted results
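
A minimal sketch of how these modules line up in practice, using the top-level helpers documented later in this README. The location of to_json is an assumption (json_schema is documented under ainfo.output, so to_json is assumed to live there too), and extract_information is assumed to accept the parsed Document:

from ainfo import fetch_data, parse_data, extract_information
from ainfo.output import to_json  # assumption: to_json sits alongside json_schema

html = fetch_data("https://example.com")            # fetching
doc = parse_data(html, url="https://example.com")   # parsing
details = extract_information(doc)                  # extraction (contact details)
print(to_json(details))                             # output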

Usage

Command line

Install the project and run the CLI against a URL:

pip install ainfo
ainfo run https://example.com

The command fetches the page, parses its content and prints the page text. Specify one or more built-in extractors with --extract to pull extra information. For example, to collect contact details and hyperlinks:

ainfo run https://example.com --extract contacts --extract links

Available extractors include the following (a snippet after this list shows how to enumerate them from Python):

  • contacts – emails, phone numbers, addresses and social profiles
  • links – all hyperlinks on the page
  • headings – text of headings (h1–h6)
  • job_postings – structured job advertisement details like position and location
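
The registered names, including any custom extractors you add later, can be listed from Python:

from ainfo.extractors import AVAILABLE_EXTRACTORS

print(sorted(AVAILABLE_EXTRACTORS))  # e.g. contacts, headings, job_postings, links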

Use --json to emit machine-readable JSON instead of the default human-friendly format. The JSON keys mirror the selected extractors, with text included by default. Pass --no-text when you only need the extraction results. Retrieve the JSON schema for contact details with ainfo.output.json_schema.
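
For example, a quick way to inspect that schema from Python. This is a sketch that assumes ContactDetails lives in ainfo.models alongside Document:

from ainfo.models import ContactDetails  # assumption: ContactDetails is defined here
from ainfo.output import json_schema

print(json_schema(ContactDetails))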

For use within an existing asyncio application, the package exposes an async_fetch_data coroutine:

import asyncio
from ainfo import async_fetch_data

async def main():
    html = await async_fetch_data("https://example.com")
    print(html[:60])

asyncio.run(main())
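
Several pages can be fetched concurrently by combining the coroutine with asyncio.gather; a sketch, not part of the package:

import asyncio
from ainfo import async_fetch_data

async def main():
    urls = ["https://example.com", "https://example.org"]
    # Fetch all pages concurrently; each result is the raw HTML string.
    pages = await asyncio.gather(*(async_fetch_data(u) for u in urls))
    for url, html in zip(urls, pages):
        print(url, len(html))

asyncio.run(main())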

To delegate information extraction or summarisation to an LLM, provide an OpenRouter API key via the OPENROUTER_API_KEY environment variable and pass --use-llm or --summarize:

export OPENROUTER_API_KEY=your_key
ainfo run https://example.com --use-llm --summarize

Summaries are generated in German by default. Override the language with --summary-language <LANG> on the CLI or by setting the AINFO_SUMMARY_LANGUAGE environment variable. Provide your own instructions for the LLM with --summary-prompt "..." or point to a file containing the prompt via --summary-prompt-file path/to/prompt.txt (useful for longer templates). The AINFO_SUMMARY_PROMPT environment variable supplies a default prompt when no CLI override is given.

If the target site relies on client-side JavaScript, enable rendering with a headless browser:

ainfo run https://example.com --render-js

To crawl multiple pages starting from a URL and optionally run extractors on each page:

ainfo crawl https://example.com --depth 2 --extract contacts

The crawler visits pages breadth-first up to the specified depth and prints results for every page encountered. Pass --json to output the aggregated results as JSON instead.

Both commands accept --render-js to execute JavaScript before scraping; rendering relies on Playwright, and installing its browser drivers may require running playwright install once.

The utilities chunk_text and stream_chunks break large pages into manageable pieces before sending content to an LLM; see the workflow examples below.

Programmatic API

Most components can also be used directly from Python. Fetch and parse a page, then run the extractors yourself:

from ainfo import fetch_data, parse_data, extract_information, extract_custom
from ainfo.extractors import AVAILABLE_EXTRACTORS

html = fetch_data("https://example.com")
doc = parse_data(html, url="https://example.com")

# Contact details via built-in extractor
contacts = AVAILABLE_EXTRACTORS["contacts"](doc)

# All links
links = AVAILABLE_EXTRACTORS["links"](doc)

# Any additional data via regular expressions
extra = extract_custom(doc, {"prices": r"\$\d+(?:\.\d{2})?"})
print(contacts.emails, extra["prices"])

Serialise results with to_json or inspect the JSON schema with json_schema(ContactDetails).
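
A short sketch of serialising the extractor output from the example above; it assumes to_json is exposed from ainfo.output alongside json_schema:

from ainfo.output import to_json  # assumption: to_json lives next to json_schema

print(to_json(contacts))  # contact details as a JSON string
print(to_json(links))     # hyperlinks as JSON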

To crawl multiple pages of the same site and aggregate the results in code, use extract_site. Pages are fetched breadth-first, deduplicated using a content hash and restricted to the starting domain by default:

from ainfo import extract_site

pages = extract_site("https://example.com", depth=2, include_text=True)

for url, data in pages.items():
    print(url, data["contacts"].emails)

Custom extractors

Define your own extractor by writing a function that accepts a Document and registering it in ainfo.extractors.AVAILABLE_EXTRACTORS.

# my_extractors.py
from ainfo.models import Document
from ainfo.extraction import extract_custom
from ainfo.extractors import AVAILABLE_EXTRACTORS

def extract_prices(doc: Document) -> list[str]:
    data = extract_custom(doc, {"prices": r"\$\d+(?:\.\d{2})?"})
    return data.get("prices", [])

AVAILABLE_EXTRACTORS["prices"] = extract_prices

After importing my_extractors, your extractor becomes available on the command line:

ainfo run https://example.com --extract prices --no-text
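
The registered extractor can also be invoked directly from Python:

import my_extractors  # importing the module registers the "prices" extractor
from ainfo import fetch_data, parse_data
from ainfo.extractors import AVAILABLE_EXTRACTORS

html = fetch_data("https://example.com")
doc = parse_data(html, url="https://example.com")
print(AVAILABLE_EXTRACTORS["prices"](doc))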

LLM-based extraction

extract_custom can also delegate to a large language model. Supply an LLMService and a prompt describing the desired output:

from ainfo import fetch_data, parse_data
from ainfo.extraction import extract_custom
from ainfo.llm_service import LLMService

html = fetch_data("https://example.com")
doc = parse_data(html, url="https://example.com")

with LLMService() as llm:
    data = extract_custom(
        doc,
        llm=llm,
        prompt="List all products with their prices as JSON under 'products'",
    )
print(data["products"])

Workflow examples

Save contact details to JSON

pip install ainfo
ainfo run https://example.com --extract contacts --json > contacts.json

Summarize a large page with chunk_text

from ainfo import fetch_data, parse_data, chunk_text
from some_llm import summarize  # pseudo-code

html = fetch_data("https://example.com")
doc = parse_data(html, url="https://example.com")

parts = [summarize(chunk) for chunk in chunk_text(doc.text_content(), 1000)]
print(" ".join(parts))

Stream chunks on the fly

Fetch and chunk a page directly by URL or pass in raw text:

from ainfo import stream_chunks

for chunk in stream_chunks("https://example.com", size=1000):
    handle(chunk)  # send to LLM or other processor

Environment configuration

Copy .env.example to .env and fill in OPENROUTER_API_KEY, OPENROUTER_MODEL, and OPENROUTER_BASE_URL to enable LLM-powered features. Optional overrides such as AINFO_SUMMARY_LANGUAGE and AINFO_SUMMARY_PROMPT customise the default summary behaviour.
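
A minimal .env sketch; the values below are placeholders, and the base URL shown is the public OpenRouter endpoint (adjust it if you use a proxy):

# .env – placeholder values, adjust to your setup
OPENROUTER_API_KEY=your_key
OPENROUTER_MODEL=your_preferred_model
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
# optional overrides
AINFO_SUMMARY_LANGUAGE=English
AINFO_SUMMARY_PROMPT=Summarise the page in two sentences.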

Development & Releases

For automated version bumping and releases, see RELEASE.md for documentation on using the release.sh script.

n8n integration

A minimal FastAPI wrapper and accompanying Dockerfile live in the integration/ directory. Build the container and run the service:

docker build -f integration/Dockerfile -t ainfo-api .
docker run -p 8877:8877 -e OPENROUTER_API_KEY=your_key -e AINFO_API_KEY=choose_a_secret ainfo-api
# or use an env file
docker run -p 8877:8877 --env-file .env ainfo-api

integration/api.py now calls the Python APIs directly rather than shelling out to the CLI. Two routes are available:

  • GET /run – legacy behaviour for quick single-page lookups (still renders with JavaScript, uses the contacts extractor and returns a summary)
  • POST /run – fully configurable crawling endpoint that accepts a JSON body

Example request using the new POST /run endpoint:

curl -X POST \
  -H 'X-API-Key: your_api_key' \
  -H 'Content-Type: application/json' \
  -d '{
        "url": "https://example.com",
        "depth": 1,
        "use_llm": true,
        "summarize": true,
        "summary_language": "English",
        "summary_prompt": "Summarise the company positioning and recent news.",
        "extract": ["contacts", "links"],
        "include_text": false
      }' \
  http://localhost:8877/run

Because the prompt is part of the JSON payload, it can be as long as needed without worrying about query-string limits. Responses contain one entry per visited page, keyed by URL.
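
The same request from Python, as a sketch using the third-party requests library (install it separately):

import requests

payload = {
    "url": "https://example.com",
    "depth": 1,
    "use_llm": True,
    "summarize": True,
    "summary_language": "English",
    "extract": ["contacts", "links"],
    "include_text": False,
}
response = requests.post(
    "http://localhost:8877/run",
    json=payload,
    headers={"X-API-Key": "your_api_key"},
    timeout=120,
)
response.raise_for_status()
for url, page in response.json().items():  # one entry per visited page, keyed by URL
    print(url, page)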

integration/api.py uses python-dotenv to load a .env file, so sensitive values such as OPENROUTER_API_KEY can be supplied via environment variables. Protect the endpoint by setting AINFO_API_KEY and include an X-API-Key header with that value on every request. This makes it easy to call ainfo from workflow tools like n8n.

Limitations

  • The built-in extract_information targets contact and social media details. Use extract_custom for other patterns or implement your own domain-specific extractors.
