gather structured information from any website - ready for LLMs
The project separates concerns into distinct modules:
- `fetching` – obtain raw data from a source
- `parsing` – transform raw data into a structured form
- `extraction` – pull relevant information from the parsed data
- `output` – handle presentation of the extracted results
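As a rough sketch of how these stages line up with the public helpers described later in this README (the exact signature of `extract_information` is an assumption; adjust to the installed version):

```python
from ainfo import fetch_data, parse_data, extract_information

html = fetch_data("https://example.com")           # fetching: raw HTML
doc = parse_data(html, url="https://example.com")  # parsing: structured document
contacts = extract_information(doc)                # extraction: contact and social details
print(contacts.emails)                             # output: print or serialise (see to_json below)
```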
Install the project and run the CLI against a URL:
```bash
pip install ainfo
ainfo run https://example.com
```

The command fetches the page, parses its content and prints the page text.
Specify one or more built-in extractors with `--extract` to pull extra information. For example, to collect contact details and hyperlinks:
```bash
ainfo run https://example.com --extract contacts --extract links
```

Available extractors include:

- `contacts` – emails, phone numbers, addresses and social profiles
- `links` – all hyperlinks on the page
- `headings` – text of headings (h1–h6)
- `job_postings` – structured job advertisement details like position and location
Use `--json` to emit machine-readable JSON instead of the default human-friendly format. The JSON keys mirror the selected extractors, with `text` included by default. Pass `--no-text` when you only need the extraction results. Retrieve the JSON schema for contact details with `ainfo.output.json_schema`.
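For example, to print only the contact details as JSON:

```bash
ainfo run https://example.com --extract contacts --json --no-text
```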
For use within an existing asyncio application, the package exposes an `async_fetch_data` coroutine:
```python
import asyncio

from ainfo import async_fetch_data


async def main():
    html = await async_fetch_data("https://example.com")
    print(html[:60])


asyncio.run(main())
```

To delegate information extraction or summarisation to an LLM, provide an OpenRouter API key via the `OPENROUTER_API_KEY` environment variable and pass `--use-llm` or `--summarize`:
```bash
export OPENROUTER_API_KEY=your_key
ainfo run https://example.com --use-llm --summarize
```

Summaries are generated in German by default. Override the language with `--summary-language <LANG>` on the CLI or by setting the `AINFO_SUMMARY_LANGUAGE` environment variable. Provide your own instructions for the LLM with `--summary-prompt "..."` or point to a file containing the prompt via `--summary-prompt-file path/to/prompt.txt` (useful for longer templates). The `AINFO_SUMMARY_PROMPT` environment variable supplies a default prompt when no CLI override is given.
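For instance, to request an English summary driven by a prompt template stored in a file:

```bash
ainfo run https://example.com --use-llm --summarize \
  --summary-language English \
  --summary-prompt-file path/to/prompt.txt
```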
If the target site relies on client-side JavaScript, enable rendering with a headless browser:
```bash
ainfo run https://example.com --render-js
```

To crawl multiple pages starting from a URL and optionally run extractors on each page:

```bash
ainfo crawl https://example.com --depth 2 --extract contacts
```

The crawler visits pages breadth-first up to the specified depth and prints results for every page encountered. Pass `--json` to output the aggregated results as JSON instead.
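For example, to collect contacts from every crawled page and print the aggregate as one JSON document:

```bash
ainfo crawl https://example.com --depth 2 --extract contacts --json
```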
Both commands accept `--render-js` to execute JavaScript before scraping, which uses Playwright. Installing the browser drivers may require running `playwright install`.
Utilities `chunk_text` and `stream_chunks` are available to break large pages into manageable pieces when sending content to LLMs.
Most components can also be used directly from Python. Fetch and parse a page, then run the extractors yourself:
```python
from ainfo.extractors import AVAILABLE_EXTRACTORS
from ainfo import fetch_data, parse_data, extract_information, extract_custom

html = fetch_data("https://example.com")
doc = parse_data(html, url="https://example.com")

# Contact details via built-in extractor
contacts = AVAILABLE_EXTRACTORS["contacts"](doc)

# All links
links = AVAILABLE_EXTRACTORS["links"](doc)

# Any additional data via regular expressions
extra = extract_custom(doc, {"prices": r"\$\d+(?:\.\d{2})?"})

print(contacts.emails, extra["prices"])
```

Serialise results with `to_json` or inspect the JSON schema with `json_schema(ContactDetails)`.
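Building on the `contacts` object from the block above, a minimal sketch of both helpers (the import locations of `to_json` and `ContactDetails` are assumptions; `json_schema` is documented under `ainfo.output`):

```python
from ainfo.output import to_json, json_schema  # to_json location assumed
from ainfo.models import ContactDetails        # assumed location of the contact model

print(to_json(contacts))            # serialise the extracted contact details to JSON
print(json_schema(ContactDetails))  # JSON schema describing that structure
```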
To crawl multiple pages of the same site and aggregate the results in code, use `extract_site`. Pages are fetched breadth-first, deduplicated using a content hash and restricted to the starting domain by default:
```python
from ainfo import extract_site

pages = extract_site("https://example.com", depth=2, include_text=True)
for url, data in pages.items():
    print(url, data["contacts"].emails)
```

Define your own extractor by writing a function that accepts a `Document` and registering it in `ainfo.extractors.AVAILABLE_EXTRACTORS`:
```python
# my_extractors.py
from ainfo.models import Document
from ainfo.extraction import extract_custom
from ainfo.extractors import AVAILABLE_EXTRACTORS


def extract_prices(doc: Document) -> list[str]:
    data = extract_custom(doc, {"prices": r"\$\d+(?:\.\d{2})?"})
    return data.get("prices", [])


AVAILABLE_EXTRACTORS["prices"] = extract_prices
```

After importing `my_extractors` your extractor becomes available on the command line:

```bash
ainfo run https://example.com --extract prices --no-text
```

`extract_custom` can also delegate to a large language model. Supply an `LLMService` and a prompt describing the desired output:
```python
from ainfo import fetch_data, parse_data
from ainfo.extraction import extract_custom
from ainfo.llm_service import LLMService

html = fetch_data("https://example.com")
doc = parse_data(html, url="https://example.com")

with LLMService() as llm:
    data = extract_custom(
        doc,
        llm=llm,
        prompt="List all products with their prices as JSON under 'products'",
    )

print(data["products"])
```

Install the package and dump a page's data to a JSON file straight from the command line:

```bash
pip install ainfo
ainfo run https://example.com --json > contacts.json
```

Split a long page into chunks before handing it to your own summarisation function:

```python
from ainfo import fetch_data, parse_data, chunk_text
from some_llm import summarize  # pseudo-code: replace with your own LLM call

html = fetch_data("https://example.com")
doc = parse_data(html, url="https://example.com")
parts = [summarize(chunk) for chunk in chunk_text(doc.text_content(), 1000)]
print(" ".join(parts))
```

Fetch and chunk a page directly by URL or pass in raw text:
```python
from ainfo import stream_chunks

for chunk in stream_chunks("https://example.com", size=1000):
    handle(chunk)  # send to LLM or other processor
```

Copy `.env.example` to `.env` and fill in `OPENROUTER_API_KEY`, `OPENROUTER_MODEL`, and `OPENROUTER_BASE_URL` to enable LLM-powered features. Optional overrides such as `AINFO_SUMMARY_LANGUAGE` and `AINFO_SUMMARY_PROMPT` customise the default summary behaviour.
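A sketch of what the resulting `.env` might contain; the model id and base URL values below are placeholders, not defaults confirmed by the project:

```bash
# .env – placeholder values; adjust to your setup
OPENROUTER_API_KEY=your_key
# Any model id available through OpenRouter (the value below is only an example)
OPENROUTER_MODEL=openai/gpt-4o-mini
# Assumed OpenRouter endpoint
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
# Optional overrides for summaries
AINFO_SUMMARY_LANGUAGE=English
AINFO_SUMMARY_PROMPT="Summarise the page in two sentences."
```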
For automated version bumping and releases, see `RELEASE.md` for documentation on using the `release.sh` script.
A minimal FastAPI wrapper and accompanying Dockerfile live in the `integration/` directory. Build the container and run the service:
```bash
docker build -f integration/Dockerfile -t ainfo-api .
docker run -p 8877:8877 -e OPENROUTER_API_KEY=your_key -e AINFO_API_KEY=choose_a_secret ainfo-api
# or use an env file
docker run -p 8877:8877 --env-file .env ainfo-api
```

`integration/api.py` now calls the Python APIs directly rather than shelling out to the CLI. Two routes are available:

- `GET /run` – legacy behaviour for quick single-page lookups (still renders with JavaScript, uses the `contacts` extractor and returns a summary)
- `POST /run` – fully configurable crawling endpoint that accepts a JSON body
Example request using the new `POST /run` endpoint:
```bash
curl -X POST \
  -H 'X-API-Key: your_api_key' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "depth": 1,
    "use_llm": true,
    "summarize": true,
    "summary_language": "English",
    "summary_prompt": "Summarise the company positioning and recent news.",
    "extract": ["contacts", "links"],
    "include_text": false
  }' \
  http://localhost:8877/run
```

Because the prompt is part of the JSON payload it can be as long as needed without worrying about query-string limits. Responses contain one entry per visited page keyed by URL.
`integration/api.py` uses python-dotenv to load a `.env` file, so sensitive values such as `OPENROUTER_API_KEY` can be supplied via environment variables. Protect the endpoint by setting `AINFO_API_KEY` and include an `X-API-Key` header with that value on every request. This makes it easy to call ainfo from workflow tools like n8n.
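A sketch of an authenticated call against the legacy `GET /run` route; the query-parameter name `url` is an assumption, so check `integration/api.py` for the exact signature:

```bash
# -G turns the urlencoded data into a query string
curl -G -H 'X-API-Key: your_api_key' \
  --data-urlencode 'url=https://example.com' \
  http://localhost:8877/run
```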
- The built-in `extract_information` targets contact and social media details. Use `extract_custom` for other patterns or implement your own domain-specific extractors.