Zero-dependency Python client for extracting structured content from Grokipedia pages.
Install from PyPI:

```shell
pip install grokipedia-py
```

Or install straight from the repository:

```shell
uv pip install git+https://github.com/caentzminger/grokipedia-py.git
uv add "grokipedia-py @ git+https://github.com/caentzminger/grokipedia-py.git"
```

Fetch and inspect a page:

```python
from grokipedia import from_url

page = from_url("https://grokipedia.com/page/13065923")
print(page.title)
print(page.slug)
print(page.intro_text)
print(page.infobox[:3])
print(page.lead_figure)
print([section.title for section in page.sections])

# First non-empty media list found in any subsection, or [] if none.
first_media = next(
    (
        subsection.media
        for section in page.sections
        for subsection in section.subsections
        if subsection.media
    ),
    [],
)
print(first_media[:1])

print(len(page.references))
print(page.links[:5])
print(page.metadata.keywords)
print(page.markdown[:500])
print(page.to_json(indent=2))
```

Parse raw HTML without network access:
```python
from grokipedia import from_html

page = from_html(html, source_url="https://grokipedia.com/page/13065923")
```

Resolve a page from a title:
```python
from grokipedia import page

page_obj = page('"Hello, World!" program')
```

Search for page URLs:
```python
from grokipedia import search

results = search("hello world")
print(results[:5])
```

If this returns `[]`, try:

```python
results = search("hello world", respect_robots=False)
```

As of February 18, 2026, https://grokipedia.com/robots.txt disallows `/api/`, and `/search` is mostly client-rendered HTML.
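Why an empty result can be correct behavior: a `robots.txt` rule that disallows `/api/` blocks the API search endpoint while leaving regular page URLs fetchable. A minimal sketch of that check using only the standard library's `urllib.robotparser` (illustration only, not the library's actual implementation):

```python
from urllib.robotparser import RobotFileParser

# Feed the parser the rule described above instead of fetching it,
# so this sketch needs no network access.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /api/",
])

# The API search endpoint is blocked for generic crawlers...
print(robots.can_fetch("*", "https://grokipedia.com/api/full-text-search"))  # False
# ...while ordinary page URLs remain fetchable.
print(robots.can_fetch("*", "https://grokipedia.com/page/13065923"))  # True
```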
Use the class-based API with sitemap manifest caching:

```python
from grokipedia import Grokipedia

wiki = Grokipedia(verbose=True)
result = wiki.page("The C Programming Language")
matches = wiki.search("programming language")

# Lazy sitemap lookup + cached child sitemap manifests.
url = wiki.find_page_url('"Hello, World!" program')
manifest = wiki.refresh_manifest()
```

The library uses Python's standard logging module (logger namespace: `grokipedia`):
```python
import logging

logging.basicConfig(level=logging.INFO)
logging.getLogger("grokipedia").setLevel(logging.DEBUG)
```

This project stays runtime dependency-free (`dependencies = []`) and relies on the standard library for runtime behavior.
Development tasks are driven by `just`:

```shell
just setup
just fmt-py
just lint-py
just typecheck
just test
just ci
```

`from_url()` enforces `robots.txt` by default:
- `respect_robots=True` (default): validate `robots.txt` before the page fetch.
- `search()` first tries `/api/full-text-search` and falls back to parsing the `/search` HTML.
- `allow_robots_override=False` (default): strict mode.
  - If `robots.txt` is unavailable or malformed, the library fails closed with `RobotsUnavailableError`.
  - If the URL is disallowed, it raises `RobotsDisallowedError`.
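The fail-closed logic can be sketched in plain standard-library terms. Everything here is a local stand-in: `check_robots` is a hypothetical helper, and the two exception classes merely mirror the library's documented names.

```python
from urllib.robotparser import RobotFileParser


class RobotsUnavailable(Exception):
    """Stand-in for the library's RobotsUnavailableError."""


class RobotsDisallowed(Exception):
    """Stand-in for the library's RobotsDisallowedError."""


def check_robots(robots_lines, url, user_agent="*", respect_robots=True):
    """Fail-closed robots gate: raise unless the URL is explicitly allowed."""
    if not respect_robots:
        return  # enforcement bypassed
    if robots_lines is None:
        # robots.txt could not be fetched or parsed: fail closed.
        raise RobotsUnavailable(url)
    parser = RobotFileParser()
    parser.parse(robots_lines)
    if not parser.can_fetch(user_agent, url):
        raise RobotsDisallowed(url)


rules = ["User-agent: *", "Disallow: /api/"]
check_robots(rules, "https://grokipedia.com/page/13065923")  # allowed: no error
check_robots(rules, "https://grokipedia.com/api/x", respect_robots=False)  # bypassed
```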
You can bypass robots enforcement by setting either:

- `respect_robots=False`, or
- `allow_robots_override=True`
`from_url()` and `from_html()` return a `Page` with:

- `url`
- `slug`
- `title`
- `intro_text`
- `infobox` (`InfoboxField` list for `dt`/`dd` fact rows)
- `lead_figure` (`LeadFigure` from the top figure image/caption when present)
- `sections` (`Section` tree with nested `subsections`; each section includes indexed `media`)
- `references` (`Reference` list)
- `links` (ordered unique links extracted from the main article)
- `metadata` (`PageMetadata`, including optional `keywords`)
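Because sections form a tree, collecting every media item takes a small recursive walk. A sketch against a stand-in dataclass that mirrors only the documented `Section` shape (`title`, nested `subsections`, per-section `media`); the real class has more fields:

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    # Stand-in with just the documented fields, for illustration.
    title: str
    subsections: list["Section"] = field(default_factory=list)
    media: list[str] = field(default_factory=list)


def iter_media(sections):
    """Yield every media item from a section tree, depth-first."""
    for section in sections:
        yield from section.media
        yield from iter_media(section.subsections)


tree = [
    Section("History", media=["a.png"], subsections=[
        Section("Origins", media=["b.png"]),
    ]),
    Section("Usage"),
]
print(list(iter_media(tree)))  # ['a.png', 'b.png']
```

The same walk works on a real `page.sections` list, since it only touches the three documented attributes.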
`Page` also includes:

- `lede_text` (alias of `intro_text`)
- `lead_media` (alias of `lead_figure`)
- `markdown`
- `to_dict()` / `to_json()`
- `from_dict()` / `from_json()`
All library exceptions inherit from `GrokipediaError`:

- `FetchError`
- `HttpStatusError`
- `PageNotFoundError`
- `RobotsUnavailableError`
- `RobotsDisallowedError`
- `ParseError`
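The shared base means one `except GrokipediaError` handler covers every failure mode. A local mirror of that shape (these classes are stand-ins; the inheritance beyond the shared base is an assumption here, with all six modeled as direct subclasses):

```python
class GrokipediaError(Exception):
    """Stand-in for the library's base exception (illustration only)."""


class FetchError(GrokipediaError): pass
class HttpStatusError(GrokipediaError): pass
class PageNotFoundError(GrokipediaError): pass
class RobotsUnavailableError(GrokipediaError): pass
class RobotsDisallowedError(GrokipediaError): pass
class ParseError(GrokipediaError): pass


def fetch_or_none(fetch):
    """One broad handler covers every library failure mode."""
    try:
        return fetch()
    except GrokipediaError:
        return None


def boom():
    raise PageNotFoundError("no such page")


print(fetch_or_none(boom))  # None
```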