This repository automates collecting and organizing personal documents (Instapaper posts, Snipd podcasts, Markdown notes, PDFs, images, tweets) in a local intranet workflow. Keep changes small, tested, and aligned with the current modular design.
- Source: top-level Python modules (for example `process_documents.py`, `pipeline_manager.py`, `utils.py`, `*_processor.py`).
- Tests: `tests/` with `pytest` suites and fixtures in `tests/fixtures/`.
- Utilities: `utils/` for helper scripts (for example `docflow_server.py`, `build_browse_index.py`, `build_reading_index.py`, `build_done_index.py`).
- Configuration: `config.py` (paths and env vars).
- Install deps: `pip install requests beautifulsoup4 markdownify openai pillow pytest markdown`
- Tweet capture deps: `pip install "playwright>=1.55" && playwright install chromium`
- Run full pipeline: `python process_documents.py all --year 2026`
- Selective run: `python process_documents.py pdfs md`
- Tweet queue: `python process_documents.py tweets`
- Create/refresh X likes state: `python utils/create_x_state.py --state-path "$HOME/.secrets/docflow/x_state.json"`
- Build intranet browse index: `python utils/build_browse_index.py --base-dir "/path/to/BASE_DIR"`
- Build intranet reading index: `python utils/build_reading_index.py --base-dir "/path/to/BASE_DIR"`
- Build intranet done index: `python utils/build_done_index.py --base-dir "/path/to/BASE_DIR"`
- Run intranet server: `python utils/docflow_server.py --base-dir "/path/to/BASE_DIR" --port 8080`
- Full document ingestion runner: `bash bin/docflow.sh all`
- Tests (verbose): `pytest -v`
- Targeted tests: `pytest tests/test_docflow_server.py -q`
- Single source of truth: `BASE_DIR`.
- Generated site: `BASE_DIR/_site/`.
- Local state: `BASE_DIR/state/` (`reading.json`, `done.json`, `highlights/`).
- Current branch: `git branch --show-current` (must be `main`).
- Remotes: `git remote -v` (origin → `https://github.com/domingogallardo/docflow.git`).
- Upstream tracking: `git rev-parse --abbrev-ref @{upstream} || echo "(no upstream)"`.
- No pending changes: `git status -sb`.
- Last commit: `git log -1 --oneline` (Conventional Commit style message).
- No divergence: `git fetch -p && git status -sb` (no `ahead`/`behind`).
- Push permissions: `git push --dry-run`.
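
The branch, upstream, pending-change, and divergence checks above can all be read from a single `git status -sb` call. The following is a minimal sketch, not part of the repository: `check_status` is a hypothetical helper that takes the text printed by `git status -sb` and returns a list of problems (an empty list means the checks pass).

```python
import re


def check_status(status_sb_output: str) -> list[str]:
    """Hypothetical helper: inspect `git status -sb` output for common problems."""
    lines = status_sb_output.splitlines()
    header = lines[0] if lines else ""
    if not header.startswith("## "):
        return ["unexpected output: missing '## <branch>' header"]

    problems: list[str] = []
    branch_info = header[3:]
    # The header looks like "main...origin/main [ahead 1, behind 2]".
    branch = branch_info.split("...")[0].split(" ")[0]
    if branch != "main":
        problems.append(f"current branch is {branch!r}, expected 'main'")
    if "..." not in branch_info:
        problems.append("no upstream configured")
    tracking = re.search(r"\[(.*)\]$", branch_info)
    if tracking:
        problems.append(f"not in sync with upstream: {tracking.group(1)}")
    # Every line after the header describes a changed or untracked file.
    if len(lines) > 1:
        problems.append(f"{len(lines) - 1} pending change(s)")
    return problems
```

To use it, capture the command output (for example with `subprocess.run(["git", "status", "-sb"], capture_output=True, text=True).stdout`) after running `git fetch -p`, and pass it to `check_status`.
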
Useful setup (new environment):
- Identity: `git config --get user.name` / `git config --get user.email`.
- Credentials: valid GitHub HTTPS token or SSH key.
- Default upstream: `git push -u origin main` (first time only).
- Keep all script messages in English.
- `BASE_DIR` must come from `DOCFLOW_BASE_DIR` in `~/.docflow_env`; do not hardcode it in `config.py`.
- For direct local commands in this repo, ensure the environment is loaded first (`source ~/.docflow_env`) so `config.py` resolves `BASE_DIR`.
- When the user asks to restart the intranet service, do not restart `python utils/docflow_server.py` manually. It is mandatory to restart the LaunchAgent `com.domingo.docflow.intranet` instead, for example with `launchctl kickstart -k gui/$(id -u)/com.domingo.docflow.intranet`.
- Preserve file `mtime` only for existing content files inside `BASE_DIR` (for example articles/pages under `Posts`, `Tweets`, etc.) unless the task explicitly requires changing ordering semantics.
- Do not preserve `mtime` for repository code/docs files outside `BASE_DIR` (for example `docs/*.md`, `utils/*.py`, tests).
- Use intranet routes and local state as canonical behavior (`docflow_server.py`, `_site`, `state`).
- API backward compatibility is not required in this repo: there is a single user/consumer. Prefer removing obsolete API terms and endpoints instead of keeping legacy aliases.
- Long-term operational notes live in `docs/memory.md`.
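
The `mtime` rule for content files inside `BASE_DIR` can be followed by recording the timestamps before an edit and restoring them afterwards. This is a minimal sketch under that assumption; `rewrite_preserving_mtime` is a hypothetical name, not an existing helper in `utils.py`.

```python
import os
from pathlib import Path


def rewrite_preserving_mtime(path: Path, new_text: str) -> None:
    """Hypothetical helper: replace a file's text but keep its timestamps."""
    before = path.stat()
    path.write_text(new_text, encoding="utf-8")
    # Restore the access and modification times recorded before the write.
    os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))
```

Repository code and docs outside `BASE_DIR` should be written normally, without this wrapper.
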
- For URLs like `http://localhost:8080/posts/raw/Posts%202026/...html`:
  - Resolve `BASE_DIR` from `DOCFLOW_BASE_DIR` (`~/.docflow_env`) or config fallback.
  - Decode URL-encoded filename.
  - Check exact path under `BASE_DIR/Posts/Posts <YEAR>/`.
  - If needed, do narrow search under `"$BASE_DIR/Posts"` only.
- Avoid full-home searches unless explicitly requested.
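
The URL-to-file steps above can be sketched as follows. Both function names are hypothetical, and the mapping of `/posts/raw/<relative path>` onto `BASE_DIR/Posts/<relative path>` is an assumption inferred from the example URL, not taken from `docflow_server.py`.

```python
from pathlib import Path
from urllib.parse import unquote, urlsplit

RAW_PREFIX = "/posts/raw/"  # assumed route prefix, taken from the example URL


def local_path_for_raw_url(url: str, base_dir: Path) -> Path:
    """Hypothetical helper: decode the URL and build the exact candidate path."""
    decoded = unquote(urlsplit(url).path)
    if not decoded.startswith(RAW_PREFIX):
        raise ValueError(f"not a raw posts URL: {url}")
    relative = decoded[len(RAW_PREFIX):]
    if ".." in Path(relative).parts:
        raise ValueError("path traversal is not allowed")
    return base_dir / "Posts" / relative


def find_under_posts(name: str, base_dir: Path) -> list[Path]:
    """Fallback: narrow search limited to BASE_DIR/Posts, never the home dir."""
    posts_dir = base_dir / "Posts"
    return sorted(p for p in posts_dir.rglob("*") if p.name == name)
```

Check the exact path first with `local_path_for_raw_url(...).exists()`, and call `find_under_posts` with the decoded filename only when that check fails.
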
- Python 3.10+; 4-space indentation; keep functions small and cohesive.
- Use type hints where practical and module-level docstrings.
- Modules: `snake_case.py`; classes: `CamelCase`.
- Reuse centralized helpers in `utils.py` where possible.
- Keep console logging concise.
- Framework: `pytest`.
- Add unit tests for new behavior and edge cases.
- Use `tmp_path` and `monkeypatch` for I/O isolation; avoid network.
- Name tests by feature (for example `tests/test_docflow_server.py`).
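
As an illustration of the isolation guideline above, the sketch below tests a small function using the built-in `tmp_path` and `monkeypatch` fixtures. `resolve_base_dir` is a hypothetical stand-in defined here for the example; it is not the actual logic in `config.py`.

```python
import os
from pathlib import Path


def resolve_base_dir() -> Path:
    """Hypothetical stand-in: read BASE_DIR from DOCFLOW_BASE_DIR."""
    value = os.environ.get("DOCFLOW_BASE_DIR")
    if not value:
        raise RuntimeError("DOCFLOW_BASE_DIR is not set")
    return Path(value)


def test_resolve_base_dir_uses_env(tmp_path, monkeypatch):
    # monkeypatch restores the real environment after the test finishes.
    monkeypatch.setenv("DOCFLOW_BASE_DIR", str(tmp_path))
    assert resolve_base_dir() == tmp_path


def test_state_files_stay_inside_tmp_path(tmp_path, monkeypatch):
    monkeypatch.setenv("DOCFLOW_BASE_DIR", str(tmp_path))
    state_dir = resolve_base_dir() / "state"
    state_dir.mkdir()
    (state_dir / "reading.json").write_text("{}", encoding="utf-8")
    # All writes land in the per-test temporary directory.
    assert (tmp_path / "state" / "reading.json").read_text(encoding="utf-8") == "{}"
```

Neither test touches the real `BASE_DIR` or the network, so both can run on any machine.
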
- Prefer Conventional Commits with optional scope (`feat(...)`, `fix(...)`, `tests(...)`, `docs: ...`).
- PRs should include: rationale, behavioral impact, tests, and config/env notes.
- If behavior or CLI changes, update `README.md` and relevant docs.
- Do not commit secrets.
- Use env vars for credentials: `OPENAI_API_KEY`, `INSTAPAPER_USERNAME`, `INSTAPAPER_PASSWORD`.
- Optional year override: `DOCPIPE_YEAR`.
- Keep `TWEET_LIKES_STATE` outside the repo (for example `"$HOME/.secrets/docflow/x_state.json"`) to avoid losing session state during repo cleanup.
- Keep `DOCFLOW_BASE_DIR` in `~/.docflow_env` aligned with your local environment.
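
A minimal sketch of reading these settings from the environment rather than from committed files. The helper names are hypothetical. Treating all three credentials as required on every run, and falling back to the current year when `DOCPIPE_YEAR` is unset, are both assumptions made for illustration.

```python
import os
from datetime import date
from typing import Mapping

CREDENTIAL_VARS = ("OPENAI_API_KEY", "INSTAPAPER_USERNAME", "INSTAPAPER_PASSWORD")


def missing_credentials(environ: Mapping[str, str] = os.environ) -> list[str]:
    """Hypothetical helper: names of credential variables that are unset or empty."""
    return [name for name in CREDENTIAL_VARS if not environ.get(name)]


def resolve_year(environ: Mapping[str, str] = os.environ) -> int:
    """Hypothetical helper: DOCPIPE_YEAR if set, otherwise the current year."""
    override = environ.get("DOCPIPE_YEAR")
    return int(override) if override else date.today().year
```

Passing the mapping as a parameter keeps both helpers easy to test without modifying the real environment.
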