| name | sociable-unit-tests |
|---|---|
| description | Guide for writing sociable unit tests with directory isolation, no mocks, and transaction-level testing. Use when writing tests, reviewing test quality, or setting up test infrastructure for a Python project. |
This skill covers three concerns:
- Sociable Unit Testing Philosophy — minimize mocking, test complete transactions through real code paths, isolate via environment variables and temp directories
- Test Infrastructure and Documentation — directory layout, test segregation, venv setup, and README documentation for running tests in Python repos
- Agent Tests — LLM inference tests with controlled context, no ambient configuration, predictable within model limits
The infrastructure and documentation section is general Python project hygiene, not specific to sociable testing. It is included here because the philosophy directly drives setup decisions (e.g., unit tests must be the default pytest target, integration suites must be segregated by reliability profile). It may be separated into a dedicated skill if skill composition conventions emerge.
Jay Fields coined the terms "sociable" and "solitary" for two schools of unit testing. Martin Fowler popularized them:
- Solitary tests isolate the unit under test using mocks or test doubles for all collaborators. The classically "mockist" approach.
- Sociable tests let the unit interact with real implementations of its collaborators. Isolation comes from controlling inputs and the environment, not from replacing internal code with fakes.
Fowler presents both as valid. This skill does not.
Write sociable tests. Avoid solitary tests.
The unit of testing is a meaningful transaction — a complete operation through the business logic layer (SDK, core, lib, domain, service — whatever the project calls it), not an individual function or module in isolation.
No mocks unless you ask. Isolate via temp dirs and env vars, not by replacing collaborators with fakes.
Micro-isolating tests to a function or module that alone cannot complete a meaningful transaction produces broad swaths of technical-sounding tests where it becomes difficult to assess what actual functionality is or isn't covered. When a test mocks every collaborator, it verifies that your code calls the mocks correctly, not that it works. The findings are over-isolated to the point of risky irrelevance: all tests pass, but the system is broken at the seams.
This approach leads to repo pollution and bloat — hundreds of tests that are expensive to maintain, tightly coupled to implementation details (not behavior), and fragile to refactoring. Renaming an internal method breaks dozens of tests that were never verifying user-facing behavior in the first place.
Reserve mocking for system boundaries where real collaboration is impractical — network calls, external services, clocks. Everywhere else, let the code run.
Bias strongly toward sociable unit tests. Cover everything possible as fast, offline, deterministic unit tests before reaching for any form of integration testing.
- Sociable unit tests — the default. Every repo should have these. Cover all business logic through the core layer and interface layers using in-process testing (no network, no credentials, no external systems). This is where the vast majority of test value lives.
- Agent integration tests (`tests/integration/agent/`) — add only when the repo's functionality involves LLM inference that needs behavioral verification. Keep them minimal, maximally reliable, and controlled. Do not look for opportunities to add agent tests to every project.
- Real-user integration tests (`tests/integration/real-user/`) — last resort. Add only when the repo's core functionality depends largely or entirely on an external system that can only be tested with real, private user accounts that are not reusable across developers. Examples: proxying a third-party API with no sandbox environment, wrapping a platform where each developer must supply their own credentials and account data. Avoid entirely if unit tests with a local provider or contract testing can cover the same ground.
The principle: use the real thing when it's controllable and fast. Mock when it's uncontrollable or slow. These two properties almost always go together, but not always — a local database is controllable but slow to spin up, so it belongs in integration tests, not unit tests.
Mock (uncontrollable or slow):
- Network calls (APIs, HTTP, external services)
- System clocks (if time precision matters)
- Heavy external processes (database servers, containers)
Don't mock (controllable and fast):
- File system operations — point at a temp dir instead
- Configuration parsers — write real config files to temp dirs
- Environment variables — set the real env var and revert it when the test ends. Use your test framework's built-in env management (e.g., `monkeypatch.setenv()` in pytest) rather than raw env manipulation with manual cleanup — the framework guarantees revert even if the test throws, and handles parallel execution safely. This is not mocking; the real environment is modified and restored.
- Core layer methods called by interface layers
- Ubiquitous CLI tools that are either too common to mention as prerequisites (e.g., `git`) or are documented project dependencies — run them for real against isolated temp dirs
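A hedged sketch of the last point — drive the real `git` binary against an isolated temp directory instead of mocking subprocess calls (assumes `git` is installed; the helper name is illustrative, not part of this skill):

```python
import subprocess
from pathlib import Path


def init_repo(path: Path) -> bool:
    """Run the real `git init` in an isolated directory — no mocks, no network."""
    result = subprocess.run(
        ["git", "init"], cwd=path, capture_output=True, text=True
    )
    return result.returncode == 0 and (path / ".git").is_dir()
```

In a pytest suite the `path` argument would be the `tmp_path` fixture, so each test gets its own throwaway repository.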
When setting up tests for a project, there are two valid approaches to handling external dependencies. Choose one and use it consistently.
Option A: Mock at the network call. Stub or patch the HTTP client so the core layer's own code runs fully but never hits the wire. Simpler when the core layer is a thin wrapper around an API.
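A minimal sketch of Option A, with illustrative names (`fetch_username` and the client's `get` method are assumptions, not this skill's API): the core logic runs for real while only the HTTP boundary is replaced.

```python
from unittest import mock


def fetch_username(client, user_id: int) -> str:
    # Core-layer code under test: real logic, stubbed wire.
    resp = client.get(f"/users/{user_id}")
    return resp["name"].strip().lower()


def test_fetch_username_normalizes_name():
    # Only the HTTP client is faked; the normalization logic runs unchanged.
    client = mock.Mock()
    client.get.return_value = {"name": "  Ada  "}
    assert fetch_username(client, 1) == "ada"
    client.get.assert_called_once_with("/users/1")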
Option B: Local provider with real data. Build an alternative backend (e.g., an in-memory store, a local database, a file-backed provider) that implements the same interface as the real provider. Tests run against real data through real code paths — no patching at all. This eliminates the network boundary entirely rather than mocking it.
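A hedged sketch of Option B — every name here (`Item`, `LocalProvider`, `items_over`) is illustrative. The local provider implements the same interface as the real API-backed provider, so business logic runs unchanged against either:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Item:
    id: str
    balance: float


class Provider(Protocol):
    def list_items(self) -> list[Item]: ...


class LocalProvider:
    """In-memory stand-in used only in tests; same interface as the real provider."""

    def __init__(self, items: list[Item]):
        self._items = items

    def list_items(self) -> list[Item]:
        return list(self._items)


def items_over(provider: Provider, threshold: float) -> list[Item]:
    # Real business logic: runs identically against local and real providers.
    return [i for i in provider.list_items() if i.balance > threshold]
```

Tests construct a `LocalProvider` seeded with fixture data and exercise filtering, sorting, and aggregation through the real code path — no patching anywhere.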
When to prefer Option B:
- The core layer has a provider/repository abstraction already
- Multiple test files need the same realistic data set
- You want to test filtering, sorting, and aggregation logic without maintaining mock response fixtures per test
- The project already uses this pattern — continue it
If existing tests already follow one pattern, continue that pattern. Consistency within a repo matters more than theoretical preference. Do not introduce Option A into a repo using Option B or vice versa without explicit approval.
The local provider pattern is a form of what Martin Fowler calls narrow integration testing — exercising the code that would talk to an external service, but with the boundary replaced entirely. The seed data that the local provider serves is effectively an informal contract — a captured representation of the external API's response shape. For repos that proxy third-party platforms, consider periodically verifying that the seed data still matches the real API shape (this is the core idea behind contract testing, formalized by tools like Pact).
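One lightweight way to do that periodic verification — the field names are assumptions for illustration — is a test that compares seed-record keys against the shape the real API is expected to return:

```python
# Fields the real API is believed to return (assumed for this sketch).
REQUIRED_FIELDS = {"id", "name", "balance"}


def seed_matches_contract(seed: list[dict]) -> bool:
    """True if every seed record carries at least the expected fields."""
    return all(REQUIRED_FIELDS <= record.keys() for record in seed)
```

A scheduled or manually-run test can fetch one real response, then assert `seed_matches_contract` against both it and the seed file, catching drift between the informal contract and the live API.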
Every project that reads from config or data directories must support env var overrides for those paths. Tests use these to redirect to temp directories.
Pattern (Python example):

```python
# In the core layer (config.py or similar):
import os
from pathlib import Path


def get_config_dir() -> Path:
    path_str = os.environ.get("MY_APP_CONFIG_DIR")
    if path_str:
        return Path(path_str)
    return Path.home() / ".config" / "my-app"
```

```python
# In conftest.py:
import pytest


@pytest.fixture(autouse=True)
def isolate_config(tmp_path, monkeypatch):
    """Auto-isolate every test from real config directories."""
    test_config = tmp_path / "config"
    test_config.mkdir(parents=True, exist_ok=True)
    monkeypatch.setenv("MY_APP_CONFIG_DIR", str(test_config))
```

Key points:
- `autouse=True` — applies to every test automatically
- `tmp_path` — unique per test, auto-cleaned
- `monkeypatch.setenv()` — auto-reverted when the test ends, no cross-test leakage, no impact on parallel suites. Safer than raw `os.environ` manipulation.
- Place in `conftest.py` — central, no imports needed per test
Isolate ALL writable paths: config, data, cache, XDG directories. If the core layer writes to it, tests must redirect it.
When using a local provider for testing (Python example):

```python
# conftest.py
import shutil
from pathlib import Path

import pytest


@pytest.fixture
def test_db_path(tmp_path):
    """Copy test data to a temp directory."""
    src = Path(__file__).parent / "fixtures" / "test_data.json"
    dest = tmp_path / "test_data.json"
    shutil.copy(src, dest)
    return dest


@pytest.fixture
def local_provider(test_db_path):
    """Create a local provider with test data."""
    provider = LocalProvider(db_path=test_db_path)
    yield provider
    provider.close()
```

Seed data lives in `tests/unit/fixtures/` — either a static file or generated from a seed file by a script. The seed file is the only file that needs human review; generated data is gitignored.
All business logic lives in the core layer. CLI, MCP, REST, and other interfaces are thin wrappers.
Test the core layer directly — don't test through interface layers when verifying business logic.
```python
# Good: test the core layer directly
def test_add_command_creates_config(tmp_path, monkeypatch):
    monkeypatch.setenv("MY_APP_CONFIG_DIR", str(tmp_path))
    result = core.add_command("fix", "Fix the bug")
    assert (tmp_path / "commands" / "fix.toml").exists()


# Bad: test through a CLI subprocess
def test_add_command_cli(tmp_path):
    result = subprocess.run(["my-app", "cmds", "add", ...])
    # Slower, harder to isolate, tests CLI parsing not logic
```

Thin, in-process tests for CLI, HTTP/REST, MCP, or other interface layers. No out-of-process calls, no network traffic. These verify wiring and request/response shape, not business logic.
Examples:
- HTTP/ASGI: in-memory test client (e.g., httpx `ASGITransport` in Python, supertest in Node)
- CLI: built-in test runner (e.g., Click's `CliRunner` in Python, commander's programmatic API in Node)
- MCP: stdio protocol over pipes or in-process invocation
Interface tests belong in `tests/unit/` alongside core layer tests — they're fast, deterministic, and need no credentials.
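A stdlib-only sketch of such an in-process interface test (the `main(argv)` entry point and `add` subcommand are assumptions for illustration) — no subprocess, just the parser and captured stdout:

```python
import argparse
import io
from contextlib import redirect_stdout


def main(argv: list[str]) -> int:
    """Thin CLI wrapper; real logic would live in the core layer."""
    parser = argparse.ArgumentParser(prog="my-app")
    sub = parser.add_subparsers(dest="cmd", required=True)
    add = sub.add_parser("add")
    add.add_argument("name")
    args = parser.parse_args(argv)
    if args.cmd == "add":
        print(f"added {args.name}")
    return 0


def test_add_wiring():
    # In-process: call main() directly and capture stdout.
    buf = io.StringIO()
    with redirect_stdout(buf):
        code = main(["add", "fix"])
    assert code == 0
    assert "added fix" in buf.getvalue()
```

The same shape applies with Click's `CliRunner` or an ASGI test client — the point is that the interface is exercised without leaving the process.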
Test the full flow of a feature, not individual functions in isolation.
Example: testing a "label email" feature:
- Search for target email — confirm it exists without the label
- Read email details (before) — assert label absent
- Apply label — execute the operation
- Read email details (after) — assert label present, other fields unchanged
This catches integration bugs between steps that isolated function tests miss.
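The steps above can be sketched as a single transaction-level test. `Email` and `EmailCore` are assumed in-memory stand-ins for the repo's core layer, included only to keep the sketch self-contained:

```python
from dataclasses import dataclass, field


@dataclass
class Email:
    id: str
    subject: str
    labels: set = field(default_factory=set)


class EmailCore:
    """In-memory stand-in for a core layer (assumption for this sketch)."""

    def __init__(self, emails):
        self._emails = {e.id: e for e in emails}

    def search(self, word):
        return [e for e in self._emails.values() if word in e.subject]

    def read(self, eid):
        return self._emails[eid]

    def apply_label(self, eid, label):
        self._emails[eid].labels.add(label)


def test_label_email_full_flow():
    core = EmailCore([Email("1", "invoice overdue")])
    (msg,) = core.search("invoice")                   # 1. find the target
    assert "urgent" not in core.read(msg.id).labels   # 2. before: label absent
    core.apply_label(msg.id, "urgent")                # 3. execute the operation
    after = core.read(msg.id)
    assert "urgent" in after.labels                   # 4. after: label present
    assert after.subject == "invoice overdue"         #    other fields unchanged
```

Each numbered step matches one bullet above; a solitary test of `apply_label` alone would miss bugs at the seams between search, read, and write.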
Test names describe scenario + outcome, not implementation details.
Good:
- `test_not_installed_when_no_hooks_path`
- `test_refuses_to_overwrite_foreign_hook`
- `test_init_creates_manifest_and_dockerfile`
- `test_logs_food_to_current_date_directory`

Bad:
- `test_returns_false_when_file_missing`
- `test_respects_config`
- `test_returns_true`
```
tests/
  unit/           # Fast, offline, no credentials (DEFAULT)
    conftest.py   # Shared fixtures (isolation, providers)
    fixtures/     # Seed data, test configs
  integration/    # Non-default suites (explicit invocation only)
    real-user/    # Hits live third-party APIs with real credentials and
                  # real user data. Not stable across contributors — each
                  # needs their own account/credentials.
    agent/        # LLM inference tests (see Part 3)
```

Most repos need only `tests/unit/`. Do not add integration subdirectories speculatively — add them when a specific, unavoidable testing need arises that unit tests cannot cover. See the Testing Hierarchy in Part 1 for when each level is appropriate.
Unit tests are the only default for bare `pytest`. Configure this in `pyproject.toml`:

```toml
[tool.pytest.ini_options]
testpaths = ["tests/unit"]
```

(Do not use `addopts = "--ignore=tests/integration"` for this — `--ignore` also filters explicitly passed paths, which would break the invocations below.)

Running tests:

```
pytest                               # unit only (default)
pytest tests/integration/real-user/  # explicit suite
pytest tests/integration/agent/      # explicit suite
pytest tests/integration/            # all integration
pytest tests/                        # everything
```

Passing a path as a positional arg overrides the configured `testpaths` — no additional configuration or code needed.
- Fast, local, no network, no credentials.
- Subprocess calls only to ubiquitous tools (e.g., `git init`, `git commit` — never `git push`).
- Every test must be isolated from every other test and from the real filesystem.
Gitignore the venv directory. Ensure `.venv/` is in `.gitignore` before creating one.
Test dependencies go in the project's dev extras:

```toml
# pyproject.toml
[project.optional-dependencies]
dev = ["pytest", "pytest-asyncio"]
```

The project's README (or CONTRIBUTING.md) must document how to run each test suite. Template:
```markdown
## Testing

**Unit tests** run offline with no credentials:

    make setup   # bootstrap venv (once)
    pytest       # run unit tests

Or from a fresh clone:

    make test    # creates venv + runs unit tests

**Integration tests** require additional setup:

| Suite | Path | Prerequisites |
|-------|------|---------------|
| real-user | `tests/integration/real-user/` | Platform credentials (see Authentication) |
| agent | `tests/integration/agent/` | `ANTHROPIC_API_KEY` |

Run explicitly:

    pytest tests/integration/real-user/
    pytest tests/integration/agent/

These are excluded from the default `pytest` invocation.
```

`make test` always runs unit tests only. `make setup` is independently callable so contributors can bootstrap the venv and then use `pytest` directly for non-default suites:

```makefile
setup:
	python -m venv .venv
	.venv/bin/pip install -e ".[dev]"

test: setup
	.venv/bin/pytest
```

```
make test                                   # unit tests (fresh clone safe)
make setup && pytest tests/integration/...  # bootstrap then run any suite
```

- Create `tests/unit/` with `conftest.py`
- Decide on boundary strategy (Option A or B) — see Part 1
- Configure `pyproject.toml` to exclude `tests/integration/` by default
- Create isolation fixtures in `conftest.py`
- Ensure all core layer config/data paths support env var overrides
- Add `pytest` to dev dependencies
- Gitignore `.venv/`
- Add `setup` and `test` Makefile targets
- Document test setup in README or CONTRIBUTING.md using the template above
- Verify: bare `pytest` runs only unit tests
Where in the repo's docs the testing setup lives depends on the project's type, technicality, and how central testing is to someone cloning the repo. Acceptable placements:
- README.md — a short `## Testing` section showing minimal setup + invocation. Best when testing is central to what the repo IS (a library, a CLI tool, a repo whose value proposition includes its test suite).
- CONTRIBUTING.md — fuller setup instructions alongside other dev-workflow content. Best when the primary audience is evaluate → select → install → use, and testing is mostly for contributors.
- docs/TESTING.md — the bulk of detail lives there, with a short pointer from README or CONTRIBUTING. Best when the testing story is detailed enough that including it inline would dominate the surrounding doc.
Any of these is fine. When extending an existing repo, follow the precedent already established — don't introduce a new location if the repo already has one of these patterns. Consistency within a codebase matters more than picking the "best" placement in the abstract.
A repo may carry a custom test runner script as an optional alternative to invoking `pytest` directly — e.g., `./runtests`, `./my-app test <options>`, or similar.

This skill never introduces such a script. If a repo does not already have one, do not create one as part of any testing-setup task. The documented `make setup` + `pytest` path is the canonical entry point; a runner is pure convenience on top of it.
When a runner is already present, enforce these compliance rules:
- **Pure stdlib.** The runner must run immediately after clone with no prerequisites — no `make setup`, no manual venv bootstrap, no `pip install`. Only the Python standard library. Anyone with Python 3 can clone the repo and invoke it.
- **Auto-bootstrap the same venv.** The runner must internally perform the same venv setup that the README / CONTRIBUTING / docs/TESTING.md documents as the manual prerequisite for direct `pytest` invocation — same directory name (e.g. `.venv/`), same dependency source (the project's declared dev dependencies — `pyproject.toml`, `setup.py`, `requirements*.txt`, or whatever the docs point at), same activation approach. A contributor who runs `make setup` and then uses `pytest` directly must be able to switch to the runner (or vice versa) without the venv changing shape.
- **README documentation shows both paths.** When a runner exists, the testing-setup doc must present it alongside bare `pytest` and `make test`, so contributors know all paths lead to the same outcome.
- **Runner is additive, never authoritative.** Everything the runner does must be achievable by following the documented manual setup + bare `pytest`. The runner is a shortcut, not a parallel ecosystem. Test-behavior decisions (which tests run, what `pytest.ini` filters apply, which environment variables are set) stay in `pytest.ini` / `pyproject.toml` / `conftest.py` — the runner just invokes pytest after preparing the environment.
If an existing runner violates any of these, bring it into compliance or remove it. A divergent runner is worse than no runner — it silently installs a different dependency set, documents an inconsistent setup, or claims to be a shortcut while actually being a second way to test.
Agent tests follow the sociable unit testing philosophy from Part 1 — no mocks of internal collaborators, test complete transactions, meaningful assertions — with one exception: inference calls to an LLM are inherently non-deterministic, so assertions check for reasonable signals rather than exact outputs.
This section is optional. Most repos do not need agent tests. Use this when a repo tests AI inference behavior and prompts are expected to produce reliable, testable outputs.
- Known model version — pin the model in the test, not in ambient config
- Identical prompt and context each run — no user-scoped, machine-scoped, or project context files that aren't fully controlled by the test
- No tool calls to external services or user-specific data — stub or mock all data backends with test fixtures
- Assert on reasonable signals, not exact strings — check that expected data appeared, a tool was called, or output contains key indicators
When testing via an agent CLI program (e.g., `claude -p`, `gemini -p`) rather than a direct API call, these programs load user-scoped and machine-scoped context by default — global context files, plugin hooks, MCP registrations, shell environment. Tests must disable all ambient context so the only context the model sees is what the test explicitly provides. For example, `claude --bare` strips all user and project context.
Do not write agent tests where the model makes tool calls to services backed by shifting data, user-specific accounts, or configurations that vary across contributors. Those belong in `tests/integration/real-user/` or are not suitable for automated testing.
Agent tests require connectivity and API key env vars. They are excluded from the default `pytest` invocation and documented in the README alongside other integration suites.
    pytest tests/integration/agent/

```python
# tests/integration/agent/conftest.py
import json
import os

import pytest

_HAS_KEY = bool(os.environ.get("ANTHROPIC_API_KEY"))
requires_api_key = pytest.mark.skipif(
    not _HAS_KEY, reason="ANTHROPIC_API_KEY not set"
)


@pytest.fixture
def mcp_config(tmp_path):
    """MCP config with stub backends only."""
    config = {"mcpServers": {
        "my-app": {"command": "my-app-mcp", "args": ["stdio"]},
        "stub-backend": {"command": "python", "args": ["tests/stubs/backend.py"]},
    }}
    path = tmp_path / "mcp.json"
    path.write_text(json.dumps(config))
    return str(path)
```

```python
# tests/integration/agent/test_prompts.py
import subprocess

from conftest import requires_api_key  # resolvable under pytest's default import mode


@requires_api_key
def test_tool_invocation(mcp_config):
    result = subprocess.run(
        ["claude", "--bare", "--mcp-config", mcp_config,
         "-p", "List items with a balance over 500"],
        capture_output=True, text=True, timeout=120,
    )
    assert result.returncode == 0
    # Assert on signals, not exact output
    assert "500" in result.stdout or "balance" in result.stdout.lower()
```