57 changes: 57 additions & 0 deletions .agent/repo=.this/role=any/briefs/diff-boundary-nav.md
@@ -0,0 +1,57 @@
# diff boundary navigation

## .what

ctrl+d j/k navigates diff boundaries, not just chunks. behavior:
- inside chunk → jump to chunk edge (top/bottom)
- at chunk edge → jump to next/prev chunk

## .why better than chunk-to-chunk

standard `]c`/`[c` jumps to top of next chunk. problem: if chunk is 50 lines, you land at line 1 and must scroll to see the rest.

boundary nav solves this:
1. first press → bottom of current chunk (see full context)
2. second press → top of next chunk (ready to review)

result: never land mid-chunk unsure where it ends. always at a boundary with full visibility.

## .flow example

```
chunk A (lines 10-25)
chunk B (lines 40-60)
chunk C (lines 80-85)
```

cursor at line 15 (inside chunk A):
- ctrl+d j → line 25 (bottom of A)
- ctrl+d j → line 40 (top of B)
- ctrl+d j → line 60 (bottom of B)
- ctrl+d j → line 80 (top of C)

cursor at line 50 (inside chunk B):
- ctrl+d k → line 40 (top of B)
- ctrl+d k → line 25 (bottom of A)
- ctrl+d k → line 10 (top of A)

## .implementation

uses `navigate_diff_boundary(direction, get_chunks_fn, fallback_fn)`:
- get_chunks_fn returns `[{start, fin}, ...]` for each chunk
- checks cursor position relative to chunk boundaries
- moves to boundary or next chunk accordingly
- fallback_fn called if cursor not in any chunk

two chunk detection methods:
- `get_gitsigns_chunks()` - for normal buffers via gitsigns hunks
- `get_diff_hl_chunks()` - for diff buffers via vim's diff_hlID()
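a minimal sketch of the decision logic (python for illustration — the real implementation would be lua inside the neovim config, and these names mirror the brief, not the actual code):

```python
def navigate_diff_boundary(cursor, chunks, direction):
    """return the target line for a boundary jump, or None when the cursor
    is outside every chunk (the caller would then invoke fallback_fn).

    chunks: sorted list of (start, fin) line ranges, one per diff chunk.
    direction: "next" or "prev".
    """
    for i, (start, fin) in enumerate(chunks):
        if start <= cursor <= fin:
            if direction == "next":
                if cursor < fin:
                    return fin  # inside chunk -> jump to bottom edge
                # already at bottom edge -> top of next chunk, if any
                return chunks[i + 1][0] if i + 1 < len(chunks) else None
            if cursor > start:
                return start  # inside chunk -> jump to top edge
            # already at top edge -> bottom of previous chunk, if any
            return chunks[i - 1][1] if i > 0 else None
    return None  # not in any chunk: fallback case
```

replaying the flow example above: from line 15 in chunk a (10-25), "next" returns 25, then 40, then 60, then 80 — always landing on a boundary.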

## .keybinds

| context | key | action |
|---------|-----|--------|
| normal buffer | ctrl+d j | next boundary (gitsigns) |
| normal buffer | ctrl+d k | prev boundary (gitsigns) |
| codediff buffer | ctrl+d j | next boundary (diff hl) |
| codediff buffer | ctrl+d k | prev boundary (diff hl) |
18 changes: 13 additions & 5 deletions .claude/settings.json
@@ -148,11 +148,17 @@
"Bash(npx rhachet run --skill git.commit.bind:*)",
"Bash(npx rhachet run --skill git.commit.bind get)",
"Bash(npx rhachet run --skill git.commit.set:*)",
"Bash(npx rhachet run --skill git.commit.set -m 'fix(api): validate input\n\n- added input schema\n- added error handler')",
"Bash(npx rhachet run --skill git.commit.set -m $MESSAGE)",
"Bash(npx rhachet run --skill git.commit.set --mode apply -m $MESSAGE)",
"Bash(npx rhachet run --skill git.commit.set --push -m $MESSAGE)",
"Bash(npx rhachet run --skill git.commit.set --unstaged ignore -m $MESSAGE)",
"Bash(echo $MESSAGE | npx rhachet run --skill git.commit.set -m @stdin)",
"Bash(echo $MESSAGE | npx rhachet run --skill git.commit.set -m @stdin --mode apply)",
"Bash(echo $MESSAGE | npx rhachet run --skill git.commit.set -m @stdin --mode apply --push)",
"Bash(echo $MESSAGE | npx rhachet run --skill git.commit.set -m @stdin --unstaged ignore)",
"Bash(echo $MESSAGE | npx rhachet run --skill git.commit.set -m @stdin --unstaged include)",
"Bash(rhx git.commit.set:*)",
"Bash(echo $MESSAGE | rhx git.commit.set -m @stdin)",
"Bash(echo $MESSAGE | rhx git.commit.set -m @stdin --mode apply)",
"Bash(echo $MESSAGE | rhx git.commit.set -m @stdin --mode apply --push)",
"Bash(echo $MESSAGE | rhx git.commit.set -m @stdin --unstaged ignore)",
"Bash(echo $MESSAGE | rhx git.commit.set -m @stdin --unstaged include)",
"Bash(npx rhachet run --skill git.commit.push:*)",
"Bash(npx rhachet run --skill show.gh.action.logs:*)",
"Bash(npx rhachet run --skill show.gh.test.errors:*)",
@@ -168,6 +174,8 @@
"Bash(npx rhachet run --skill condense --from 'briefs/**/*.md' --mode apply)",
"Bash(npx rhachet run:*)",
"Bash(npx rhachet:*)",
"Bash(rhx:*)",
"Bash(npx rhx:*)",
"Bash(npm view:*)",
"Bash(npm list:*)",
"Bash(npm remove:*)",
@@ -0,0 +1,3 @@
branch: vlad/cloud-gpus
research: cloud-gpus
bound_by: init.research skill
23 changes: 23 additions & 0 deletions .research/v2026_02_26.cloud-gpus/0.wish.md
@@ -0,0 +1,23 @@
wish =

we want to research the cost of gpus to serve llms (e.g., think rtx 3090 level) on cloud hosts like aws

specifically, for example, what would the options be to host our own qwen3.5 inference machine on aws hardware?

---

context, originally thought about homelab w/ a $1.5k gpu setup

but, realize that we want our clones to run on the cloud anyway, so ideally we'd be able to rent that compute too

hopefully, we can keep it all in aws to keep our network latency minimal

---

so, what are the options for doing that?

are there any hosted options by aws where we give them an open-weights model and they run inference for us?

are there any provisioned options by aws where we rent a gpu-strapped ecs or something?

are there any serverless options?
12 changes: 12 additions & 0 deletions .research/v2026_02_26.cloud-gpus/1.1.probes.aim.internal.stone
@@ -0,0 +1,12 @@
read the wish in .research/v2026_02_26.cloud-gpus/0.wish.md

imagine probe questions to research the topics that would enable fulfillment of that wish

gather from internalized knowledge — formulate questions from what you already know:
- at least 21 questions we should ask
- divergent thought domains to explore (parallel fields, analogies, metaphors)
- inversions to consider (what if the opposite were true? what would fail?)

---

emit to .research/v2026_02_26.cloud-gpus/1.1.probes.aim.internal.v1.i1.md
71 changes: 71 additions & 0 deletions .research/v2026_02_26.cloud-gpus/1.1.probes.aim.internal.v1.i1.md
@@ -0,0 +1,71 @@
# probes.aim.internal — cloud gpu options for llm inference

## core infrastructure questions

1. what ec2 instance types have gpus suitable for llm inference (p3, p4, p5, g4, g5, g6 families)?
2. what is the hourly cost range for gpu-enabled ec2 instances (on-demand vs spot vs reserved)?
3. what vram capacities are available across aws gpu instances (16gb, 24gb, 40gb, 80gb)?
4. can qwen 3.5 (7b, 14b, 32b, 72b variants) fit in vram on different instance types?
5. what is the cold start time to spin up a gpu ec2 instance from stopped state?
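question 4 can be roughed out with back-of-envelope arithmetic before any pricing research — a sketch, using the common rule of thumb that weight memory is params times bytes per param, plus an assumed ~20% overhead for kv cache and activations (the overhead factor is an assumption and varies with context length and batch size):

```python
def vram_gb(params_billions, bytes_per_param, overhead=1.2):
    """rough vram needed in gb: weights plus ~20% kv-cache/activation overhead."""
    return params_billions * bytes_per_param * overhead

# qwen-class sizes at fp16 (2 bytes/param) vs int4 quantized (0.5 bytes/param)
for size in (7, 14, 32, 72):
    print(f"{size}b: fp16 ~{vram_gb(size, 2.0):.0f}gb, int4 ~{vram_gb(size, 0.5):.0f}gb")
# e.g. 32b: fp16 ~77gb (wants an a100 80gb), int4 ~19gb (fits a 24gb a10g/3090-class card)
```

this is why the 24gb question recurs below: a 32b model only fits single-card 24gb hardware under aggressive quantization.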

## managed/hosted inference options

6. does aws sagemaker support import of your own open-weights model for inference?
7. what is aws bedrock's model selection — does it include qwen or only closed models?
8. what are sagemaker inference endpoint costs vs raw ec2 gpu costs?
9. does aws have any serverless gpu inference option (like lambda but with gpu)?
10. what is aws inferentia/trainium — are these viable alternatives to nvidia gpus for transformer inference?

## architecture patterns

11. can ecs/eks run gpu-enabled containers for inference workloads?
12. what is the latency difference between same-region ec2 gpu vs cross-region api call?
13. can we use autoscale groups with gpu instances for burst inference capacity?
14. what container images/runtimes work for gpu inference (nvidia-docker, cuda)?
15. how do we handle model load time vs inference time for cost optimization?

## cost optimization

16. what are spot instance interruption rates for gpu instances historically?
17. can we use savings plans or reserved instances for gpu workloads?
18. what is the breakeven point: rent cloud gpu vs own $1.5k homelab gpu?
19. are there cheaper non-aws alternatives (lambda labs, vast.ai, runpod) worth consideration?
20. what is the cost per 1M tokens inference on different instance types?
21. does aws have any free tier or credits for gpu workloads?

## divergent thought domains

### parallel fields
22. how do cloud game services (geforce now, shadow) provision gpus — lessons for inference?
23. how do crypto mine operations optimize gpu cost-efficiency — applicable patterns?
24. how do scientific compute clusters (hpc) handle gpu schedule — relevant for inference queues?

### analogies
25. is llm inference more like a database query (stateless, cacheable) or a render job (compute-bound, unique)?
26. can we treat model weights like a "warm cache" that stays resident while instances sleep?
27. is the gpu landscape like early cloud compute — rapid commoditization underway?

### metaphors
28. if gpus are "chefs" and models are "recipes", how do we minimize chef idle time?
29. if inference is a "toll road", when does road ownership make sense vs toll payment?

## inversions (what if opposite were true?)

30. what if network latency doesn't matter — would that change the homelab vs cloud calculus?
31. what if gpu costs increase rather than decrease — how do we hedge?
32. what if qwen 3.5 is too large for cost-effective inference — what smaller models suffice?
33. what if serverless gpu becomes mainstream in 2026 — should we wait vs build now?
34. what if quantization (int8, int4) makes consumer gpus viable — does cloud still win?
35. what if the real bottleneck is memory bandwidth, not vram capacity?

## risk/failure modes

36. what happens if our gpu instance gets terminated mid-inference (spot)?
37. what if aws gpu capacity is constrained in our region — alternatives?
38. what are the vendor lock-in risks of sagemaker use vs raw ec2?
39. what if model updates require instance type changes — migration complexity?
40. what security considerations exist for inference on shared cloud hardware?

---

summary: 40 probe questions across infrastructure, managed services, architecture, cost, divergent domains, and inversions to guide external research on cloud gpu options for llm inference.
12 changes: 12 additions & 0 deletions .research/v2026_02_26.cloud-gpus/1.2.probes.aim.external.stone
@@ -0,0 +1,12 @@
read the wish in .research/v2026_02_26.cloud-gpus/0.wish.md
read the internal probes in .research/v2026_02_26.cloud-gpus/1.1.probes.aim.internal.v1.i1.md

websearch to discover what additional questions to ask based on external sources:
- what do experts in this domain ask?
- what are the known unknowns?
- what controversies or debates exist?
- what questions did the internal probes miss?

---

emit to .research/v2026_02_26.cloud-gpus/1.2.probes.aim.external.v1.i1.md
99 changes: 99 additions & 0 deletions .research/v2026_02_26.cloud-gpus/1.2.probes.aim.external.v1.i1.md
@@ -0,0 +1,99 @@
# probes.aim.external — cloud gpu options for llm inference

## what experts ask (derived from external sources)

### inference server selection
41. vllm vs tgi vs tensorrt-llm — which inference server yields best gpu utilization for qwen?
42. does vllm's pagedattention mechanism (85-92% gpu utilization) justify its complexity vs tgi (68-74%)?
43. what continuous batch configuration maximizes throughput without latency degradation?
44. which inference server has best support for qwen model family specifically?

### quantization and optimization
45. at what quality threshold does int4 quantization become unacceptable for production inference?
46. does speculative decode work well with qwen models to reduce latency?
47. what kv-cache offload strategies reduce vram requirements without throughput loss?
48. can awq/gptq quantized qwen 32b fit on a single 24gb gpu (rtx 3090/4090 class)?

### aws inferentia/trainium deep dive
49. what is the neuron sdk compile time overhead for qwen model deployment?
50. which qwen model sizes have verified compatibility with inferentia2 (inf2 instances)?
51. what is the real-world cost savings of inf2 vs p4d/g5 for llm inference (claimed 70%)?
52. does inferentia2 support dynamic sequence lengths or only fixed-shape inference?

### managed service tradeoffs
53. sagemaker charges when not in use — what is the minimum viable autoscale-to-zero pattern?
54. sagemaker multi-container endpoints claim 80% cost reduction — real-world experience?
55. what is the sagemaker inference component cold start time vs raw ec2?
56. does aws bedrock support qwen, or only closed models (anthropic, meta, cohere)?

## known unknowns (gaps in current research)

57. what is the actual token/second throughput for qwen 32b on g5.xlarge (a10g) vs p4d (a100)?
58. how do tensor parallelism configurations affect cost-efficiency for multi-gpu inference?
59. what is the real interruption rate for p4d/g5 spot instances in us-east-1 in 2025-2026?
60. how does aws capacity reservation work for gpu instances — lead time, minimum commitment?
61. what is the practical latency difference between same-vpc inference vs bedrock api call?

## controversies and debates in the field

### cloud vs self-host breakeven
62. the ~3.4 year breakeven for homelab (6000 hours) — does this account for gpu depreciation?
63. hidden costs debate: staff (70-80% of tco) — applicable to small teams with automation?
64. when does "2-3x cloud premium" become worth it vs operational overhead?
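the ~3.4 year / 6000 hour figure can be reconstructed — a sketch, where the $0.25/hr cloud rate and ~5h/day usage are assumptions chosen to reproduce the cited numbers, and the hidden costs debated above (power, depreciation, staff time) are deliberately excluded:

```python
def breakeven_years(hardware_usd, cloud_usd_per_hour, usage_hours_per_day):
    """years until cumulative cloud rental matches the upfront hardware cost.
    excludes power, depreciation, and staff time (the debated hidden costs)."""
    breakeven_hours = hardware_usd / cloud_usd_per_hour  # 1500 / 0.25 = 6000h
    return breakeven_hours / (usage_hours_per_day * 365)

print(round(breakeven_years(1500, 0.25, 5), 1))  # ~3.3 years
```

the sensitivity is the real question: at 24/7 usage the same math gives under a year, which is where self-host starts to win on raw rental cost alone.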

### specialized vs commodity hardware
65. is aws inferentia "mature enough" or still risky for production llm inference (2025 status)?
66. nvidia tax debate: are a100/h100 worth premium over consumer gpus (3090/4090) for inference?
67. when does memory bandwidth (not vram capacity) become the bottleneck — model size threshold?

### provider reliability vs cost
68. vast.ai "lowest price but sleep loss" tradeoff — acceptable for non-critical inference?
69. runpod "reliable but not enterprise" — where is the quality threshold?
70. lambda labs "excellent but out of capacity" — is capacity improved in 2026?

## questions internal probes missed

### deployment specifics
71. what container base image works best for qwen inference on aws (nvidia/cuda, aws dlami)?
72. what is the model download time for qwen 32b from huggingface to ec2 instance?
73. how do you persist model weights across spot interruptions (efs, s3, instance store)?
74. what health check and readiness probe patterns work for gpu inference containers?

### scale patterns
75. horizontal vs vertical scale for inference — when does multi-instance beat multi-gpu?
76. what request queue depth triggers autoscale without latency spikes?
77. how do you handle inference in scale-up window (queue, reject, degrade)?

### observability
78. what metrics matter most for gpu inference cost optimization (utilization, queue depth, latency)?
79. how do you detect gpu memory leaks in long-lived inference containers?
80. what alert thresholds indicate "add more capacity" vs "optimize configuration"?

### security and compliance
81. is shared gpu tenancy (spot, shared instances) acceptable for sensitive inference workloads?
82. what data residency options exist for gpu inference in aws (regions, dedicated hosts)?
83. how do you audit inference requests/responses for compliance (logs, retention)?

### cost attribution
84. how do you attribute inference costs to individual customers/use cases?
85. what granularity does aws provide for gpu instance meters?
86. how do you forecast gpu costs with variable inference load patterns?

---

## sources

- [gpu economics 2026 - dev.to](https://dev.to/kaeltiwari/gpu-economics-what-inference-actually-costs-in-2026-2goo)
- [gpu procurement guide - bentoml](https://www.bentoml.com/blog/where-to-buy-or-rent-gpus-for-llm-inference)
- [cloud vs on-prem tco - latitude](https://latitude.so/blog/cloud-vs-on-prem-llms-long-term-cost-analysis)
- [qwen gpu requirements - apxml](https://apxml.com/posts/gpu-system-requirements-qwen-models)
- [sagemaker vs ec2 cost - generativeai.pub](https://generativeai.pub/the-cost-of-inference-aws-sagemaker-vs-ec2-c7ce5d9c99d2)
- [vllm vs tgi comparison - inferless](https://www.inferless.com/learn/vllm-vs-tgi-the-ultimate-comparison-for-speed-scalability-and-llm-performance)
- [vllm vs tgi arxiv paper](https://arxiv.org/abs/2511.17593)
- [aws inferentia vs trainium vs gpu - zircon.tech](https://zircon.tech/blog/aws-ai-infrastructure-inferentia2-vs-trainium-vs-gpu-for-production-workloads/)
- [lambda labs vs runpod vs vast.ai - lyceum](https://lyceum.technology/magazine/lambda-labs-vs-runpod-vs-vast-ai/)
- [top cloud gpu providers 2026 - runpod](https://www.runpod.io/articles/guides/top-cloud-gpu-providers)

---

summary: 46 additional probe questions (41-86) derived from external sources — inference server selection, quantization, aws custom silicon, managed services, known unknowns, field controversies, and gaps in internal probes.
11 changes: 11 additions & 0 deletions .research/v2026_02_26.cloud-gpus/1.3.probes.aim.blend.stone
@@ -0,0 +1,11 @@
read the internal probes in .research/v2026_02_26.cloud-gpus/1.1.probes.aim.internal.v1.i1.md
read the external probes in .research/v2026_02_26.cloud-gpus/1.2.probes.aim.external.v1.i1.md

blend the probes into one unified set:
- dedupe overlaps
- group by theme
- order by priority (most critical gaps first)

---

emit to .research/v2026_02_26.cloud-gpus/1.3.probes.aim.blend.v1.i1.md