# Mixpeek Documentation

> Multimodal AI infrastructure that makes unstructured data searchable and AI-ready.

## Documentation Overview

Mixpeek is a multimodal data warehouse and retrieval platform. This documentation covers the complete platform for processing images, videos, audio, PDFs, and text into searchable, AI-ready assets.

## Quick Links

- Full Documentation: https://mixpeek.com/docs
- API Reference: https://mixpeek.com/docs/api-reference
- OpenAPI Spec: https://api.mixpeek.com/docs/openapi.json
- Quickstart: https://mixpeek.com/docs/overview/quickstart

## Core Concepts

### Terminology
- **Namespace** = Qdrant collection (tenant isolation boundary)
- **Bucket** = Raw file storage (S3/GCS/Azure/R2/Wasabi/Tigris)
- **Collection** = Processing pipeline with configured feature extractors
- **Document** = Qdrant point + payload; use `document_id` in application logic
- **Retriever** = Multi-stage configurable search pipeline
- **Taxonomy** = Hierarchical classification system
- **Cluster** = Automatic semantic grouping of content
- **Plugin** = Custom Python-based feature extractor
- **Manifest** = Infrastructure-as-code resource definition

### Data Ingestion Flow
Never insert directly into Qdrant. Always follow this flow: upload to a bucket → trigger the collection → wait for the batch to finish.

### Storage Tiering
Collections have lifecycle states: `active` (Qdrant + S3 Vectors), `cold` (S3 only), `archived` (metadata only).
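
The three lifecycle states above can be modeled client-side as a small enum. This is a minimal sketch for illustration only: the state strings and storage locations come from the line above, while the class and function names are hypothetical and not part of any Mixpeek SDK.

```python
from enum import Enum


class LifecycleState(Enum):
    """Collection lifecycle states as listed in this document."""
    ACTIVE = "active"
    COLD = "cold"
    ARCHIVED = "archived"


# Where a collection's data lives in each state, per the description above.
STORAGE_BY_STATE = {
    LifecycleState.ACTIVE: ("Qdrant", "S3 Vectors"),
    LifecycleState.COLD: ("S3",),
    LifecycleState.ARCHIVED: ("metadata",),
}


def storage_for(state_name: str) -> tuple:
    """Map a lifecycle state string to its storage locations (hypothetical helper)."""
    return STORAGE_BY_STATE[LifecycleState(state_name)]
```

Parsing through the enum means an unexpected state string raises `ValueError` instead of being silently accepted.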

## Essential Documentation Pages

### Getting Started
- /overview/introduction - Platform overview
- /overview/quickstart - 5-minute getting started guide
- /overview/concepts - Core terminology
- /overview/architecture - System architecture
- /overview/data-model - Data model and relationships

### Data Ingestion
- /ingestion/objects - Working with raw objects
- /ingestion/uploads - File upload methods
- /ingestion/buckets - Bucket management and syncs
- /ingestion/collections - Collection configuration
- /ingestion/namespaces - Namespace management
- /ingestion/features - Feature overview

### Feature Extraction
- /processing/feature-extractors - Available extractors overview
- /processing/extractors/multimodal - Multimodal dense/sparse embeddings
- /processing/extractors/image - Image analysis (detection, OCR, segmentation)
- /processing/extractors/text - Text processing and embeddings
- /processing/extractors/document - PDF/document extraction
- /processing/extractors/face-identity - Face recognition and identity
- /processing/extractors/web-scraper - Web content extraction
- /processing/extractors/course-content - Educational content parsing
- /processing/extractors/passthrough - No-op extractor
- /processing/plugins - Custom Python plugin system
- /processing/model-registry - Custom model management
- /processing/batching - Batch processing and retry logic
- /processing/pipelines - Pipeline configuration

### Search & Retrieval
- /retrieval/retrievers - Retriever creation and management
- /retrieval/stages/overview - Pipeline stages overview
- /retrieval/filters - Query filtering syntax

#### Filter Stages
- /retrieval/stages/feature-search - Vector/semantic search
- /retrieval/stages/attribute-filter - Metadata filtering
- /retrieval/stages/llm-filter - LLM-based semantic filtering
- /retrieval/stages/agent-search - Autonomous agent search
- /retrieval/stages/query-expand - Query expansion

#### Sort Stages
- /retrieval/stages/sort-relevance - Score-based ordering
- /retrieval/stages/sort-attribute - Field-based ordering
- /retrieval/stages/mmr - Maximal Marginal Relevance
- /retrieval/stages/rerank - Cross-encoder reranking
- /retrieval/stages/score-normalize - Score normalization

#### Reduce Stages
- /retrieval/stages/aggregate - Group and reduce
- /retrieval/stages/sample - Random sampling
- /retrieval/stages/summarize - LLM summarization
- /retrieval/stages/limit - Top-K cutoff
- /retrieval/stages/deduplicate - Near-duplicate removal

#### Group Stages
- /retrieval/stages/group-by - Bucket by field value
- /retrieval/stages/cluster - Semantic clustering of results

#### Apply Stages
- /retrieval/stages/json-transform - Reshape/project fields
- /retrieval/stages/rag-prepare - Format for RAG injection
- /retrieval/stages/external-web-search - Augment with live web results
- /retrieval/stages/api-call - Call external HTTP endpoint
- /retrieval/stages/sql-lookup - Join with SQL data source
- /retrieval/stages/cross-compare - LLM-powered comparison
- /retrieval/stages/web-scrape - Fetch content from URLs in results
- /retrieval/stages/unwind - Flatten array fields
- /retrieval/stages/code-execution - Sandboxed Python on results

#### Enrich Stages
- /retrieval/stages/llm-enrich - Add LLM-generated fields
- /retrieval/stages/taxonomy-enrich - Apply taxonomy classification
- /retrieval/stages/document-enrich - Join with related documents
- /retrieval/stages/agentic-enrich - Autonomous enrichment

### Retriever Features
- /retrieval/interactions - Click/view/conversion tracking
- /retrieval/benchmarks - Head-to-head configuration comparison

### Relevance & Personalization
- /relevance/overview - Relevance system overview
- /relevance/interactions - Interaction signal collection
- /relevance/fusion-strategies - Weighted, RRF, learned fusion
- /relevance/learned-fusion - ML-trained fusion weights
- /relevance/evaluations - Offline evaluation datasets
- /relevance/analytics - Retrieval quality analytics

### Enrichment & Organization
- /enrichment/taxonomies - Taxonomy-based classification
- /enrichment/clusters - Automatic semantic clustering
- /enrichment/retriever-enrichments - Retriever-based enrichment

### Operations
- /operations/security - Auth, RBAC, secrets management
- /operations/webhooks - Async event delivery
- /operations/manifests - Infrastructure-as-code
- /operations/storage-tiering - Storage lifecycle management

### Best Practices
- /best-practices/schema-design - Collection and document schema
- /best-practices/feature-selection - Choosing the right extractors
- /best-practices/caching-strategies - Query and model caching
- /best-practices/cost-optimization - Reducing compute and storage costs

### Troubleshooting
- /troubleshoot/errors - Error reference
- /troubleshoot/limits - Rate limits and quotas
- /troubleshoot/common-issues - Common problems and fixes
- /troubleshoot/faq - Frequently asked questions

### Integrations
- /integrations/search-widget - Embeddable search UI
- /integrations/object-storage/s3 - AWS S3
- /integrations/object-storage/gcs - Google Cloud Storage
- /integrations/object-storage/azure-blob - Azure Blob Storage
- /integrations/object-storage/r2 - Cloudflare R2
- /integrations/object-storage/wasabi - Wasabi
- /integrations/object-storage/tigris - Tigris
- /integrations/social-media/instagram - Instagram connector
- /integrations/developer-tools/python-sdk - Python SDK
- /integrations/developer-tools/javascript-sdk - JavaScript SDK
- /integrations/developer-tools/mixpeek-cli - CLI
- /integrations/developer-tools/mcp-server - MCP server for agents

## API Authentication

All API requests require Bearer token authentication:
```
Authorization: Bearer YOUR_API_KEY
```

Namespace context is provided via header:
```
X-Namespace: ns_your_namespace_id
```
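
As a minimal sketch, the two headers above can be assembled in Python as follows. The helper name `build_headers` is hypothetical, and the `Content-Type` header is an assumption for JSON request bodies; only `Authorization` and `X-Namespace` come from this document.

```python
from typing import Dict, Optional


def build_headers(api_key: str, namespace_id: Optional[str] = None) -> Dict[str, str]:
    """Return request headers for an authenticated call (hypothetical helper)."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        # Assumption: request bodies are JSON.
        "Content-Type": "application/json",
    }
    # Only attach namespace context when a namespace ID is supplied.
    if namespace_id is not None:
        headers["X-Namespace"] = namespace_id
    return headers


headers = build_headers("YOUR_API_KEY", "ns_your_namespace_id")
```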

## Common API Patterns

### Create a Namespace
POST /v1/namespaces

### Create a Bucket and Collection
- POST /v1/buckets
- POST /v1/collections (with a `feature_extractors` array)
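
The two calls above might be built like this with the Python standard library. This is a sketch under stated assumptions: the base URL is inferred from the host of the OpenAPI spec link, and every payload field except `feature_extractors` is a hypothetical placeholder. Check the API reference for the real request schemas.

```python
import json
import urllib.request

API_BASE = "https://api.mixpeek.com"  # assumption: host of the OpenAPI spec URL


def post_json(path: str, payload: dict, api_key: str, namespace_id: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        API_BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "X-Namespace": namespace_id,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical payloads: only `feature_extractors` is named in this document.
bucket_request = post_json(
    "/v1/buckets",
    {"bucket_name": "product-images"},
    "YOUR_API_KEY",
    "ns_your_namespace_id",
)
collection_request = post_json(
    "/v1/collections",
    {
        "collection_name": "product-embeddings",
        "feature_extractors": [{"feature_extractor_name": "multimodal"}],
    },
    "YOUR_API_KEY",
    "ns_your_namespace_id",
)
```

Passing either request to `urllib.request.urlopen` would send it; the sketch stops short of that so it can be run without credentials.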

### Upload and Process Data
1. POST /v1/buckets/{bucket_id}/uploads - upload a file
2. The collection triggers batch processing automatically
3. GET /v1/buckets/{bucket_id}/batches/{batch_id} - poll the batch status
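
The polling step above can be sketched as a bounded loop. The HTTP call is injected as a `fetch_status` callable so the logic can be tested offline. The `status` field name and the terminal values `COMPLETED` and `FAILED` are assumptions, not confirmed by this document; check the API reference for the real batch schema.

```python
import time
from typing import Callable

# Assumption: the batch response carries a "status" field with these terminal values.
TERMINAL_STATUSES = {"COMPLETED", "FAILED"}


def batch_status_path(bucket_id: str, batch_id: str) -> str:
    """Path of the batch status endpoint from step 3."""
    return f"/v1/buckets/{bucket_id}/batches/{batch_id}"


def wait_for_batch(
    fetch_status: Callable[[str], dict],
    bucket_id: str,
    batch_id: str,
    interval_s: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the batch reaches a terminal status or max_polls is exhausted."""
    path = batch_status_path(bucket_id, batch_id)
    for attempt in range(max_polls):
        body = fetch_status(path)
        if body.get("status") in TERMINAL_STATUSES:
            return body
        # Do not sleep after the final attempt.
        if attempt < max_polls - 1:
            sleep(interval_s)
    raise TimeoutError(f"batch {batch_id} not finished after {max_polls} polls")
```

In real use, `fetch_status` would issue an authenticated GET against the API and return the decoded JSON body.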

### Execute Search
POST /v1/retrievers/{retriever_id}/execute
- Provide the query in `inputs`
- The retriever runs its configured stage pipeline
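
A minimal sketch of building the execute request with the standard library follows. The base URL is inferred from the OpenAPI spec link, and the `query` key inside `inputs` is a hypothetical example: input names depend on how the retriever was configured.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.mixpeek.com"  # assumption: host of the OpenAPI spec URL


def build_execute_request(
    retriever_id: str, inputs: dict, api_key: str, namespace_id: str
) -> urllib.request.Request:
    """Build (but do not send) a POST to the retriever execute endpoint."""
    # Percent-encode the ID so it cannot alter the URL path.
    safe_id = urllib.parse.quote(retriever_id, safe="")
    return urllib.request.Request(
        f"{API_BASE}/v1/retrievers/{safe_id}/execute",
        data=json.dumps({"inputs": inputs}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "X-Namespace": namespace_id,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# "ret_example" and the "query" input name are placeholders.
request = build_execute_request(
    "ret_example", {"query": "red sneakers"}, "YOUR_API_KEY", "ns_your_namespace_id"
)
```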

### Deploy a Custom Plugin
- POST /v1/namespaces/{ns_id}/plugins - upload plugin code
- POST /v1/namespaces/{ns_id}/plugins/{plugin_id}/deploy - deploy

### Apply Manifest (IaC)
- POST /v1/manifest/apply - declaratively create/update resources
- POST /v1/manifest/validate - validate without applying
- GET /v1/manifest/export - export current state as manifest
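
One way to combine the validate and apply endpoints is to validate first and apply only if validation succeeds. The sketch below assumes an injected `post` callable that sends an authenticated request and raises on an error response. The manifest body is treated as an opaque dict because its schema is not described here, and the function name is hypothetical.

```python
from typing import Callable


def validate_then_apply(post: Callable[[str, dict], dict], manifest: dict) -> dict:
    """Validate a manifest, then apply it only if validation did not raise."""
    # Assumption: `post` raises an exception when the API rejects the manifest,
    # which stops execution before the apply call is made.
    post("/v1/manifest/validate", manifest)
    return post("/v1/manifest/apply", manifest)
```

Keeping the transport injectable lets the same function run against a stub in CI and against the live API in deployment.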

## SDKs

- Python: `pip install mixpeek`
- JavaScript: `npm install mixpeek-sdk`
- CLI: `pip install mixpeek-cli`
- MCP Server: expose retrievers as tools for Claude and other agents

## Support

- Documentation: https://mixpeek.com/docs
- GitHub: https://github.com/mixpeek
- Discord: https://discord.gg/mixpeek
- Email: support@mixpeek.com