Is your feature request related to a problem?
I am currently frustrated by the "Latency Tax" and throughput bottlenecks when performing high-volume AI tasks, specifically End-to-End Coding (where an agent refactors multiple files simultaneously) and GraphRAG indexing.
Standard providers like OpenAI or Anthropic, while capable, are often too slow for an interactive "Vibe Coding" workflow, leading to frequent breaks in developer flow while waiting for large code blocks to generate. Furthermore, when running GraphRAG, the sheer volume of extraction and transformation calls needed to build a knowledge graph is either prohibitively expensive on pay-per-token pricing or takes hours to complete, since traditional inference providers process those calls largely serially.
Describe the Solution You'd Like
I would like native support for the Cerebras Inference API and the Cerebras Code subscription tier.
Key features should include:
- Direct API Integration: A first-class provider option for the Cerebras OpenAI-compatible endpoint (https://api.cerebras.ai/v1).
- High-Throughput UI Optimization: The ability for the UI to handle the 2,000+ tokens/second stream without lag or memory leaks in the editor.
- Model Presets: Native support for llama-3.3-70b, llama-3.1-8b, and the specialized cerebras-code models.
- Cerebras Code Quota Management: Specific handling for the Cerebras Code Pro/Max tiers, allowing users to leverage their massive daily token quotas (up to 120M tokens/day) without triggering standard rate-limit logic designed for pay-per-token tiers.
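To make the quota-management point concrete, here is a minimal sketch of what tier-aware handling might look like. The class, field names, and default numbers are illustrative assumptions, not an existing API: the idea is that a subscription tier is modeled as a large daily budget rather than a per-minute rate limit.

```python
from dataclasses import dataclass, field
import time

@dataclass
class CerebrasQuota:
    """Illustrative tracker for a daily-token-quota tier (e.g. Cerebras Code).

    Unlike pay-per-token tiers, which are throttled per minute, a subscription
    tier is better modeled as a large daily budget that resets every 24 hours.
    All names and numbers here are assumptions for the sketch.
    """
    daily_limit: int = 120_000_000          # e.g. the 120M tokens/day tier
    used: int = 0
    window_start: float = field(default_factory=time.time)

    def record(self, tokens: int) -> None:
        """Account for tokens consumed by a completed request."""
        self._maybe_reset()
        self.used += tokens

    def can_send(self, estimated_tokens: int) -> bool:
        """Check whether a request fits in the remaining daily budget."""
        self._maybe_reset()
        return self.used + estimated_tokens <= self.daily_limit

    def _maybe_reset(self) -> None:
        # Reset the budget once the 24-hour window has elapsed.
        if time.time() - self.window_start >= 86_400:
            self.used = 0
            self.window_start = time.time()

quota = CerebrasQuota()
quota.record(5_000_000)
print(quota.can_send(10_000_000))  # plenty of daily budget left
```

The point of the sketch is the shape of the logic, not the numbers: standard per-minute backoff logic would needlessly throttle a user who has a daily budget this large.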
Describe Alternatives You've Considered
- Groq: While Groq offers excellent Time-To-First-Token (TTFT), its rate limits and total throughput for massive multi-file writes often fall behind Cerebras' CS-3 hardware capabilities.
- Mercury 2 (Inception Labs): A very fast Diffusion-based LLM, but it lacks the established "All-you-can-eat" flat-fee model for heavy developer usage.
- Generic OpenAI-Compatible Provider: While I can currently use a "Custom" field, it lacks the specialized system prompts, MCP (Model Context Protocol) optimizations, and specific rate-limit handling required to truly leverage Cerebras’ speed.
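For context, the generic path today amounts to pointing an OpenAI-compatible client at the Cerebras base URL. The sketch below builds (but does not send) such a request with the standard library only; the API key and prompt are placeholders, and the payload shape follows the OpenAI chat completions convention:

```python
import json
import urllib.request

# Build an OpenAI-compatible chat completion request against the
# Cerebras endpoint. The key is a placeholder; nothing is sent here.
payload = {
    "model": "llama-3.3-70b",
    "messages": [{"role": "user", "content": "Refactor this function ..."}],
    "stream": True,  # stream tokens rather than waiting for the full reply
}
req = urllib.request.Request(
    "https://api.cerebras.ai/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_CEREBRAS_API_KEY",
    },
    method="POST",
)
print(req.full_url)
```

This works, but it is exactly what a "Custom" provider field already gives you; it carries none of the provider-specific system prompts, stream handling, or quota logic described above.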
Additional Context
Cerebras is currently the world’s fastest inference engine, powered by the CS-3 Wafer-Scale Engine.
- For GraphRAG: Being able to push millions of tokens through in minutes rather than hours is the difference between a knowledge graph that builds in 5 minutes and one that takes 5 hours.
- For Coding: It enables "instant" multi-file edits and refactors that feel like local file operations rather than cloud-based streaming.
- Reference: Cerebras Developer Documentation
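As a back-of-envelope illustration of the GraphRAG claim, here is the arithmetic under stated assumptions (the corpus size and concurrency are illustrative, not benchmarks; only the 2,000 tokens/second rate comes from the figures above):

```python
# Rough wall-clock estimate for a GraphRAG extraction pass.
# All inputs except tokens_per_second are illustrative assumptions.
total_output_tokens = 5_000_000   # extraction output for a mid-size corpus
tokens_per_second = 2_000         # streaming rate cited above
concurrent_requests = 16          # parallel extraction calls

serial_minutes = total_output_tokens / tokens_per_second / 60
parallel_minutes = serial_minutes / concurrent_requests
print(f"serial: {serial_minutes:.0f} min, parallel: {parallel_minutes:.1f} min")
```

The single-stream rate alone is not what collapses build times; it is the combination of per-stream speed and the headroom to run many extraction calls concurrently, which is why the quota handling above matters.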
Would You Like to Contribute?