Gateway, Routing & Token Economy
Local Gateway & API Normalization
`src/providers/provider-client.ts` implements a dedicated client class per provider wire format, all normalising to the same `ProviderCompleteResponse`. No external SDK is used; every API call is a raw `fetch`.
| Client class | Wire format | Key differences |
|---|---|---|
| `OpenAICompatibleProviderClient` | `POST /chat/completions` (OpenAI Chat) | Base for Mistral, Ollama, OpenRouter, LM Studio, custom |
| `OpenAIProviderClient` | `POST /responses` (OpenAI Responses API) | Different message serialisation; no streaming (falls back to chat) |
| `AnthropicProviderClient` | `POST /messages` (`anthropic-version: 2023-06-01`) | `x-api-key` header; `tool_use` blocks; `cache_creation_input_tokens` |
| `OllamaProviderClient` | `/api/tags` for listing; delegates completions | Normalises Ollama model list format |
| `OpencodeZenProviderClient` | OpenAI-compatible + session/project headers | Free tier; tool-name sanitiser (`.` → `_`); lazy-loaded optional module |
SSE streaming is implemented in OpenAICompatibleProviderClient.completeStream(). The parser handles two SSE wire formats simultaneously:
- OpenAI-compatible: `choices[].delta` chunks + final `usage` with `include_usage: true`
- Anthropic-format: `content_block_start` / `content_block_delta` / `message_delta` event types (emitted by some proxy-routed models)
The parser detects the format per-event via isAnthropicStreamFormat() and merges usage blocks from both sources, preferring Anthropic values when present.
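The per-event detection and usage merge described above can be sketched roughly as follows. This is a minimal illustration, not the code in `provider-client.ts`: the `Usage` and `StreamEvent` shapes, the `parseSseChunk` and `mergeUsage` helpers, and the detection rule inside `isAnthropicStreamFormat` are all assumptions.

```typescript
// Hypothetical, simplified shapes; the real types may differ.
interface Usage {
  inputTokens?: number;
  outputTokens?: number;
}

type StreamEvent = Record<string, unknown>;

const ANTHROPIC_EVENT_TYPES = new Set([
  "content_block_start",
  "content_block_delta",
  "message_delta",
]);

// Assumed rule: an Anthropic-format event carries one of the known `type` values.
function isAnthropicStreamFormat(event: StreamEvent): boolean {
  const type = event["type"];
  return typeof type === "string" && ANTHROPIC_EVENT_TYPES.has(type);
}

// Parse the `data:` lines of one SSE chunk into JSON events,
// skipping blank payloads and the `[DONE]` sentinel.
function parseSseChunk(chunk: string): StreamEvent[] {
  const events: StreamEvent[] = [];
  for (const line of chunk.split(/\r?\n/)) {
    if (!line.startsWith("data:")) continue;
    const payload = line.slice("data:".length).trim();
    if (payload === "" || payload === "[DONE]") continue;
    events.push(JSON.parse(payload) as StreamEvent);
  }
  return events;
}

// Merge usage from both sources; Anthropic values win whenever they are defined.
function mergeUsage(openai: Usage, anthropic: Usage): Usage {
  return {
    inputTokens: anthropic.inputTokens ?? openai.inputTokens,
    outputTokens: anthropic.outputTokens ?? openai.outputTokens,
  };
}
```

Because detection runs per event rather than per stream, a single response that mixes both formats can still be parsed in one pass.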
Thinking / reasoning normalisation per provider:
| Provider | Mechanism | Param sent |
|---|---|---|
| Anthropic | thinking: { type: "enabled", budget_tokens: N } | Named tiers → fixed budgets (low=4k, medium=10k, high=16k, max=32k) |
| OpenAI o-series | reasoning_effort: low \| medium \| high | Detected by ^o\d model pattern + blocklist |
| Mistral Magistral | Built-in (no param) | Nothing sent; response parsed from thinking array chunks |
| Mistral small/medium | Optional reasoning_effort | Sent only when thinkBudget is explicitly set |
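As a rough sketch of the table above, the per-provider parameter could be built like this. The function name `buildThinkingParams`, the `Provider` union, reading "4k" as 4000 tokens, clamping `max` to `high` for `reasoning_effort`, and detecting Magistral by name prefix are all assumptions made for illustration.

```typescript
type ThinkTier = "low" | "medium" | "high" | "max";
type Provider = "anthropic" | "openai" | "mistral";

// Tier budgets from the table; treating "k" as 1000 tokens is an assumption.
const ANTHROPIC_BUDGETS: Record<ThinkTier, number> = {
  low: 4_000,
  medium: 10_000,
  high: 16_000,
  max: 32_000,
};

// o-series detection: the ^o\d pattern minus a caller-supplied blocklist.
function isOSeriesModel(model: string, blocklist: ReadonlySet<string> = new Set<string>()): boolean {
  return /^o\d/.test(model) && !blocklist.has(model);
}

// Assumption: reasoning_effort has no "max" value, so "max" is clamped to "high".
function toEffort(tier: ThinkTier): "low" | "medium" | "high" {
  return tier === "max" ? "high" : tier;
}

function buildThinkingParams(provider: Provider, model: string, tier?: ThinkTier): Record<string, unknown> {
  if (tier === undefined) return {}; // nothing requested, nothing sent
  switch (provider) {
    case "anthropic":
      return { thinking: { type: "enabled", budget_tokens: ANTHROPIC_BUDGETS[tier] } };
    case "openai":
      return isOSeriesModel(model) ? { reasoning_effort: toEffort(tier) } : {};
    case "mistral":
      // Magistral reasons natively, so no parameter is sent (prefix check is an assumption).
      return model.startsWith("magistral") ? {} : { reasoning_effort: toEffort(tier) };
  }
}
```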
Smart Routing & Provider Chains
`DefaultProviderGateway` supports two routing modes in every call:
Profile routing (`profileId`): a direct call to one configured profile.
Chain routing (`chainId`): iterates an ordered list of `{ profileId, model? }` entries and returns the first successful response. When an entry fails, the gateway moves on to the next one without surfacing the error. Each entry can independently override the model, so a single chain can span multiple providers and model tiers.
Chains are stored in ~/.umbra/providers.json alongside profiles and are managed via the same CRUD API as profiles.
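A minimal sketch of chain routing, assuming a hypothetical `callProfile` function that performs one profile call and rejects on failure. The `ChainEntry` shape follows the `{ profileId, model? }` description above; `CompleteResponse` and `completeViaChain` are illustrative names, not the gateway's actual API.

```typescript
interface ChainEntry {
  profileId: string;
  model?: string; // optional per-entry model override
}

// Hypothetical response shape for illustration only.
interface CompleteResponse {
  text: string;
  route: string;
}

type CallProfile = (profileId: string, model?: string) => Promise<CompleteResponse>;

// Try each entry in order and return the first success.
// Failures are swallowed until the chain is exhausted, then the last error is rethrown.
async function completeViaChain(chain: ChainEntry[], callProfile: CallProfile): Promise<CompleteResponse> {
  let lastError: unknown = new Error("chain is empty");
  for (const entry of chain) {
    try {
      return await callProfile(entry.profileId, entry.model);
    } catch (err) {
      lastError = err; // fall through to the next entry
    }
  }
  throw lastError;
}
```

Rethrowing only the last error keeps the caller's view simple; a real implementation might prefer to collect every failure for debugging.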
Retry policy (applies to both modes): up to 2 attempts on 429, HTTP 5xx, AbortError, or fetch failed. Backoff: 1 s × attempt index.
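The retry policy can be sketched as below. Two points are assumptions: "up to 2 attempts" is read as two calls in total, and the attempt index is treated as 1-based, so the single retry waits 1 s. `HttpError`, `isRetryable`, `backoffMs`, and `withRetry` are hypothetical names.

```typescript
const MAX_ATTEMPTS = 2;

// Hypothetical error type carrying the HTTP status of a failed call.
class HttpError extends Error {
  readonly status: number;
  constructor(status: number) {
    super(`HTTP ${status}`);
    this.status = status;
  }
}

// Retryable: 429, any 5xx, AbortError, or a "fetch failed" network error.
function isRetryable(err: unknown): boolean {
  if (err instanceof HttpError) {
    return err.status === 429 || (err.status >= 500 && err.status <= 599);
  }
  if (err instanceof Error) {
    return err.name === "AbortError" || err.message.includes("fetch failed");
  }
  return false;
}

// Linear backoff: 1 s times the (assumed 1-based) attempt index.
function backoffMs(attempt: number): number {
  return 1_000 * attempt;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// `sleep` is injectable so the policy can be exercised without real delays.
async function withRetry<T>(
  fn: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= MAX_ATTEMPTS || !isRetryable(err)) throw err;
      await sleep(backoffMs(attempt));
    }
  }
}
```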
Prompt & Tool Output Compression
`src/utils/compression.ts` provides a five-level compression system applied to message history before each provider call.
Levels:
| Level | Machine output (max lines) | Search results | Prose |
|---|---|---|---|
| `off` | unlimited | unlimited | unchanged |
| `lite` | 500 | 30 files / 10 snippets | unchanged |
| `standard` | 200 | 20 files / 5 snippets | unchanged |
| `aggressive` | 50 | 10 files / 3 snippets, no context | filler words stripped |
| `ultra` | 20 | 5 files / 1 snippet, no context | filler words stripped |
`condenseMachineOutput`: head + tail preservation with automatic critical-line extraction from truncated sections. Lines matching `Error:`, `SyntaxError:`, `TypeError:`, `exit code [1-9]`, or a stack frame are always rescued, even when the surrounding section is dropped.
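A minimal sketch of head + tail truncation with critical-line rescue follows. The even head/tail split, the omission marker text, and the stack-frame pattern (indented lines starting with `at `) are assumptions; only the listed error patterns come from the description above.

```typescript
// Patterns named in the text; the stack-frame pattern is an assumption.
const CRITICAL_PATTERNS: RegExp[] = [
  /Error:/, // also covers SyntaxError: and TypeError:
  /exit code [1-9]/,
  /^\s+at\s/,
];

function condenseMachineOutput(text: string, maxLines: number): string {
  const lines = text.split("\n");
  if (lines.length <= maxLines) return text;

  // Assumed split: half of the budget for the head, the rest for the tail.
  const headCount = Math.ceil(maxLines / 2);
  const tailCount = Math.floor(maxLines / 2);
  const head = lines.slice(0, headCount);
  // slice(-0) would return the whole array, so guard the zero case.
  const tail = tailCount > 0 ? lines.slice(-tailCount) : [];
  const dropped = lines.slice(headCount, lines.length - tailCount);

  // Rescue critical lines from the section that would otherwise be lost.
  const rescued = dropped.filter((line) => CRITICAL_PATTERNS.some((p) => p.test(line)));
  const marker = `[... ${dropped.length - rescued.length} lines omitted ...]`;
  return [...head, marker, ...rescued, ...tail].join("\n");
}
```

In this sketch the result can exceed `maxLines` by the marker plus any rescued lines; whether the real function counts rescued lines against the limit is not stated in the source.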
compressToolOutput — unified entry point: if the tool result is JSON with fileBuckets, applies compressSearchResults (sorts by match count, caps files and snippets, optionally strips context lines). All other tool outputs go through condenseMachineOutput.
condenseProse — strips common English filler words (actually, basically, literally, very, just, etc.) in aggressive/ultra modes. Applied to user/assistant messages in history.
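Filler stripping can be approximated with a single case-insensitive regular expression. This sketch uses only the five words named above (the real list is longer, as the "etc." indicates) and does not try to repair punctuation left behind by a removed word.

```typescript
type CompressionLevel = "off" | "lite" | "standard" | "aggressive" | "ultra";

// Only the words named in the text; the full list is not given in the source.
const FILLER_WORDS = ["actually", "basically", "literally", "very", "just"];

// Whole-word match plus any trailing spaces or tabs (newlines are left alone).
const FILLER_PATTERN = new RegExp(`\\b(?:${FILLER_WORDS.join("|")})\\b[ \\t]*`, "gi");

function condenseProse(text: string, level: CompressionLevel): string {
  if (level !== "aggressive" && level !== "ultra") return text; // prose unchanged below aggressive
  return text.replace(FILLER_PATTERN, "").replace(/ {2,}/g, " ").trim();
}
```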
Level selection per run mode:
| Mode | Compression level |
|---|---|
| `full` | always `off` (128 k context budget) |
| `exec` | always `aggressive` (maximise harness iterations) |
| `agent` / `plan` | from `settings.json` → `compression.level` (default: `standard`) |
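The two tables above reduce to a small lookup. The sketch below assumes a `Settings` shape with an optional `compression.level` field, mirroring the `settings.json` key; `resolveCompressionLevel` and `MACHINE_OUTPUT_MAX_LINES` are hypothetical names.

```typescript
type RunMode = "full" | "exec" | "agent" | "plan";
type CompressionLevel = "off" | "lite" | "standard" | "aggressive" | "ultra";

// Assumed shape of the relevant part of settings.json.
interface Settings {
  compression?: { level?: CompressionLevel };
}

// Machine-output line limits per level, taken from the levels table.
const MACHINE_OUTPUT_MAX_LINES: Record<CompressionLevel, number> = {
  off: Infinity,
  lite: 500,
  standard: 200,
  aggressive: 50,
  ultra: 20,
};

function resolveCompressionLevel(mode: RunMode, settings: Settings = {}): CompressionLevel {
  if (mode === "full") return "off"; // fixed, regardless of settings
  if (mode === "exec") return "aggressive"; // fixed, regardless of settings
  return settings.compression?.level ?? "standard"; // agent / plan
}
```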
Usage Accounting, Cost & Route Health
Every provider call appends one line to `~/.umbra/usage.jsonl` via `UsageLogger`. The file is append-only JSONL (one JSON object per line) and survives daemon restarts.
UsageRecord fields:
| Field | Description |
|---|---|
| `requestId` | Random per-request ID matching debug events |
| `profileId` / `chainId` | Which profile or chain was used |
| `sessionId` / `threadId` | Conversation identifiers |
| `model` / `route` | Model ID and `profileId/model` composite |
| `inputTokens` / `outputTokens` / `totalTokens` | From provider usage block |
| `reasoningTokens` | Thinking tokens (Anthropic / OpenAI o-series) |
| `cacheReadTokens` / `cacheWriteTokens` | Anthropic prompt cache hit/write |
| `costEstimate` | USD, computed from `ModelsRegistry` pricing |
| `contextLimit` / `contextPercent` | Context window size and fill % |
| `source` | `actual` (from provider) / `estimated` (local) / `mixed` |
| `status` | `success` or `failed` |
Aggregation methods on UsageLogger:
- `getStats()`: totals across all records
- `getStatsByModel()`: per-model breakdown with `avgCostPerRequest`
- `getStatsByProvider()`: per-profile breakdown
- `getStatsBySession()`: per-session breakdown
- `generateReport()`: formatted text report sorted by total cost
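To illustrate how a per-model breakdown can be derived from the JSONL log, here is a minimal sketch. It uses a trimmed-down `UsageRecord` with only four of the fields listed above, and the `ModelStats` shape, the `parseUsageLog` helper, and the standalone function form are assumptions; the real methods live on `UsageLogger`.

```typescript
// Subset of the UsageRecord fields listed above.
interface UsageRecord {
  model: string;
  totalTokens: number;
  costEstimate: number; // USD
  status: "success" | "failed";
}

// Hypothetical aggregate shape.
interface ModelStats {
  requests: number;
  totalTokens: number;
  totalCost: number;
  avgCostPerRequest: number;
}

// One JSON object per line; blank lines are ignored.
function parseUsageLog(jsonl: string): UsageRecord[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as UsageRecord);
}

function getStatsByModel(records: UsageRecord[]): Map<string, ModelStats> {
  const stats = new Map<string, ModelStats>();
  for (const record of records) {
    const entry = stats.get(record.model) ?? { requests: 0, totalTokens: 0, totalCost: 0, avgCostPerRequest: 0 };
    entry.requests += 1;
    entry.totalTokens += record.totalTokens;
    entry.totalCost += record.costEstimate;
    entry.avgCostPerRequest = entry.totalCost / entry.requests;
    stats.set(record.model, entry);
  }
  return stats;
}
```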
Global CLI Launch & Working Directory Contract
`bin/umbra.js` is the global entry-point shim. It selects the execution strategy at runtime without recompilation:
1. `src/cli/main.ts` exists? → `node --import tsx/esm src/cli/main.ts` (dev mode, no build needed)
2. `dist/cli/main.js` exists? → `node dist/cli/main.js` (installed package)
3. Neither → exit 1 with instructions
CLI flags handled by the shim (stripped before passing to cac):
| Flag | Effect |
|---|---|
| `--debug` | Sets `UMBRA_DEBUG_SIDECAR=1`; opens a sidecar debug console on Windows |
| `--project <path>` | Sets `UMBRA_PROJECT_PATH` env var; overrides CWD for all project-path resolution |
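The strategy selection and flag stripping can be sketched as two pure functions. The actual shim is plain JavaScript; this TypeScript version is illustrative only, and `selectLaunch`, `stripShimFlags`, the injected `exists` callback, and the error message text are assumptions.

```typescript
type Launch = { command: string; args: string[] } | { error: string };

// `exists` is injected so the decision logic stays free of filesystem access.
function selectLaunch(exists: (path: string) => boolean): Launch {
  if (exists("src/cli/main.ts")) {
    return { command: "node", args: ["--import", "tsx/esm", "src/cli/main.ts"] }; // dev mode
  }
  if (exists("dist/cli/main.js")) {
    return { command: "node", args: ["dist/cli/main.js"] }; // installed package
  }
  return { error: "No entry point found; build the project first." }; // caller exits with code 1
}

interface ShimArgs {
  env: Record<string, string>; // variables to set for the child process
  rest: string[]; // arguments passed on to the CLI parser
}

// Consume shim-level flags and leave everything else untouched.
function stripShimFlags(argv: readonly string[]): ShimArgs {
  const env: Record<string, string> = {};
  const rest: string[] = [];
  const queue = argv.slice();
  while (queue.length > 0) {
    const arg = queue.shift() as string;
    if (arg === "--debug") {
      env["UMBRA_DEBUG_SIDECAR"] = "1";
    } else if (arg === "--project") {
      const value = queue.shift(); // the path follows the flag
      if (value !== undefined) env["UMBRA_PROJECT_PATH"] = value;
    } else {
      rest.push(arg);
    }
  }
  return { env, rest };
}
```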
Signals SIGINT, SIGTERM, SIGHUP are forwarded to the child process so Ctrl-C works correctly even when the shim is the process group leader.
On Windows, .cmd scripts are spawned via cmd.exe /d /s /c to avoid Node.js DEP0190 and handle shim quoting correctly (src/cli/process-runner.ts).
Permissions & Access Policies
`src/core/permissions.ts` implements a three-tier access control system.
Evaluation order (most permissive first):
1. `exec-full` mode → allow all (sandbox assumption)
2. `chat-readonly` mode + destructive tool → deny
3. Stored rules (most recently added wins) → allow / deny / allow_always
4. Non-destructive tool → allow without prompt
5. Interactive prompt → y / n / a (allow_always)
Destructive tools (always require evaluation outside exec-full):
shell.exec, fs.write, fs.edit, fs.cd, git.status, git.diff, git.apply, git.commit, git.push, git.pull, web.search
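The evaluation order can be sketched as a single function that returns on the first matching step. This version matches stored rules by exact tool name only and assumes rules are kept in insertion order (newest last); `evaluatePermission`, `StoredRule`, and the `"prompt"` outcome are hypothetical names.

```typescript
type Decision = "allow" | "deny" | "prompt";
type RuleAction = "allow" | "deny" | "allow_always";

interface StoredRule {
  tool: string;
  action: RuleAction;
}

// The destructive tool list as given above.
const DESTRUCTIVE_TOOLS = new Set([
  "shell.exec", "fs.write", "fs.edit", "fs.cd",
  "git.status", "git.diff", "git.apply", "git.commit", "git.push", "git.pull",
  "web.search",
]);

function evaluatePermission(mode: string, tool: string, rules: StoredRule[]): Decision {
  const destructive = DESTRUCTIVE_TOOLS.has(tool);
  if (mode === "exec-full") return "allow"; // 1. sandbox assumption
  if (mode === "chat-readonly" && destructive) return "deny"; // 2. read-only chat

  // 3. Most recently added rule wins; assumes the array is in insertion order.
  const rule = rules.slice().reverse().find((r) => r.tool === tool);
  if (rule) return rule.action === "deny" ? "deny" : "allow";

  if (!destructive) return "allow"; // 4. safe tools never prompt
  return "prompt"; // 5. ask the user (y / n / a)
}
```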
Rule storage:
- Global rules: `~/.umbra/permissions.json`, applied to all projects
- Project rules: `<project>/umbra.permissions.json`, applied when `projectPath` matches
Rule matching supports exact tool name (shell.exec), namespace wildcard (shell.*), or global wildcard (*). Rules have an optional expiresAt ISO timestamp.
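Rule matching and expiry can be sketched as below. The `PermissionRule` shape, the `pattern` field name, and the function names are assumptions; the three pattern forms and the `expiresAt` ISO timestamp come from the description above.

```typescript
interface PermissionRule {
  pattern: string; // "shell.exec", "shell.*", or "*"
  action: "allow" | "deny" | "allow_always";
  expiresAt?: string; // optional ISO timestamp
}

function matchesTool(pattern: string, tool: string): boolean {
  if (pattern === "*") return true; // global wildcard
  if (pattern.endsWith(".*")) {
    // "shell.*" keeps the trailing dot, so "shellx.exec" does not match.
    return tool.startsWith(pattern.slice(0, -1));
  }
  return pattern === tool; // exact tool name
}

// A rule without expiresAt never expires.
function isRuleActive(rule: PermissionRule, now: Date = new Date()): boolean {
  if (rule.expiresAt === undefined) return true;
  return new Date(rule.expiresAt).getTime() > now.getTime();
}
```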
WorkspaceTrustManager tracks trusted directories in ~/.umbra/trusted-paths.json. fs.cd to a path not under any trusted ancestor always triggers an interactive prompt regardless of mode. Choosing “always” registers the path as trusted.
Interactive prompt (TTY only) shows the tool name and a human-readable action summary, then reads y / n / a. In non-TTY environments (piped stdin, CI, tests), the answer is automatically deny.
All decisions are appended to ~/.umbra/debug/permissions.jsonl for audit.
Model Catalogue & UI Normalization
`ModelsRegistry` resolves capabilities for any model ID at runtime:
- `models.dev/api.json`: authoritative dataset; flattened to `provider/model` tuples; cached for 5 minutes in-process
- HuggingFace model API: fallback for models not in models.dev
- Heuristics: name-pattern rules (`-vision`, `-instruct`, `-r1`, `o\d`, context window from common model families)
Capabilities resolved: contextWindow, supportsTools, supportsVision, supportsReasoning, supportsStructuredOutput, supportsAttachments, supportsTemperature, longContext (>100 k), interleaved (reasoning_content vs reasoning_details), inputModalities, outputModalities, pricingPerMillion (input, output, cacheRead, cacheWrite).
Mistral model dedup in listModels() is multi-pass:
- Filter non-chat and archived models
- Exact-ID dedupe
- Family grouping via iterative suffix stripping (`-latest`, `-2604`, `-3.5`, `-3-5`, `-v2`, `-3`); handles compound suffixes like `mistral-medium-3-5-2604` in two passes
- Prefer the `-latest` alias over a specific version within the same family
- Score and sort by tier (Magistral → Large → Medium → Small → Devstral → Codestral → Ministral → Pixtral → Voxtral → Open → penalty for vibe-cli / labs- / deprecated)
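The family-grouping and alias-preference passes can be sketched as follows. The suffix pattern is an assumption built from the examples above, so the intermediate steps may differ from the real implementation even where the resulting family name is the same. Filtering and tier scoring are omitted.

```typescript
// Assumed pattern: -latest, a four-digit date tag, -vN, or a numeric version like -3, -3.5, -3-5.
const VERSION_SUFFIX = /-(?:latest|\d{4}|v\d+|\d+(?:[.-]\d+)?)$/;

// Strip version suffixes repeatedly until none is left.
function familyOf(modelId: string): string {
  let id = modelId;
  while (VERSION_SUFFIX.test(id)) {
    id = id.replace(VERSION_SUFFIX, "");
  }
  return id;
}

// Keep one model per family, preferring the -latest alias when present.
function dedupeByFamily(ids: string[]): string[] {
  const best = new Map<string, string>();
  for (const id of Array.from(new Set(ids))) { // exact-ID dedupe first
    const family = familyOf(id);
    const current = best.get(family);
    if (current === undefined || (id.endsWith("-latest") && !current.endsWith("-latest"))) {
      best.set(family, id);
    }
  }
  return Array.from(best.values());
}
```

Size tags such as `-8b` are not treated as version suffixes here, so `ministral-8b-latest` and `ministral-3b-latest` stay in separate families.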
TUI / Agent & Plan Modes
The TUI is built with Ink (React for terminal). Key surfaces:
- Thread list: paginated, searchable, archivable; forking creates a new thread from the current session state
- Session view: streaming assistant output with `reasoning_delta` and `assistant_delta` events shown separately; tool call / result panels; permission approval dialog
- Provider selector: profile list with status badges (`connected` / `available` / `unavailable`); inline connect/test/delete
- Memory panel: per-project memory text; toggle `useMemories` / `generateMemories`
Simple-chat detection in resolveRunModeContract(): prompts of ≤8 words matching a hardcoded greeting set (hi, hello, thanks, привет, спасибо, etc.) get toolPreset: null and no compression. This avoids sending tool schemas and compression overhead for conversational turns.
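A rough sketch of the simple-chat check follows. The source does not say how a prompt is matched against the greeting set, so this version assumes the first word, lowercased and stripped of trailing punctuation, must be in the set. The set holds only the five examples named above, and the `RunModeContract` shape and the `"default"` preset name are hypothetical.

```typescript
// Only the greetings named in the text; the real set is larger.
const GREETINGS = new Set(["hi", "hello", "thanks", "привет", "спасибо"]);

const MAX_SIMPLE_CHAT_WORDS = 8;

// Assumed rule: at most 8 words, and the first word is a known greeting.
function isSimpleChat(prompt: string): boolean {
  const words = prompt.trim().toLowerCase().split(/\s+/).filter((w) => w !== "");
  if (words.length === 0 || words.length > MAX_SIMPLE_CHAT_WORDS) return false;
  const firstWord = (words[0] ?? "").replace(/[.,!?;:]+$/, "");
  return GREETINGS.has(firstWord);
}

// Hypothetical contract shape: only the two fields discussed above.
interface RunModeContract {
  toolPreset: string | null;
  compression: "off" | "standard";
}

function resolveRunModeContract(prompt: string): RunModeContract {
  if (isSimpleChat(prompt)) {
    return { toolPreset: null, compression: "off" }; // no tool schemas, no compression
  }
  return { toolPreset: "default", compression: "standard" }; // placeholder defaults
}
```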
Mode selection at the CLI / API level:
- `umbra` (no args) → opens TUI in `agent` mode
- `umbra exec "<task>"` → headless `exec` mode (harness loop, 30-min timeout)
- `umbra task add "<task>"` → queues a background `agent` run
- HTTP API `POST /run` with `mode: "plan"` → structured JSON plan output
Markdown Visual Processing
`src/cli/tui/markdown.ts` renders markdown to ANSI escape sequences for terminal display. The renderer works line by line, with no AST parsing, trading full CommonMark compliance for zero dependencies and instant output.
Block-level transforms (applied first):
| Input | Output |
|---|---|
| `# Heading` | Bold + cyan text |
| `## Heading` | Bold + yellow text |
| `---` / `===` | Dim `────────────────────────────────────────` |
| `- item` | Cyan `-` bullet |
| `1. item` | Yellow numbered item |
| `> quote` | Dim `\| quote` |
Inline transforms (applied after block transforms):
| Syntax | Output |
|---|---|
| `` `code` `` | Yellow text |
| `**bold**` | Bold text |
| `*italic*` | Dim text |
| `[text](url)` | Cyan text + dim `(url)` |
All transforms reset ANSI state (`\x1b[0m`) at the end of each match so adjacent styles don’t bleed across tokens.
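A minimal sketch of such a line-by-line renderer, using standard SGR codes (1 bold, 2 dim, 33 yellow, 36 cyan). The function names and exact regular expressions are assumptions, and the sketch is deliberately naive: for example, asterisks inside inline code are still treated as emphasis.

```typescript
const RESET = "\x1b[0m";

// Wrap text in an SGR sequence and reset immediately afterwards.
function style(codes: string, text: string): string {
  return `\x1b[${codes}m${text}${RESET}`;
}

function renderInline(line: string): string {
  return line
    .replace(/`([^`]+)`/g, (_m, code: string) => style("33", code)) // yellow
    .replace(/\*\*([^*]+)\*\*/g, (_m, text: string) => style("1", text)) // bold
    .replace(/\*([^*]+)\*/g, (_m, text: string) => style("2", text)) // dim
    // Excluding "[" from the link text keeps earlier ANSI sequences from being matched.
    .replace(/\[([^\[\]]+)\]\(([^)]+)\)/g, (_m, text: string, url: string) =>
      `${style("36", text)} ${style("2", `(${url})`)}`);
}

function renderLine(line: string): string {
  if (line.startsWith("## ")) return style("1;33", line.slice(3)); // bold + yellow
  if (line.startsWith("# ")) return style("1;36", line.slice(2)); // bold + cyan
  if (/^(?:-{3,}|={3,})\s*$/.test(line)) return style("2", "─".repeat(40)); // dim rule
  if (line.startsWith("- ")) return `${style("36", "-")} ${renderInline(line.slice(2))}`;
  const numbered = /^(\d+)\.\s+(.*)$/.exec(line);
  if (numbered) return `${style("33", `${numbered[1]}.`)} ${renderInline(numbered[2] ?? "")}`;
  if (line.startsWith("> ")) return style("2", `| ${line.slice(2)}`); // dim quote
  return renderInline(line);
}

function renderMarkdown(source: string): string {
  return source.split("\n").map(renderLine).join("\n");
}
```

Block-level matches return before the inline pass for headings and quotes, which avoids nested resets cutting an outer style short.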