MiniLLMLib (Rust)
A minimalist, async-first Rust library for talking to Large Language Models over HTTP, with one consistent API across every provider.
The headline idea: ChatNode::root(...).chat(...) is identical no matter what
is behind it. A Provider owns the entire wire dialect (endpoint, auth,
request body, response and stream envelope, cost accounting). Your code only ever
deals in normalized types, so switching from OpenRouter to OpenAI, to a native
Anthropic key, to a Claude subscription, or to your own self-hosted server is a
one-line change.
What's here
- Conversation trees. A conversation is a tree of ChatNode handles. Linear chats, branching, and prebuilt history all use the same structure.
- Multiple providers. OpenRouter, OpenAI, native Anthropic (/v1/messages), a generic OpenAI-compatible provider for self-hosted servers, and your own hand-written impl Provider for any other wire.
- Streaming over SSE, with an idle-timeout that won't kill a long live generation but fails loudly on a dead connection.
- Honest cost tracking. Per-provider usage and cost, with disjoint cached/uncached/cache-write token buckets and a CostResolution (Resolved / Unpriced / Unknown) that never reports a fake $0.
- Prompt caching, marked on the tree and enforced per-provider.
- Claude subscription auth: use your Pro/Max plan instead of an API key.
- JSON repair for malformed model output.
Two layers of documentation
| Layer | What | Where |
|---|---|---|
| This guide | Tutorials, patterns, worked examples | the pages on the left |
| API reference | Every public type, method, and signature | docs.rs/minillmlib |
The guide teaches you how to use the library; the API reference (auto-generated from the source by docs.rs) is the exhaustive signature lookup. Start with Quickstart.
Quickstart
Install
# Cargo.toml
[dependencies]
minillmlib = "0.3"
tokio = { version = "1", features = ["full"] }
One call
use minillmlib::{ChatNode, GeneratorInfo};

#[tokio::main]
async fn main() -> minillmlib::Result<()> {
    // Pick a provider. OpenRouter reads OPENROUTER_API_KEY from the environment.
    let generator = GeneratorInfo::openrouter("google/gemini-2.5-flash-lite");

    // A conversation is a tree; `root` is the system prompt.
    let root = ChatNode::root("You are a helpful assistant. Be brief.");

    // `chat` = add a user message + get the assistant reply (returns the new node).
    let answer = root.chat("Say hello in five words.", &generator).await?;
    println!("{}", answer.message.text().unwrap_or(""));
    Ok(())
}
That is the 80% case. Swapping the provider is a one-line change and nothing else moves:
use minillmlib::GeneratorInfo;

GeneratorInfo::openrouter("google/gemini-2.5-flash-lite");    // OPENROUTER_API_KEY
GeneratorInfo::openai("gpt-4o-mini");                         // OPENAI_API_KEY
GeneratorInfo::anthropic("claude-haiku-4-5");                 // ANTHROPIC_API_KEY, native /v1/messages
GeneratorInfo::claude_subscription("claude-haiku-4-5");       // your Pro/Max plan, no API key
GeneratorInfo::custom("my", "http://localhost:8000/v1", "m"); // your own OpenAI-compatible server
See Providers for what each does, and Custom Providers for connecting your own server.
A multi-turn conversation
Each chat returns the assistant node; chain from it to continue the thread.
use minillmlib::{ChatNode, GeneratorInfo};

#[tokio::main]
async fn main() -> minillmlib::Result<()> {
    let gen = GeneratorInfo::openrouter("google/gemini-2.5-flash-lite");
    let root = ChatNode::root("You are a terse assistant.");

    let a1 = root.chat("What's the capital of France?", &gen).await?;
    let a2 = a1.chat("And its population, roughly?", &gen).await?;
    println!("{}", a2.message.text().unwrap_or(""));

    // a2 knows its whole history: a2.thread() is the full root-to-leaf message list.
    Ok(())
}
Errors
Every fallible call returns minillmlib::Result<T>, an alias for
Result<T, MiniLLMError>. The library fails loudly: an auth/validation error,
a malformed response, or an exhausted retry surfaces as a typed MiniLLMError,
never as a silent empty success.
Conversation Trees
A conversation is a tree of ChatNode handles. Each node holds one Message;
children are alternate continuations. A linear chat is just a tree with one
branch. Holding any node keeps its whole ancestor chain (and the tree) alive.
Building a thread
add_user / add_assistant each return the new node, so you chain them:
use minillmlib::ChatNode;

let root = ChatNode::root("You are a terse assistant.");
let leaf = root
    .add_user("What's the capital of France?")
    .add_assistant("Paris.")
    .add_user("And Germany?")
    .add_assistant("Berlin.")
    .add_user("And Italy?"); // the turn we want answered
leaf.thread() is the full [system, user, assistant, user, assistant, user]
message list from root to leaf.
Completing from any node
node.complete(generator, params) uses node's root-to-node path as the prompt
and appends the reply as a child of node, returning the new assistant node.
use minillmlib::{ChatNode, GeneratorInfo};

async fn run(leaf: ChatNode, gen: GeneratorInfo) -> minillmlib::Result<()> {
    let answer = leaf.complete(&gen, None).await?; // None = default per-request params
    println!("{}", answer.message.text().unwrap_or(""));
    Ok(())
}
You can complete from any node, not just the leaf, to branch off it. The whole root-to-that-node path is the context.
Prebuilt history from a Vec<Message>
When you already have a message list, from_messages builds the linear chain and
hands back (root, leaf). Complete from the leaf.
use minillmlib::{ChatNode, GeneratorInfo, Message};

async fn run(gen: GeneratorInfo) -> minillmlib::Result<()> {
    let history = vec![
        Message::system("You are a terse assistant."),
        Message::user("What's the capital of France?"),
        Message::assistant("Paris."),
        Message::user("And Germany?"),
        Message::assistant("Berlin."),
        Message::user("And Italy?"),
    ];

    let (_root, leaf) = ChatNode::from_messages(&history)?;
    let answer = leaf.complete(&gen, None).await?;
    Ok(())
}
Ownership: keep a handle
The tree lives in a shared arena that stays alive as long as you hold any
handle into it. from_messages returns both root and leaf precisely so you
don't accidentally drop the only handle. When you chain add_user/add_assistant,
holding the final node is enough (it keeps its whole ancestor chain). Drop every
handle and the thread is freed.
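To make the ownership rule concrete, here is a minimal std-only sketch of one way a shared arena with handles can behave. It is not the library's implementation: `Handle`, `NodeData`, `add`, and this `thread` are hypothetical stand-ins chosen for illustration, and the real ChatNode may store its tree differently.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Hypothetical stand-in for node storage: every node lives in one shared Vec,
// and a node refers to its parent by index.
struct NodeData {
    text: String,
    parent: Option<usize>,
}

// A handle is (shared arena, index). Every handle holds a reference-counted
// pointer to the arena, so the arena lives as long as any handle does.
#[derive(Clone)]
struct Handle {
    arena: Rc<RefCell<Vec<NodeData>>>,
    index: usize,
}

impl Handle {
    fn root(text: &str) -> Handle {
        let arena = Rc::new(RefCell::new(vec![NodeData {
            text: text.to_string(),
            parent: None,
        }]));
        Handle { arena, index: 0 }
    }

    // Append a child of this node and return a handle to the child.
    fn add(&self, text: &str) -> Handle {
        let mut nodes = self.arena.borrow_mut();
        nodes.push(NodeData {
            text: text.to_string(),
            parent: Some(self.index),
        });
        Handle {
            arena: Rc::clone(&self.arena),
            index: nodes.len() - 1,
        }
    }

    // Walk parent links up to the root, then reverse into root-to-node order.
    fn thread(&self) -> Vec<String> {
        let nodes = self.arena.borrow();
        let mut out = Vec::new();
        let mut cursor = Some(self.index);
        while let Some(i) = cursor {
            out.push(nodes[i].text.clone());
            cursor = nodes[i].parent;
        }
        out.reverse();
        out
    }
}

fn main() {
    // Only the final handle is kept; the temporary handles for the root and
    // the middle node are dropped at the end of this statement.
    let leaf = Handle::root("system").add("user").add("assistant");
    println!("{:?}", leaf.thread());
}
```

In this sketch the leaf handle alone keeps the arena alive, which is why the full root-to-leaf path is still readable after the earlier handles are gone. Dropping the leaf releases the last reference and frees every node.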
Saving and loading threads
use minillmlib::ChatNode;

fn run(leaf: ChatNode) -> minillmlib::Result<()> {
    leaf.save_thread("conversation.json")?;
    let (root, leaf) = ChatNode::from_thread_file("conversation.json")?;
    Ok(())
}
Providers
A GeneratorInfo bundles a model, a base URL, an auth strategy, and a
Provider (the wire dialect). The provider owns everything that differs between
APIs; your calling code never changes. The crate ships these presets:
| Preset | Wire | Auth (env var) | Cost |
|---|---|---|---|
| GeneratorInfo::openrouter(model) | OpenAI /chat/completions | OPENROUTER_API_KEY | native USD, with a /generation fallback |
| GeneratorInfo::openai(model) | OpenAI /chat/completions | OPENAI_API_KEY | token-only (set a TokenPrice) |
| GeneratorInfo::anthropic(model) | native /v1/messages, content[] | ANTHROPIC_API_KEY (x-api-key) | token-only (set a TokenPrice) |
| GeneratorInfo::claude_subscription(model) | native /v1/messages | Pro/Max OAuth token | token-only ESTIMATE |
| GeneratorInfo::custom(name, base_url, model) | OpenAI-compatible (default) | none unless you add one | token-only |
Auth
Auth is a strategy on the generator, mapped to concrete headers by the provider (so the same Anthropic provider serves both an API key and a subscription token):
use minillmlib::GeneratorInfo;

let g = GeneratorInfo::openai("gpt-4o-mini");
g.clone().with_api_key("sk-...");          // provider picks the header (Bearer / x-api-key)
g.clone().with_api_key_from_env("MY_KEY"); // no-op if the var is unset
g.clone().with_bearer_token("token");      // always Authorization: Bearer
g.clone().with_header("X-Tenant", "acme"); // any extra header
Cost for token-only providers
OpenAI and Anthropic return token counts but no dollar amount. Attach a
TokenPrice (USD per million tokens, the unit every price sheet quotes) to
get a resolved cost; otherwise tracking reports Unpriced (never a fake $0):
use minillmlib::{GeneratorInfo, TokenPrice};

let gen = GeneratorInfo::anthropic("claude-haiku-4-5")
    .with_token_price(TokenPrice::new(1.0, 5.0)); // $1/Mtok in, $5/Mtok out
See Cost Tracking for the full picture.
OpenRouter routing
OpenRouter-specific routing (provider order, sort, data-collection) is attached
honestly through the extra escape hatch rather than masquerading as a universal
parameter:
use minillmlib::{CompletionParameters, ProviderSettings};

let routing = ProviderSettings::new()
    .sort_by_throughput()
    .deny_data_collection();
let params = CompletionParameters::new()
    .with_openrouter_routing(routing);
Non-OpenRouter providers simply ignore it.
Completion Parameters
Two layers of parameters:
- CompletionParameters: normalized generation intent (temperature, max tokens, stop, response format, ...). NOT a wire shape: each provider's build_request maps it to its own request body, so the same params drive any provider identically.
- NodeCompletionParameters: per-request behavior around the call (system prompt override, JSON repair, retry, cost tracking, caching, the wrapped CompletionParameters).
You pass NodeCompletionParameters to complete; None means defaults.
CompletionParameters
use minillmlib::CompletionParameters;

let params = CompletionParameters::new()
    .with_max_tokens(512)
    .with_temperature(0.7)
    .with_stop(vec!["END".to_string()]);
| Field | Meaning |
|---|---|
| max_tokens | Provider emits its own key (max_completion_tokens, max_tokens, Anthropic's required max_tokens) |
| temperature, top_p, top_k | Sampling |
| frequency_penalty, presence_penalty, repetition_penalty | Penalties |
| stop | Stop sequences (Anthropic stop_sequences) |
| seed | Reproducibility |
| response_format | Force JSON output (with_json_response()) |
| reasoning | Extended-thinking effort/budget |
| tools, tool_choice | Tool definitions (OpenAI-shaped, passed through) |
| extra | Provider-specific keys (the honest escape hatch, e.g. OpenRouter routing) |
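As an illustration of the max_tokens row, the sketch below shows how one normalized limit could turn into a different request key per dialect. It is a hedged, std-only sketch: the `Dialect` enum, `token_limit_field`, and the fallback value of 1024 are all assumptions made for this example. The library expresses this through each Provider's build_request, and its actual default for Anthropic is not stated in this guide.

```rust
// Hypothetical dialect tags, used only in this sketch.
#[derive(Clone, Copy)]
enum Dialect {
    OpenAi,       // current OpenAI-style servers
    LegacyOpenAi, // older servers that only accept `max_tokens`
    Anthropic,    // `max_tokens` is a required field
}

// Return the (key, value) pair the request body would carry for the token
// limit, or None when the dialect allows leaving it out.
fn token_limit_field(dialect: Dialect, max_tokens: Option<u32>) -> Option<(&'static str, u32)> {
    // ASSUMPTION: an illustrative fallback for the dialect that requires the field.
    const ASSUMED_DEFAULT: u32 = 1024;
    match dialect {
        Dialect::OpenAi => max_tokens.map(|n| ("max_completion_tokens", n)),
        Dialect::LegacyOpenAi => max_tokens.map(|n| ("max_tokens", n)),
        Dialect::Anthropic => Some(("max_tokens", max_tokens.unwrap_or(ASSUMED_DEFAULT))),
    }
}

fn main() {
    for dialect in [Dialect::OpenAi, Dialect::LegacyOpenAi, Dialect::Anthropic] {
        println!("{:?}", token_limit_field(dialect, Some(512)));
    }
}
```

The point of the design is that calling code sets one field and never learns which key went on the wire.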
NodeCompletionParameters
use minillmlib::{CompletionParameters, NodeCompletionParameters};

let params = NodeCompletionParameters::new()
    .with_params(CompletionParameters::new().with_max_tokens(200))
    .with_system_prompt("You are concise.") // prepend if the thread has no system message
    .expecting_json()                       // parse + repair the response as JSON
    .with_force_prepend("Answer: ")         // make the model continue from this prefix
    .with_cost_tracking(true);              // request usage and fire the cost callback
| Builder | Meaning |
|---|---|
| with_params(..) | The wrapped CompletionParameters |
| with_system_prompt(..) | Prepend a system message if absent |
| with_format_kwargs(..) / with_format_kwarg(k, v) | Fill {placeholder}s thread-wide at call time |
| with_parse_json(true) / expecting_json() | Repair the response as JSON |
| with_force_prepend(..) | Prime the assistant turn so the model continues it |
| with_cache(true) | Auto-mark the whole prefix for caching (see Caching) |
| with_cost_tracking(true) | Request and report usage/cost |
| with_token_price(..) | Per-request price override |
| retry, exp_back_off, back_off_time, max_back_off | Retry policy |
| crash_on_refusal, crash_on_empty_response | Reject empty / no-JSON responses |
| timeout_secs | Total deadline (non-streaming) or idle timeout (streaming) |
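The difference between a total deadline and an idle timeout is easiest to see in code. The following std-only sketch uses a channel as a stand-in for an SSE connection. `read_stream`, `spawn_producer`, and `StreamEnd` are hypothetical names for this example and say nothing about how the library implements its timeout.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum StreamEnd {
    Finished(String),
    IdleTimeout,
}

// Collect chunks until the sender hangs up. The timer restarts on every chunk,
// so only the gap between chunks is bounded, not the total duration.
fn read_stream(rx: mpsc::Receiver<String>, idle: Duration) -> StreamEnd {
    let mut text = String::new();
    loop {
        match rx.recv_timeout(idle) {
            Ok(chunk) => text.push_str(&chunk),
            Err(RecvTimeoutError::Disconnected) => return StreamEnd::Finished(text),
            Err(RecvTimeoutError::Timeout) => return StreamEnd::IdleTimeout,
        }
    }
}

// Spawn a producer that sends `chunks` messages with `gap` between them, then
// either hangs up (a finished stream) or stalls with the channel still open.
fn spawn_producer(chunks: usize, gap: Duration, stall: bool) -> mpsc::Receiver<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for i in 0..chunks {
            thread::sleep(gap);
            if tx.send(format!("{} ", i)).is_err() {
                return; // the reader gave up
            }
        }
        if stall {
            thread::sleep(Duration::from_secs(2)); // a dead connection: open but silent
        }
    });
    rx
}

fn main() {
    let idle = Duration::from_millis(500);
    // Six chunks 100 ms apart: the whole stream outlasts `idle`, no single gap does.
    let live = read_stream(spawn_producer(6, Duration::from_millis(100), false), idle);
    // One chunk, then silence for longer than `idle`.
    let dead = read_stream(spawn_producer(1, Duration::from_millis(10), true), idle);
    println!("{:?} / {:?}", live, dead);
}
```

A total deadline of 500 ms would have cut the first stream short even though it was healthy. The idle timeout lets it run to completion and still catches the stalled one.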
Cost Tracking
The library tracks usage and cost per request, and is honest about when a cost is actually known.
Token buckets
Input tokens are split into three disjoint, additive buckets so caching is priced correctly across every provider's differing wire conventions:
- uncached_input_tokens: full-price prompt tokens,
- cache_read_tokens: served from a warm cache (cheap),
- cache_write_tokens: written to the cache this request (a premium).
Total input is the sum of the three; cost is a clean weighted sum, no subtraction.
Resolution: never a fake $0
Every reported CostInfo carries a CostResolution:
| Resolution | Meaning |
|---|---|
| Resolved | The USD cost is authoritative (native, or tokens × a configured TokenPrice) |
| Unpriced | Tokens are real, but no native cost and no TokenPrice was set. cost is 0.0 but must NOT be treated as a free request. Set a TokenPrice to resolve it. |
| Unknown | Cost could not be determined at all (no usage, and any out-of-band query failed) |
Check resolution before trusting cost.
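One way to act on that advice is to branch on the resolution before adding anything to a running total. The sketch below is std-only and uses local mirror types: this `CostInfo` carries just the two fields the example needs, and `Ledger` is a hypothetical accumulator, not part of the library.

```rust
// Local mirrors of the library's types, reduced to what this sketch needs.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CostResolution {
    Resolved,
    Unpriced,
    Unknown,
}

struct CostInfo {
    cost: f64,
    resolution: CostResolution,
}

// Hypothetical accumulator: dollars only from resolved costs, and a count of
// every request whose cost could not be trusted.
#[derive(Debug, Default, PartialEq)]
struct Ledger {
    usd: f64,
    unpriced_requests: u32,
    unknown_requests: u32,
}

impl Ledger {
    fn record(&mut self, info: &CostInfo) {
        match info.resolution {
            // Only a resolved cost is real money.
            CostResolution::Resolved => self.usd += info.cost,
            // cost is 0.0 here, but the request was not free: flag it instead.
            CostResolution::Unpriced => self.unpriced_requests += 1,
            CostResolution::Unknown => self.unknown_requests += 1,
        }
    }
}

fn main() {
    let mut ledger = Ledger::default();
    ledger.record(&CostInfo { cost: 0.25, resolution: CostResolution::Resolved });
    ledger.record(&CostInfo { cost: 0.0, resolution: CostResolution::Unpriced });
    ledger.record(&CostInfo { cost: 0.0, resolution: CostResolution::Unknown });
    println!("{:?}", ledger);
}
```

Keeping the unpriced and unknown counts next to the total makes it visible when the dollar figure understates real spend.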
A callback per completion
use minillmlib::{ChatNode, GeneratorInfo, NodeCompletionParameters, CompletionParameters, CostInfo};
use std::sync::{Arc, Mutex};

async fn run() -> minillmlib::Result<()> {
    let gen = GeneratorInfo::openrouter("google/gemini-2.5-flash-lite");
    let total = Arc::new(Mutex::new(0.0));
    let sink = total.clone();

    let params = NodeCompletionParameters::new()
        .with_params(CompletionParameters::new().with_max_tokens(200))
        .with_cost_tracking(true)
        .with_cost_callback(move |info: CostInfo| {
            // info.cost, .prompt_tokens, .completion_tokens,
            // .cache_read_tokens, .cache_write_tokens, .resolution
            *sink.lock().unwrap() += info.cost;
        });

    let root = ChatNode::root("You are helpful.");
    root.add_user("Hi").complete(&gen, Some(&params)).await?;
    println!("total spent: {}", *total.lock().unwrap());
    Ok(())
}
Enforced tracking via CompletionContext
When you want cost reporting to be structurally guaranteed (not opt-in per call),
wrap the generator in a CompletionContext and use complete_tracked. It always
reports cost through the context's async callback, and on a cancelled or
usage-less stream it resolves out-of-band (e.g. OpenRouter's /generation query)
or reports Unknown, rather than silently booking $0.
use minillmlib::{CompletionContext, CostInfo, AsyncCostCallback, CompletionMeta, GeneratorInfo, ChatNode};
use std::sync::Arc;

async fn run() -> minillmlib::Result<()> {
    let generator = GeneratorInfo::openrouter("m");
    let callback: AsyncCostCallback = Arc::new(|cost: CostInfo, _meta: CompletionMeta| {
        Box::pin(async move {
            // persist `cost` to your DB / metering here
            let _ = cost;
        })
    });

    let ctx = CompletionContext::new(generator, serde_json::json!({}), callback, "https://app", "App");
    let root = ChatNode::root("You are helpful.");
    let _answer = root.add_user("Hi").complete_tracked(&ctx, None).await?;
    Ok(())
}
For streaming, complete_streaming_tracked returns a TrackedStream that settles
cost when it finishes or is cancelled (use cancel().await for a reliable
settle; a plain drop is best-effort).
Prompt Caching
Caching intent is marked on the conversation tree; the provider decides the wire.
Anthropic emits cache_control markers (honoring its 4-breakpoint cap); OpenAI
and OpenRouter auto-cache and ignore the marks. Switch the provider and the same
code works.
Mark what to cache
use minillmlib::ChatNode;

let root = ChatNode::root("a large, stable system prompt ...");
root.cache_breakpoint(); // cache just the system prompt

// ...or cache the whole stable prefix of a conversation:
let some_node = root.clone();
some_node.cache_breakpoint();
Or, per request, auto-mark the entire prompt prefix without touching individual nodes:
use minillmlib::NodeCompletionParameters;

let params = NodeCompletionParameters::new().with_cache(true);
Explicit per-node marks are always honored in addition.
Clearing marks
use minillmlib::ChatNode;

let node = ChatNode::root("x");
node.clear_cache_breakpoint();      // this node
node.clear_all_cache_breakpoints(); // the whole tree
Warming the cache
ensure_cached fires a zero-output request that writes/refreshes the cache for a
node's prefix, returning the CostInfo of the warm call. Cheap to call before an
agent run: cold pays the one-time write (which you'd pay on the next real call
anyway); warm is a cheap read that refreshes the TTL.
use minillmlib::{ChatNode, GeneratorInfo};

async fn run(some_node: ChatNode, generator: GeneratorInfo) -> minillmlib::Result<()> {
    let warm_cost = some_node.ensure_cached(&generator, None).await?;
    let _ = warm_cost;
    Ok(())
}
Pricing cached tokens
Cache reads and writes have their own rates (read is a discount, write a premium):
use minillmlib::TokenPrice;

let price = TokenPrice::new(1.0, 5.0) // $/Mtok input, output
    .with_cache_rates(0.1, 1.25);     // $/Mtok cache-read, cache-write
The three input buckets (uncached / cache-read / cache-write) are billed at their own rates; see Cost Tracking.
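A worked example may help here. The sketch below applies the rates from the snippet above (1.0 input, 5.0 output, 0.1 cache-read, 1.25 cache-write, all USD per million tokens) to a made-up request. `Usage`, `Price`, and `cost_usd` are local stand-ins written for this example, and the token counts are invented. The library's own TokenPrice::cost_of may differ in detail.

```rust
// Local stand-ins for the library's usage and price types.
struct Usage {
    uncached_input_tokens: u64,
    cache_read_tokens: u64,
    cache_write_tokens: u64,
    completion_tokens: u64,
}

// All rates are USD per million tokens.
struct Price {
    input: f64,
    output: f64,
    cache_read: f64,
    cache_write: f64,
}

// Weighted sum over disjoint buckets: each token is counted exactly once,
// so nothing has to be subtracted.
fn cost_usd(u: &Usage, p: &Price) -> f64 {
    let weighted = u.uncached_input_tokens as f64 * p.input
        + u.cache_read_tokens as f64 * p.cache_read
        + u.cache_write_tokens as f64 * p.cache_write
        + u.completion_tokens as f64 * p.output;
    weighted / 1_000_000.0
}

fn main() {
    let price = Price { input: 1.0, output: 5.0, cache_read: 0.1, cache_write: 1.25 };
    // Invented request: 16,000 input tokens split across the three buckets.
    let usage = Usage {
        uncached_input_tokens: 2_000, // 2,000  x 1.00 = 2,000
        cache_read_tokens: 10_000,    // 10,000 x 0.10 = 1,000
        cache_write_tokens: 4_000,    // 4,000  x 1.25 = 5,000
        completion_tokens: 500,       // 500    x 5.00 = 2,500
    };
    // (2,000 + 1,000 + 5,000 + 2,500) / 1,000,000 = 0.0105 USD
    println!("${:.4}", cost_usd(&usage, &price));
}
```

Had all 16,000 input tokens been billed at the full input rate, the input side alone would have cost $0.0160. With the buckets it costs $0.0080, because most of the prompt was a cache read.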
Custom Providers
Connecting your own server is one of two cases.
Case A: your server speaks OpenAI's /chat/completions
vLLM, llama.cpp's server, LM Studio, TGI, Ollama's OpenAI endpoint, or your own
OpenAI-compatible wrapper. Nothing custom to write: point custom() at it. The
default GenericProvider handles the wire.
use minillmlib::{ChatNode, GeneratorInfo, TokenPrice};

async fn run() -> minillmlib::Result<()> {
    // base_url is everything BEFORE /chat/completions; the provider appends the path.
    let gen = GeneratorInfo::custom("my-server", "http://localhost:8000/v1", "my-model")
        .with_api_key_from_env("MY_SERVER_KEY")       // omit entirely if unauthenticated
        .with_header("X-Tenant", "acme")              // any extra gateway headers
        .with_token_price(TokenPrice::new(0.0, 0.0)); // $/Mtok; 0/0 for a free local model

    let answer = ChatNode::root("You are helpful.")
        .chat("hello", &gen).await?;
    println!("{}", answer.message.text().unwrap_or(""));
    Ok(())
}
For an older server that only accepts max_tokens (not max_completion_tokens):
use minillmlib::{GeneratorInfo, GenericProvider};
use std::sync::Arc;

let gen = GeneratorInfo::custom("old", "http://localhost:8000/v1", "m")
    .with_provider(Arc::new(GenericProvider { legacy_token_limit: true }));
Case B: your server has a different wire
Different endpoint, auth header, request/response shape: implement the Provider
trait once and pass it via with_provider. The user-facing API
(root.chat(...)) stays identical.
Below is a complete adapter for a made-up "EchoAI" server with a genuinely
different wire: endpoint /api/generate, auth header X-Echo-Key, request
{model, prompt, settings}, response {output:{text}, meta}. This mirrors the
tested example in tests/integration_tests.rs.
use minillmlib::{
    Auth, ChatNode, CompletionParameters, CompletionResponse, CostOutcome, GeneratorInfo,
    Message, MessageContent, Provider, StreamChunk, TokenPrice, Usage,
};
use secrecy::ExposeSecret;
use std::sync::Arc;

#[derive(Debug, Clone)]
struct EchoAi;

impl Provider for EchoAi {
    fn endpoint_url(&self, base: &str) -> String {
        format!("{}/api/generate", base.trim_end_matches('/'))
    }

    fn auth_headers(&self, auth: &Auth) -> minillmlib::Result<Vec<(String, String)>> {
        Ok(match auth.secret() {
            Some(s) => vec![("X-Echo-Key".into(), s.expose_secret().to_string())],
            None => vec![],
        })
    }

    fn build_request(
        &self,
        model: &str,
        messages: &[Message],
        params: &CompletionParameters,
        _stream: bool,
        _include_usage: bool,
    ) -> minillmlib::Result<serde_json::Value> {
        // Flatten the conversation into one prompt. Fail loudly on multimodal
        // (this wire is text-only) instead of silently dropping the attachment.
        let mut lines = Vec::new();
        for m in messages {
            if let MessageContent::Parts(parts) = &m.content {
                if parts.iter().any(|p| p.as_text().is_none()) {
                    return Err(minillmlib::MiniLLMError::InvalidParameter(
                        "EchoAI is text-only".into(),
                    ));
                }
            }
            lines.push(format!("{}: {}", m.role.as_str(), m.content.all_text()));
        }
        Ok(serde_json::json!({
            "model": model,
            "prompt": lines.join("\n"),
            "settings": { "max_output_tokens": params.max_tokens.unwrap_or(256) },
        }))
    }

    fn parse_response(&self, raw: serde_json::Value) -> minillmlib::Result<CompletionResponse> {
        let text = raw["output"]["text"].as_str()
            .ok_or_else(|| minillmlib::MiniLLMError::MalformedResponse(raw.to_string()))?
            .to_string();
        Ok(CompletionResponse {
            id: raw["meta"]["id"].as_str().unwrap_or("").into(),
            model: raw["meta"]["model"].as_str().unwrap_or("").into(),
            content: text,
            finish_reason: raw["stop"].as_str().map(String::from),
            usage: self.parse_usage(&raw),
            tool_calls: None,
            raw_response: Some(raw),
        })
    }

    fn parse_usage(&self, raw: &serde_json::Value) -> Option<Usage> {
        let meta = raw.get("meta")?;
        Some(Usage {
            uncached_input_tokens: meta["tokens_in"].as_u64().unwrap_or(0) as u32,
            completion_tokens: meta["tokens_out"].as_u64().unwrap_or(0) as u32,
            ..Default::default()
        })
    }

    fn parse_chunk(&self, _data: &str) -> Option<minillmlib::Result<StreamChunk>> {
        None // non-streaming
    }

    fn emits_stream_usage(&self, _requested: bool) -> bool {
        false // never sends a trailing usage chunk; don't wait for one
    }

    fn cost_of(&self, usage: Usage, price: Option<&TokenPrice>) -> CostOutcome {
        match price {
            Some(p) => CostOutcome::resolved(p.cost_of(&usage), usage),
            None => CostOutcome::unpriced(usage),
        }
    }
}

async fn run() -> minillmlib::Result<()> {
    let gen = GeneratorInfo::custom("echoai", "https://my.host", "echo-1")
        .with_provider(Arc::new(EchoAi))
        .with_api_key("my-secret")
        .with_token_price(TokenPrice::new(1.0, 5.0));

    let answer = ChatNode::root("You are EchoAI.")
        .chat("hello", &gen).await?;
    let _ = answer;
    Ok(())
}
What to override
The trait defaults to the OpenAI dialect, so you override only what differs:
| Method | Override when |
|---|---|
| endpoint_url | the path isn't /chat/completions |
| auth_headers | auth isn't Authorization: Bearer |
| build_request | the request body isn't the OpenAI shape |
| parse_response | the response envelope isn't choices[] |
| parse_chunk | streaming chunks aren't OpenAI deltas (return None if non-streaming) |
| parse_usage | usage fields differ |
| emits_stream_usage | the server may never send a trailing usage chunk (return false, or the stream waits for one that never comes) |
| cost_of | cost is derived differently |
| resolve_post_stream | there's an out-of-band cost endpoint |
Two rules to copy from the example
- Fail loudly on anything you can't represent. EchoAI rejects multimodal rather than silently flattening it away.
- Override emits_stream_usage to false if your server never sends a trailing usage chunk, or a streaming call will wait for it until the idle timeout.
Claude Subscription
Use your Claude Pro/Max subscription instead of a pay-as-you-go API key. A subscription OAuth token authenticates against the same native Anthropic API as an API key, but draws on your subscription's rolling quota (the 5-hour / 7-day window) rather than API billing.
use minillmlib::{ChatNode, GeneratorInfo, TokenPrice};

async fn run() -> minillmlib::Result<()> {
    // Anthropic returns token counts but no dollar cost, so set a price for a
    // resolved cost ESTIMATE (otherwise tracking reports `Unpriced`).
    let generator = GeneratorInfo::claude_subscription("claude-haiku-4-5")
        .with_token_price(TokenPrice::new(1.0, 5.0)); // $/Mtok in, $/Mtok out

    let root = ChatNode::root("You are helpful.");
    let response = root.chat("Hello!", &generator).await?;
    let _ = response;
    Ok(())
}
How the token is resolved
claude_subscription resolves the bearer token in this order:
- the ANTHROPIC_AUTH_TOKEN env var, if set (explicit override; you keep it fresh, e.g. from claude setup-token);
- otherwise the live Claude Code credential at ~/.claude/.credentials.json (claudeAiOauth.accessToken), which Claude Code keeps refreshed, so if you're logged into Claude Code with your subscription, it just works.
If neither source yields a token, the request fails loudly as unauthenticated rather than silently using the wrong account.
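The lookup order can be sketched as a small fallback chain. This is a std-only illustration, not the library's code: `resolve_token` is a hypothetical helper, the error string is invented, and the two sources are injected as closures so the order can be exercised without touching the real environment or credentials file.

```rust
// Try the explicit override first, then the fallback source, and fail loudly
// if neither yields a token.
fn resolve_token<E, F>(env_lookup: E, file_lookup: F) -> Result<String, String>
where
    E: FnOnce() -> Option<String>,
    F: FnOnce() -> Option<String>,
{
    env_lookup()
        // The second source is only consulted when the first is absent.
        .or_else(file_lookup)
        .ok_or_else(|| "unauthenticated: no subscription token found".to_string())
}

fn main() {
    let token = resolve_token(
        || std::env::var("ANTHROPIC_AUTH_TOKEN").ok(),
        // Reading and parsing ~/.claude/.credentials.json is left out of this
        // sketch; a real lookup would return claudeAiOauth.accessToken here.
        || None,
    );
    match token {
        Ok(_) => println!("token found"),
        Err(e) => println!("{}", e),
    }
}
```

Because the fallback is lazy, a set environment variable always wins and the credentials file is never read in that case.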
Subscription vs Console
A subscription token (from Claude Code) bills your Pro/Max plan. A Console/API OAuth token bills your API account, not the subscription. For Console, use an API key via GeneratorInfo::anthropic(model); use this preset only for the actual Pro/Max subscription token.
Cost is always an ESTIMATE here: Anthropic returns only token counts, so the
TokenPrice you set (reflecting the model's published price) produces a
Resolved USD estimate; without it, tracking reports Unpriced.
API Reference
The full, auto-generated API reference (every public type, method, and signature) lives on docs.rs at docs.rs/minillmlib.
It is generated from the source doc comments and rebuilt automatically when a new version is published to crates.io. This guide covers how to use the library; the docs.rs reference is the exhaustive signature lookup.