MiniLLMLib (Rust)

A minimalist, async-first Rust library for talking to Large Language Models over HTTP, with one consistent API across every provider.

The headline idea: ChatNode::root(...).chat(...) is identical no matter what is behind it. A Provider owns the entire wire dialect (endpoint, auth, request body, response and stream envelope, cost accounting). Your code only ever deals in normalized types, so switching from OpenRouter to OpenAI, to a native Anthropic key, to a Claude subscription, or to your own self-hosted server is a one-line change.

What's here

  • Conversation trees. A conversation is a tree of ChatNode handles. Linear chats, branching, and prebuilt history all use the same structure.
  • Multiple providers. OpenRouter, OpenAI, native Anthropic (/v1/messages), a generic OpenAI-compatible provider for self-hosted servers, and your own hand-written impl Provider for any other wire.
  • Streaming over SSE, with an idle-timeout that won't kill a long live generation but fails loudly on a dead connection.
  • Honest cost tracking. Per-provider usage and cost, with disjoint cached/uncached/cache-write token buckets and a CostResolution (Resolved / Unpriced / Unknown) that never reports a fake $0.
  • Prompt caching, marked on the tree and enforced per-provider.
  • Claude subscription auth: use your Pro/Max plan instead of an API key.
  • JSON repair for malformed model output.

Two layers of documentation

| Layer | What | Where |
|---|---|---|
| This guide | Tutorials, patterns, worked examples | the pages on the left |
| API reference | Every public type, method, and signature | docs.rs/minillmlib |

The guide teaches you how to use the library; the API reference (auto-generated from the source by docs.rs) is the exhaustive signature lookup. Start with Quickstart.

Quickstart

Install

# Cargo.toml
[dependencies]
minillmlib = "0.3"
tokio = { version = "1", features = ["full"] }

One call

use minillmlib::{ChatNode, GeneratorInfo};

#[tokio::main]
async fn main() -> minillmlib::Result<()> {
    // Pick a provider. OpenRouter reads OPENROUTER_API_KEY from the environment.
    let generator = GeneratorInfo::openrouter("google/gemini-2.5-flash-lite");

    // A conversation is a tree; `root` is the system prompt.
    let root = ChatNode::root("You are a helpful assistant. Be brief.");

    // `chat` = add a user message + get the assistant reply (returns the new node).
    let answer = root.chat("Say hello in five words.", &generator).await?;

    println!("{}", answer.message.text().unwrap_or(""));
    Ok(())
}

That is the 80% case. Swapping the provider is a one-line change and nothing else moves:

#![allow(unused)]
fn main() {
use minillmlib::GeneratorInfo;
GeneratorInfo::openrouter("google/gemini-2.5-flash-lite");      // OPENROUTER_API_KEY
GeneratorInfo::openai("gpt-4o-mini");                           // OPENAI_API_KEY
GeneratorInfo::anthropic("claude-haiku-4-5");                   // ANTHROPIC_API_KEY, native /v1/messages
GeneratorInfo::claude_subscription("claude-haiku-4-5");         // your Pro/Max plan, no API key
GeneratorInfo::custom("my", "http://localhost:8000/v1", "m");   // your own OpenAI-compatible server
}

See Providers for what each does, and Custom Providers for connecting your own server.

A multi-turn conversation

Each chat returns the assistant node; chain from it to continue the thread.

use minillmlib::{ChatNode, GeneratorInfo};

#[tokio::main]
async fn main() -> minillmlib::Result<()> {
    let generator = GeneratorInfo::openrouter("google/gemini-2.5-flash-lite");
    let root = ChatNode::root("You are a terse assistant.");

    let a1 = root.chat("What's the capital of France?", &generator).await?;
    let a2 = a1.chat("And its population, roughly?", &generator).await?;

    println!("{}", a2.message.text().unwrap_or(""));
    // a2 knows its whole history: a2.thread() is the full root-to-leaf message list.
    Ok(())
}

Errors

Every fallible call returns minillmlib::Result<T>, an alias for Result<T, MiniLLMError>. The library fails loudly: an auth or validation error, a malformed response, or an exhausted retry budget surfaces as a typed MiniLLMError, never as a silent empty success.

Conversation Trees

A conversation is a tree of ChatNode handles. Each node holds one Message; children are alternate continuations. A linear chat is just a tree with one branch. Holding any node keeps its whole ancestor chain (and the tree) alive.

Building a thread

add_user / add_assistant each return the new node, so you chain them:

#![allow(unused)]
fn main() {
use minillmlib::ChatNode;
let root = ChatNode::root("You are a terse assistant.");
let leaf = root
    .add_user("What's the capital of France?")
    .add_assistant("Paris.")
    .add_user("And Germany?")
    .add_assistant("Berlin.")
    .add_user("And Italy?"); // the turn we want answered
}

leaf.thread() is the full [system, user, assistant, user, assistant, user] message list from root to leaf.
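
To make the tree-and-thread idea concrete, here is a minimal, std-only sketch of a conversation tree built from parent pointers. The `Node` type and its methods are hypothetical stand-ins written for this illustration, not minillmlib's `ChatNode`, and the sketch only keeps ancestors alive (the library describes a shared arena that keeps the whole tree alive).

```rust
use std::rc::Rc;

// Hypothetical stand-in for a conversation node; NOT minillmlib's ChatNode.
#[derive(Debug)]
struct Node {
    role: &'static str,
    text: String,
    parent: Option<Rc<Node>>, // a child holds a strong reference to its parent
}

impl Node {
    fn root(text: &str) -> Rc<Node> {
        Rc::new(Node { role: "system", text: text.to_string(), parent: None })
    }

    // Create a child of `parent` and return the new handle.
    fn child(parent: &Rc<Node>, role: &'static str, text: &str) -> Rc<Node> {
        Rc::new(Node { role, text: text.to_string(), parent: Some(Rc::clone(parent)) })
    }

    // Walk parent links up to the root, then reverse into root-to-node order.
    fn thread(node: &Rc<Node>) -> Vec<(&'static str, String)> {
        let mut out = Vec::new();
        let mut cur = Some(node);
        while let Some(n) = cur {
            out.push((n.role, n.text.clone()));
            cur = n.parent.as_ref();
        }
        out.reverse();
        out
    }
}

fn main() {
    let root = Node::root("You are a terse assistant.");
    let q1 = Node::child(&root, "user", "What's the capital of France?");
    let a1 = Node::child(&q1, "assistant", "Paris.");
    // Two alternate continuations of the same prefix form a branch.
    let q2a = Node::child(&a1, "user", "And Germany?");
    let q2b = Node::child(&a1, "user", "And Italy?");

    for (role, text) in Node::thread(&q2a) {
        println!("{role}: {text}");
    }
    // Both branches share the same three-message prefix.
    assert_eq!(Node::thread(&q2b).len(), 4);
}
```

A linear chat is the special case where every node has exactly one child; the walk from any node back to the root is what becomes the prompt.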

Completing from any node

node.complete(generator, params) uses node's root-to-node path as the prompt and appends the reply as a child of node, returning the new assistant node.

#![allow(unused)]
fn main() {
use minillmlib::{ChatNode, GeneratorInfo};
async fn run(leaf: ChatNode, generator: GeneratorInfo) -> minillmlib::Result<()> {
    let answer = leaf.complete(&generator, None).await?; // None = default per-request params
    println!("{}", answer.message.text().unwrap_or(""));
    Ok(())
}
}

You can complete from any node, not just the leaf, to branch off it. The whole root-to-that-node path is the context.

Prebuilt history from a Vec<Message>

When you already have a message list, from_messages builds the linear chain and hands back (root, leaf). Complete from the leaf.

#![allow(unused)]
fn main() {
use minillmlib::{ChatNode, GeneratorInfo, Message};

async fn run(generator: GeneratorInfo) -> minillmlib::Result<()> {
    let history = vec![
        Message::system("You are a terse assistant."),
        Message::user("What's the capital of France?"),
        Message::assistant("Paris."),
        Message::user("And Germany?"),
        Message::assistant("Berlin."),
        Message::user("And Italy?"),
    ];

    // Keep a handle: dropping every handle frees the thread.
    let (_root, leaf) = ChatNode::from_messages(&history)?;
    let answer = leaf.complete(&generator, None).await?;
    println!("{}", answer.message.text().unwrap_or(""));
    Ok(())
}
}

Ownership: keep a handle

The tree lives in a shared arena that stays alive as long as you hold any handle into it. from_messages returns both root and leaf precisely so you don't accidentally drop the only handle. When you chain add_user/add_assistant, holding the final node is enough (it keeps its whole ancestor chain). Drop every handle and the thread is freed.

Saving and loading threads

#![allow(unused)]
fn main() {
use minillmlib::ChatNode;
fn run(leaf: ChatNode) -> minillmlib::Result<()> {
    // Save this node's thread, then load it back as a (root, leaf) pair.
    leaf.save_thread("conversation.json")?;
    let (root, leaf) = ChatNode::from_thread_file("conversation.json")?;
    Ok(())
}
}

Providers

A GeneratorInfo bundles a model, a base URL, an auth strategy, and a Provider (the wire dialect). The provider owns everything that differs between APIs; your calling code never changes. The crate ships these presets:

| Preset | Wire | Auth (env var) | Cost |
|---|---|---|---|
| GeneratorInfo::openrouter(model) | OpenAI /chat/completions | OPENROUTER_API_KEY | native USD, with a /generation fallback |
| GeneratorInfo::openai(model) | OpenAI /chat/completions | OPENAI_API_KEY | token-only (set a TokenPrice) |
| GeneratorInfo::anthropic(model) | native /v1/messages, content[] | ANTHROPIC_API_KEY (x-api-key) | token-only (set a TokenPrice) |
| GeneratorInfo::claude_subscription(model) | native /v1/messages | Pro/Max OAuth token | token-only ESTIMATE |
| GeneratorInfo::custom(name, base_url, model) | OpenAI-compatible (default) | none unless you add one | token-only |

Auth

Auth is a strategy on the generator, mapped to concrete headers by the provider (so the same Anthropic provider serves both an API key and a subscription token):

#![allow(unused)]
fn main() {
use minillmlib::GeneratorInfo;
let g = GeneratorInfo::openai("gpt-4o-mini");
g.clone().with_api_key("sk-...");                 // provider picks the header (Bearer / x-api-key)
g.clone().with_api_key_from_env("MY_KEY");        // no-op if the var is unset
g.clone().with_bearer_token("token");             // always Authorization: Bearer
g.clone().with_header("X-Tenant", "acme");        // any extra header
}

Cost for token-only providers

OpenAI and Anthropic return token counts but no dollar amount. Attach a TokenPrice (USD per million tokens, the unit every price sheet quotes) to get a resolved cost; otherwise tracking reports Unpriced (never a fake $0):

#![allow(unused)]
fn main() {
use minillmlib::{GeneratorInfo, TokenPrice};

let generator = GeneratorInfo::anthropic("claude-haiku-4-5")
    .with_token_price(TokenPrice::new(1.0, 5.0)); // $1/Mtok in, $5/Mtok out
}

See Cost Tracking for the full picture.

OpenRouter routing

OpenRouter-specific routing (provider order, sort, data-collection) is attached honestly through the extra escape hatch rather than masquerading as a universal parameter:

#![allow(unused)]
fn main() {
use minillmlib::{CompletionParameters, ProviderSettings};

let routing = ProviderSettings::new()
    .sort_by_throughput()
    .deny_data_collection();

let params = CompletionParameters::new()
    .with_openrouter_routing(routing);
}

Non-OpenRouter providers simply ignore it.

Completion Parameters

Two layers of parameters:

  • CompletionParameters: normalized generation intent (temperature, max tokens, stop, response format, ...). NOT a wire shape: each provider's build_request maps it to its own request body, so the same params drive any provider identically.
  • NodeCompletionParameters: per-request behavior around the call (system prompt override, JSON repair, retry, cost tracking, caching, the wrapped CompletionParameters).

You pass NodeCompletionParameters to complete; None means defaults.

CompletionParameters

#![allow(unused)]
fn main() {
use minillmlib::CompletionParameters;

let params = CompletionParameters::new()
    .with_max_tokens(512)
    .with_temperature(0.7)
    .with_stop(vec!["END".to_string()]);
}

| Field | Meaning |
|---|---|
| max_tokens | Provider emits its own key (max_completion_tokens, max_tokens, Anthropic's required max_tokens) |
| temperature, top_p, top_k | Sampling |
| frequency_penalty, presence_penalty, repetition_penalty | Penalties |
| stop | Stop sequences (Anthropic stop_sequences) |
| seed | Reproducibility |
| response_format | Force JSON output (with_json_response()) |
| reasoning | Extended-thinking effort/budget |
| tools, tool_choice | Tool definitions (OpenAI-shaped, passed through) |
| extra | Provider-specific keys (the honest escape hatch, e.g. OpenRouter routing) |

NodeCompletionParameters

#![allow(unused)]
fn main() {
use minillmlib::{CompletionParameters, NodeCompletionParameters};

let params = NodeCompletionParameters::new()
    .with_params(CompletionParameters::new().with_max_tokens(200))
    .with_system_prompt("You are concise.")  // prepend if the thread has no system message
    .expecting_json()                         // parse + repair the response as JSON
    .with_force_prepend("Answer: ")           // make the model continue from this prefix
    .with_cost_tracking(true);                // request usage and fire the cost callback
}

| Builder | Meaning |
|---|---|
| with_params(..) | The wrapped CompletionParameters |
| with_system_prompt(..) | Prepend a system message if absent |
| with_format_kwargs(..) / with_format_kwarg(k, v) | Fill {placeholder}s thread-wide at call time |
| with_parse_json(true) / expecting_json() | Repair the response as JSON |
| with_force_prepend(..) | Prime the assistant turn so the model continues it |
| with_cache(true) | Auto-mark the whole prefix for caching (see Caching) |
| with_cost_tracking(true) | Request and report usage/cost |
| with_token_price(..) | Per-request price override |
| retry, exp_back_off, back_off_time, max_back_off | Retry policy |
| crash_on_refusal, crash_on_empty_response | Reject empty / no-JSON responses |
| timeout_secs | Total deadline (non-streaming) or idle timeout (streaming) |
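
The guide does not spell out how the four retry fields combine, so the following is only a sketch of one common capped exponential back-off scheme. The `RetryPolicy` struct, the field types, the doubling factor, and the cap semantics are all assumptions made for illustration; check the API reference for the library's actual behavior.

```rust
use std::time::Duration;

// Hypothetical policy. The field names echo the table above, but the types and
// semantics (doubling per attempt, hard cap) are assumptions of this sketch.
struct RetryPolicy {
    retry: u32,         // maximum number of retries
    exp_back_off: bool, // grow the delay exponentially if true
    back_off_time: f64, // base delay in seconds
    max_back_off: f64,  // upper bound on any single delay, in seconds
}

// Delay to wait before retry number `attempt` (0-based), or None once the
// retry budget is exhausted.
fn delay_before(policy: &RetryPolicy, attempt: u32) -> Option<Duration> {
    if attempt >= policy.retry {
        return None;
    }
    let secs = if policy.exp_back_off {
        policy.back_off_time * 2f64.powi(attempt as i32)
    } else {
        policy.back_off_time
    };
    Some(Duration::from_secs_f64(secs.min(policy.max_back_off)))
}

fn main() {
    let policy = RetryPolicy { retry: 4, exp_back_off: true, back_off_time: 1.0, max_back_off: 5.0 };
    for attempt in 0..=policy.retry {
        match delay_before(&policy, attempt) {
            Some(d) => println!("retry {attempt}: wait {d:?}"),
            None => println!("retry {attempt}: give up"),
        }
    }
}
```

Under these assumptions the delays grow 1 s, 2 s, 4 s, then hit the 5 s cap, after which the call gives up and the error surfaces to the caller.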

Cost Tracking

The library tracks usage and cost per request, and is honest about when a cost is actually known.

Token buckets

Input tokens are split into three disjoint, additive buckets so caching is priced correctly across every provider's differing wire conventions:

  • uncached_input_tokens: full-price prompt tokens,
  • cache_read_tokens: served from a warm cache (cheap),
  • cache_write_tokens: written to the cache this request (a premium).

Total input is the sum of the three; cost is a clean weighted sum, no subtraction.
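
As a worked illustration of that weighted sum, the sketch below prices each bucket at its own rate in USD per million tokens. The `Usage` and `Rates` structs here are hypothetical mirrors defined only for this example (the bucket names follow the list above); they are not minillmlib's own types, and the rates are illustrative.

```rust
// Hypothetical mirror types for illustration; NOT minillmlib's Usage or TokenPrice.
struct Usage {
    uncached_input_tokens: u64,
    cache_read_tokens: u64,
    cache_write_tokens: u64,
    completion_tokens: u64,
}

// All rates are USD per million tokens.
struct Rates {
    input: f64,
    output: f64,
    cache_read: f64,
    cache_write: f64,
}

// The three input buckets are disjoint, so total input is a plain sum.
fn total_input(u: &Usage) -> u64 {
    u.uncached_input_tokens + u.cache_read_tokens + u.cache_write_tokens
}

// Each bucket is billed at its own rate; nothing is subtracted.
fn cost_usd(u: &Usage, r: &Rates) -> f64 {
    let per_mtok = |tokens: u64, rate: f64| tokens as f64 / 1_000_000.0 * rate;
    per_mtok(u.uncached_input_tokens, r.input)
        + per_mtok(u.cache_read_tokens, r.cache_read)
        + per_mtok(u.cache_write_tokens, r.cache_write)
        + per_mtok(u.completion_tokens, r.output)
}

fn main() {
    let usage = Usage {
        uncached_input_tokens: 2_000,
        cache_read_tokens: 50_000,
        cache_write_tokens: 8_000,
        completion_tokens: 1_000,
    };
    let rates = Rates { input: 1.0, output: 5.0, cache_read: 0.1, cache_write: 1.25 };
    println!("input tokens: {}", total_input(&usage));
    println!("cost: ${:.6}", cost_usd(&usage, &rates));
}
```

With these numbers the request has 60,000 input tokens and costs about $0.022: $0.002 uncached, $0.005 cache-read, $0.010 cache-write, and $0.005 output.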

Resolution: never a fake $0

Every reported CostInfo carries a CostResolution:

| Resolution | Meaning |
|---|---|
| Resolved | The USD cost is authoritative (native, or tokens × a configured TokenPrice) |
| Unpriced | Tokens are real, but no native cost and no TokenPrice was set. cost is 0.0 but must NOT be treated as a free request. Set a TokenPrice to resolve it. |
| Unknown | Cost could not be determined at all (no usage, and any out-of-band query failed) |

Check resolution before trusting cost.
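
One way to act on that advice is to book only resolved costs as dollars and count everything else separately. The sketch below uses hypothetical local types (`Resolution`, `Report`, `Ledger`) that mirror the table above; they are not the library's `CostResolution` or `CostInfo`, whose exact shapes are in the API reference.

```rust
// Hypothetical local mirror of the three resolutions described above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Resolution {
    Resolved,
    Unpriced,
    Unknown,
}

// Hypothetical per-request report: a cost plus how trustworthy it is.
struct Report {
    cost: f64,
    resolution: Resolution,
}

#[derive(Debug, Default, PartialEq)]
struct Ledger {
    resolved_usd: f64,
    unpriced_requests: u32,
    unknown_requests: u32,
}

// Only a Resolved cost is added to the dollar total; the other two cases are
// counted so that a 0.0 cost is never mistaken for a free request.
fn book(ledger: &mut Ledger, report: &Report) {
    match report.resolution {
        Resolution::Resolved => ledger.resolved_usd += report.cost,
        Resolution::Unpriced => ledger.unpriced_requests += 1,
        Resolution::Unknown => ledger.unknown_requests += 1,
    }
}

fn main() {
    let reports = [
        Report { cost: 0.25, resolution: Resolution::Resolved },
        Report { cost: 0.0, resolution: Resolution::Unpriced },
        Report { cost: 0.0, resolution: Resolution::Unknown },
    ];
    let mut ledger = Ledger::default();
    for report in &reports {
        book(&mut ledger, report);
    }
    println!("{ledger:?}");
}
```

Keeping the unpriced and unknown counts visible makes gaps in metering obvious instead of hiding them inside a total.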

A callback per completion

#![allow(unused)]
fn main() {
use minillmlib::{ChatNode, GeneratorInfo, NodeCompletionParameters, CompletionParameters, CostInfo};
use std::sync::{Arc, Mutex};

async fn run() -> minillmlib::Result<()> {
    let generator = GeneratorInfo::openrouter("google/gemini-2.5-flash-lite");
    let total = Arc::new(Mutex::new(0.0));
    let sink = total.clone();

    let params = NodeCompletionParameters::new()
        .with_params(CompletionParameters::new().with_max_tokens(200))
        .with_cost_tracking(true)
        .with_cost_callback(move |info: CostInfo| {
            // info.cost, .prompt_tokens, .completion_tokens,
            // .cache_read_tokens, .cache_write_tokens, .resolution
            *sink.lock().unwrap() += info.cost;
        });

    let root = ChatNode::root("You are helpful.");
    root.add_user("Hi").complete(&generator, Some(&params)).await?;
    println!("total spent: {}", *total.lock().unwrap());
    Ok(())
}
}

Enforced tracking via CompletionContext

When you want cost reporting to be structurally guaranteed (not opt-in per call), wrap the generator in a CompletionContext and use complete_tracked. It always reports cost through the context's async callback, and on a cancelled or usage-less stream it resolves out-of-band (e.g. OpenRouter's /generation query) or reports Unknown, rather than silently booking $0.

#![allow(unused)]
fn main() {
use minillmlib::{CompletionContext, CostInfo, AsyncCostCallback, CompletionMeta, GeneratorInfo, ChatNode};
use std::sync::Arc;

async fn run() -> minillmlib::Result<()> {
    let generator = GeneratorInfo::openrouter("m");
    let callback: AsyncCostCallback = Arc::new(|cost: CostInfo, _meta: CompletionMeta| {
        Box::pin(async move {
            // persist `cost` to your DB / metering here
            let _ = cost;
        })
    });
    // `serde_json::json!` needs the serde_json crate in your Cargo.toml.
    let ctx = CompletionContext::new(generator, serde_json::json!({}), callback, "https://app", "App");

    let root = ChatNode::root("You are helpful.");
    let _answer = root.add_user("Hi").complete_tracked(&ctx, None).await?;
    Ok(())
}
}

For streaming, complete_streaming_tracked returns a TrackedStream that settles cost when it finishes or is cancelled (use cancel().await for a reliable settle; a plain drop is best-effort).

Prompt Caching

Caching intent is marked on the conversation tree; the provider decides the wire. Anthropic emits cache_control markers (honoring its 4-breakpoint cap); OpenAI and OpenRouter auto-cache and ignore the marks. Switch the provider and the same code works.

Mark what to cache

#![allow(unused)]
fn main() {
use minillmlib::ChatNode;
let root = ChatNode::root("a large, stable system prompt ...");
root.cache_breakpoint();          // cache just the system prompt

// ...or cache the whole stable prefix of a conversation:
let some_node = root.clone();
some_node.cache_breakpoint();
}

Or, per request, auto-mark the entire prompt prefix without touching individual nodes:

#![allow(unused)]
fn main() {
use minillmlib::NodeCompletionParameters;
let params = NodeCompletionParameters::new().with_cache(true);
}

Explicit per-node marks are always honored in addition.

Clearing marks

#![allow(unused)]
fn main() {
use minillmlib::ChatNode;
let node = ChatNode::root("x");
node.clear_cache_breakpoint();        // this node
node.clear_all_cache_breakpoints();   // the whole tree
}

Warming the cache

ensure_cached fires a zero-output request that writes or refreshes the cache for a node's prefix and returns the CostInfo of the warm call. It is cheap to call before an agent run: on a cold cache it pays the one-time write (which you'd pay on the next real call anyway); on a warm cache it is a cheap read that refreshes the TTL.

#![allow(unused)]
fn main() {
use minillmlib::{ChatNode, GeneratorInfo};
async fn run(some_node: ChatNode, generator: GeneratorInfo) -> minillmlib::Result<()> {
    let warm_cost = some_node.ensure_cached(&generator, None).await?;
    let _ = warm_cost;
    Ok(())
}
}

Pricing cached tokens

Cache reads and writes have their own rates (read is a discount, write a premium):

#![allow(unused)]
fn main() {
use minillmlib::TokenPrice;

let price = TokenPrice::new(1.0, 5.0)      // $/Mtok input, output
    .with_cache_rates(0.1, 1.25);          // $/Mtok cache-read, cache-write
}

The three input buckets (uncached / cache-read / cache-write) are billed at their own rates; see Cost Tracking.

Custom Providers

Connecting your own server falls into one of two cases.

Case A: your server speaks OpenAI's /chat/completions

vLLM, llama.cpp's server, LM Studio, TGI, Ollama's OpenAI endpoint, or your own OpenAI-compatible wrapper. Nothing custom to write: point custom() at it. The default GenericProvider handles the wire.

#![allow(unused)]
fn main() {
use minillmlib::{ChatNode, GeneratorInfo, TokenPrice};

async fn run() -> minillmlib::Result<()> {
    // base_url is everything BEFORE /chat/completions; the provider appends the path.
    let generator = GeneratorInfo::custom("my-server", "http://localhost:8000/v1", "my-model")
        .with_api_key_from_env("MY_SERVER_KEY")    // omit entirely if unauthenticated
        .with_header("X-Tenant", "acme")           // any extra gateway headers
        .with_token_price(TokenPrice::new(0.0, 0.0)); // $/Mtok; 0/0 for a free local model

    let answer = ChatNode::root("You are helpful.")
        .chat("hello", &generator).await?;
    println!("{}", answer.message.text().unwrap_or(""));
    Ok(())
}
}

For an older server that only accepts max_tokens (not max_completion_tokens):

#![allow(unused)]
fn main() {
use minillmlib::{GeneratorInfo, GenericProvider};
use std::sync::Arc;

let generator = GeneratorInfo::custom("old", "http://localhost:8000/v1", "m")
    .with_provider(Arc::new(GenericProvider { legacy_token_limit: true }));
}

Case B: your server has a different wire

Different endpoint, auth header, request/response shape: implement the Provider trait once and pass it via with_provider. The user-facing API (root.chat(...)) stays identical.

Below is a complete adapter for a made-up "EchoAI" server with a genuinely different wire: endpoint /api/generate, auth header X-Echo-Key, request {model, prompt, settings}, response {output:{text}, meta}. This mirrors the tested example in tests/integration_tests.rs.

#![allow(unused)]
fn main() {
use minillmlib::{
    Auth, ChatNode, CompletionParameters, CompletionResponse, CostOutcome, GeneratorInfo,
    Message, MessageContent, Provider, StreamChunk, TokenPrice, Usage,
};
use secrecy::ExposeSecret; // this example also needs the secrecy and serde_json crates in Cargo.toml
use std::sync::Arc;

#[derive(Debug, Clone)]
struct EchoAi;

impl Provider for EchoAi {
    fn endpoint_url(&self, base: &str) -> String {
        format!("{}/api/generate", base.trim_end_matches('/'))
    }

    fn auth_headers(&self, auth: &Auth) -> minillmlib::Result<Vec<(String, String)>> {
        Ok(match auth.secret() {
            Some(s) => vec![("X-Echo-Key".into(), s.expose_secret().to_string())],
            None => vec![],
        })
    }

    fn build_request(
        &self, model: &str, messages: &[Message], params: &CompletionParameters,
        _stream: bool, _include_usage: bool,
    ) -> minillmlib::Result<serde_json::Value> {
        // Flatten the conversation into one prompt. Fail loudly on multimodal
        // (this wire is text-only) instead of silently dropping the attachment.
        let mut lines = Vec::new();
        for m in messages {
            if let MessageContent::Parts(parts) = &m.content {
                if parts.iter().any(|p| p.as_text().is_none()) {
                    return Err(minillmlib::MiniLLMError::InvalidParameter(
                        "EchoAI is text-only".into(),
                    ));
                }
            }
            lines.push(format!("{}: {}", m.role.as_str(), m.content.all_text()));
        }
        Ok(serde_json::json!({
            "model": model,
            "prompt": lines.join("\n"),
            "settings": { "max_output_tokens": params.max_tokens.unwrap_or(256) },
        }))
    }

    fn parse_response(&self, raw: serde_json::Value) -> minillmlib::Result<CompletionResponse> {
        let text = raw["output"]["text"].as_str()
            .ok_or_else(|| minillmlib::MiniLLMError::MalformedResponse(raw.to_string()))?
            .to_string();
        Ok(CompletionResponse {
            id: raw["meta"]["id"].as_str().unwrap_or("").into(),
            model: raw["meta"]["model"].as_str().unwrap_or("").into(),
            content: text,
            finish_reason: raw["stop"].as_str().map(String::from),
            usage: self.parse_usage(&raw),
            tool_calls: None,
            raw_response: Some(raw),
        })
    }

    fn parse_usage(&self, raw: &serde_json::Value) -> Option<Usage> {
        let meta = raw.get("meta")?;
        Some(Usage {
            uncached_input_tokens: meta["tokens_in"].as_u64().unwrap_or(0) as u32,
            completion_tokens: meta["tokens_out"].as_u64().unwrap_or(0) as u32,
            ..Default::default()
        })
    }

    fn parse_chunk(&self, _data: &str) -> Option<minillmlib::Result<StreamChunk>> {
        None // non-streaming
    }

    fn emits_stream_usage(&self, _requested: bool) -> bool {
        false // never sends a trailing usage chunk; don't wait for one
    }

    fn cost_of(&self, usage: Usage, price: Option<&TokenPrice>) -> CostOutcome {
        match price {
            Some(p) => CostOutcome::resolved(p.cost_of(&usage), usage),
            None => CostOutcome::unpriced(usage),
        }
    }
}

async fn run() -> minillmlib::Result<()> {
    let generator = GeneratorInfo::custom("echoai", "https://my.host", "echo-1")
        .with_provider(Arc::new(EchoAi))
        .with_api_key("my-secret")
        .with_token_price(TokenPrice::new(1.0, 5.0));

    let answer = ChatNode::root("You are EchoAI.")
        .chat("hello", &generator).await?;
    let _ = answer;
    Ok(())
}
}

What to override

The trait defaults to the OpenAI dialect, so you override only what differs:

| Method | Override when |
|---|---|
| endpoint_url | the path isn't /chat/completions |
| auth_headers | auth isn't Authorization: Bearer |
| build_request | the request body isn't the OpenAI shape |
| parse_response | the response envelope isn't choices[] |
| parse_chunk | streaming chunks aren't OpenAI deltas (return None if non-streaming) |
| parse_usage | usage fields differ |
| emits_stream_usage | the server may never send a trailing usage chunk (return false, or the stream waits for one that never comes) |
| cost_of | cost is derived differently |
| resolve_post_stream | there's an out-of-band cost endpoint |
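
The "override only what differs" pattern relies on ordinary Rust trait default methods. The std-only sketch below shows the mechanism with a hypothetical two-method `Dialect` trait; the real `Provider` trait has more methods and different signatures (see the EchoAI example and the API reference).

```rust
// Hypothetical two-method trait for illustration; NOT minillmlib's Provider.
trait Dialect {
    // Default: the OpenAI-style path appended to the base URL.
    fn endpoint_url(&self, base: &str) -> String {
        format!("{}/chat/completions", base.trim_end_matches('/'))
    }

    // Default: a bearer token in the Authorization header.
    fn auth_header(&self, secret: &str) -> (String, String) {
        ("Authorization".to_string(), format!("Bearer {secret}"))
    }
}

// Speaks the default dialect, so the impl body is empty.
struct OpenAiLike;
impl Dialect for OpenAiLike {}

// Differs only in its endpoint path, so it overrides just that method.
struct EchoLike;
impl Dialect for EchoLike {
    fn endpoint_url(&self, base: &str) -> String {
        format!("{}/api/generate", base.trim_end_matches('/'))
    }
}

fn main() {
    let dialects: Vec<Box<dyn Dialect>> = vec![Box::new(OpenAiLike), Box::new(EchoLike)];
    for d in &dialects {
        // Calling code is identical for every dialect.
        println!("{}", d.endpoint_url("http://localhost:8000/v1/"));
    }
}
```

Because callers only ever see the trait, adding a new wire means writing one impl block and leaving every call site untouched.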

Two rules to copy from the example

  • Fail loudly on anything you can't represent. EchoAI rejects multimodal rather than silently flattening it away.
  • Override emits_stream_usage to false if your server never sends a trailing usage chunk, or a streaming call will wait for it until the idle timeout.

Claude Subscription

Use your Claude Pro/Max subscription instead of a pay-as-you-go API key. A subscription OAuth token authenticates against the same native Anthropic API as an API key, but draws on your subscription's rolling quota (the 5-hour / 7-day window) rather than API billing.

#![allow(unused)]
fn main() {
use minillmlib::{ChatNode, GeneratorInfo, TokenPrice};

async fn run() -> minillmlib::Result<()> {
    // Anthropic returns token counts but no dollar cost, so set a price for a
    // resolved cost ESTIMATE (otherwise tracking reports `Unpriced`).
    let generator = GeneratorInfo::claude_subscription("claude-haiku-4-5")
        .with_token_price(TokenPrice::new(1.0, 5.0)); // $/Mtok in, $/Mtok out

    let root = ChatNode::root("You are helpful.");
    let response = root.chat("Hello!", &generator).await?;
    let _ = response;
    Ok(())
}
}

How the token is resolved

claude_subscription resolves the bearer token in this order:

  1. the ANTHROPIC_AUTH_TOKEN env var, if set (explicit override; you keep it fresh, e.g. from claude setup-token);
  2. otherwise the live Claude Code credential at ~/.claude/.credentials.json (claudeAiOauth.accessToken), which Claude Code keeps refreshed, so if you're logged into Claude Code with your subscription, it just works.

If neither source yields a token, the request fails loudly as unauthenticated rather than silently using the wrong account.
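
The lookup order above amounts to "first source that yields a token wins, otherwise fail". Here is a minimal sketch of that logic; `resolve_token` is a hypothetical helper written for this illustration (not a minillmlib function), and reading the credentials file is left out.

```rust
// Hypothetical helper illustrating the lookup order; NOT minillmlib's code.
// `env_token` stands for ANTHROPIC_AUTH_TOKEN, `credential_token` for the
// claudeAiOauth.accessToken value read from ~/.claude/.credentials.json.
fn resolve_token(
    env_token: Option<&str>,
    credential_token: Option<&str>,
) -> Result<String, String> {
    env_token
        .or(credential_token) // the env var is the explicit override
        .map(str::to_owned)
        .ok_or_else(|| "unauthenticated: no subscription token found".to_string())
}

fn main() {
    let env = std::env::var("ANTHROPIC_AUTH_TOKEN").ok();
    // Parsing the credentials file is elided here; a real caller would pass
    // the token it found there as the second argument.
    match resolve_token(env.as_deref(), None) {
        Ok(_) => println!("token found"),
        Err(e) => eprintln!("{e}"),
    }
}
```

Returning an error when both sources are empty matches the fail-loudly behavior described above: there is no fallback to some other credential.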

Subscription vs Console

A subscription token (from Claude Code) bills your Pro/Max plan. A Console/API OAuth token bills your API account, not the subscription. For Console, use an API key via GeneratorInfo::anthropic(model); use this preset only for the actual Pro/Max subscription token.

Cost is always an ESTIMATE here: Anthropic returns only token counts, so the TokenPrice you set (reflecting the model's published price) produces a Resolved USD estimate; without it, tracking reports Unpriced.

API Reference

The full, auto-generated API reference (every public type, method, and signature) lives on docs.rs:

https://docs.rs/minillmlib

It is generated from the source doc comments and rebuilt automatically when a new version is published to crates.io. This guide covers how to use the library; the docs.rs reference is the exhaustive signature lookup.