- Status: Proposed
- Date: 2026-01-20
- Decision Makers: Ruvector Architecture Team
- Technical Area: LLM Generation / Structured Output
RuvLLM v2.3 provides robust text generation capabilities but lacks structured output enforcement, which is critical for production agentic workflows. Modern frameworks (LangChain, CrewAI, Claude Flow, AutoGen) rely on LLMs producing valid JSON for tool use, function calling, and structured data extraction. Without JSON mode support, RuvLLM cannot reliably power these workflows.
RuvLLM's existing generate interface returns unstructured text:
pub trait LlmBackend {
    fn generate(&self, prompt: &str, params: GenerateParams) -> Result<String>;
    fn generate_stream(&self, prompt: &str, params: GenerateParams) -> impl Stream<Item = String>;
}

Users requesting JSON output face:
- Malformed JSON: Models generate invalid JSON (~5-15% failure rate even with prompting)
- No schema validation: Output may be valid JSON but violate expected structure
- Post-processing overhead: Parsing, validation, and error handling must be manual
- Retry complexity: Applications must implement retry loops with repair attempts
- Agentic Framework Integration: LangChain, CrewAI, Claude Flow require guaranteed JSON for tool/function calling
- Production Reliability: 95%+ success rate needed; current prompting-based approaches achieve 85-95%
- Schema Enforcement: Output must conform to JSON Schema or Pydantic models
- Performance: Constrained decoding adds computational overhead to generation
Without JSON Mode:
# Current unreliable workflow
response = llm.generate("Extract person info as JSON: {text}")
try:
    data = json.loads(response)  # May fail: malformed JSON
    assert "name" in data  # May fail: missing field
    assert "age" in data  # May fail: missing field
except (json.JSONDecodeError, AssertionError):
    # Retry with prompt engineering, repair attempts, etc.
    pass

With JSON Mode:
# Reliable workflow with schema
schema = {"type": "object", "properties": {"name": {"type": "string"}, "age": {"type": "integer"}}, "required": ["name", "age"]}
response = llm.generate_json("Extract person info: {text}", schema=schema)
# Guaranteed valid JSON conforming to schema

- 99%+ valid JSON: Eliminate malformed JSON failures
- Schema conformance: Guarantee output matches expected structure
- Graceful degradation: Repair mode for minor violations vs strict failure
- Minimal overhead: <10% latency increase for JSON mode
- Streaming compatible: Support streaming JSON generation
- Scalable: Constrained decoding must work with large vocabularies (32K-128K tokens)
- Framework integration: Compatible with LangChain, CrewAI, Claude Flow tool use
- Schema standards: Support JSON Schema, Pydantic models, TypeScript interfaces
- Backward compatibility: Existing `generate` interface unchanged
- Simple API: Single parameter enables JSON mode
- Validation feedback: Clear error messages on schema violations
- Grammar flexibility: Support custom grammars for domain-specific formats
Option A (post-generation validation): Validate and repair JSON after generation completes.
Pros:
- Zero generation overhead
- Simple implementation
- Works with any model
Cons:
- Does not prevent invalid JSON (still 5-15% failures)
- Repair attempts may fail or produce incorrect data
- Wasted compute on failed generations
- Requires retry loops
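
The retry loop that this option pushes onto callers might look roughly like the sketch below. It is a std-only illustration, not RuvLLM code: `generate_with_retries`, the `generate` and `is_valid` closures, and `looks_like_json_object` are hypothetical names, and a real validator would parse the output (for example with `serde_json`) instead of checking the outer braces.

```rust
/// Hypothetical helper: call `generate` until `is_valid` accepts the output
/// or `max_attempts` is exhausted. Returns the accepted output together with
/// the number of attempts that were needed.
pub fn generate_with_retries<G, V>(
    mut generate: G,
    is_valid: V,
    max_attempts: usize,
) -> Result<(String, usize), String>
where
    G: FnMut(usize) -> String,
    V: Fn(&str) -> bool,
{
    for attempt in 1..=max_attempts {
        let output = generate(attempt);
        if is_valid(&output) {
            return Ok((output, attempt));
        }
        // Each rejected attempt is a full generation whose compute is wasted.
    }
    Err(format!("no valid output after {} attempts", max_attempts))
}

/// Deliberately crude stand-in validator (assumption): it only checks that
/// the trimmed text starts with '{' and ends with '}'.
pub fn looks_like_json_object(text: &str) -> bool {
    let trimmed = text.trim();
    trimmed.starts_with('{') && trimmed.ends_with('}')
}
```

Every failed attempt costs a complete generation, which is the "wasted compute" drawback listed above.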
Option B (constrained decoding): Modify logits during generation to enforce JSON grammar at each token.
Pros:
- Syntactically valid JSON by construction when generation runs to completion (output truncated by `max_tokens` can still be incomplete)
- No retry loops needed
- Works with streaming generation
- Can enforce complex grammars
Cons:
- 5-10% latency overhead per token
- Implementation complexity (state machine for JSON structure)
- Requires access to model logits
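
The core of token-level constraint enforcement can be sketched in a few lines. This is a minimal illustration under stated assumptions: `mask_logits` and `argmax` are hypothetical helper names, token IDs are plain vector indices, and greedy selection stands in for whatever sampler the backend actually uses.

```rust
/// Mask every logit whose token ID is not in `allowed`, keeping the model's
/// original scores for the allowed tokens so their relative ranking survives.
pub fn mask_logits(logits: &mut [f32], allowed: &[usize]) {
    for (id, logit) in logits.iter_mut().enumerate() {
        if !allowed.contains(&id) {
            *logit = f32::NEG_INFINITY;
        }
    }
}

/// Greedy pick: index of the largest unmasked logit, or `None` when every
/// token is masked (a caller should treat that as a constraint error).
pub fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<usize> = None;
    for (id, &logit) in logits.iter().enumerate() {
        if logit == f32::NEG_INFINITY || logit.is_nan() {
            continue;
        }
        match best {
            Some(current) if logits[current] >= logit => {}
            _ => best = Some(id),
        }
    }
    best
}
```

Masking only the disallowed tokens keeps the model's preference among the allowed ones, which matters once more than one token is permitted (for example inside a string).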
Option C (fine-tuned JSON models): Train separate model checkpoints optimized for JSON output.
Pros:
- Best performance (native JSON understanding)
- No generation overhead
- Highest quality output
Cons:
- Requires training infrastructure
- Multiple model variants to maintain
- Does not generalize to custom schemas
- High storage/deployment cost
Chosen Option: Option B - Constrained Decoding with Optional Post-Validation
Implement token-level constrained decoding as the primary JSON mode, with optional post-generation validation for models without logit access. This provides guaranteed JSON validity with acceptable performance overhead.
- Reliability first: Agentic workflows require 99%+ success rates; only constrained decoding guarantees this
- Framework compatibility: LangChain, CrewAI, Claude Flow expect reliable JSON mode
- Streaming support: Constrained decoding works with streaming generation
- Graceful fallback: Post-validation mode for models/backends without logit access
- Industry standard: Matches llama.cpp (GBNF), Outlines, guidance library approaches
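
The fallback rule described above can be written as a small decision function. The sketch below is illustrative only: `DecodingMode`, `select_mode`, and the `backend_exposes_logits` flag are hypothetical names, and `json_mode` is reduced to the `constrained_decoding` flag of the proposed configuration.

```rust
/// Hypothetical view of how a request is served.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DecodingMode {
    /// No JSON mode requested: existing text generation path.
    PlainText,
    /// Token-level constrained decoding.
    Constrained,
    /// Generate freely, then validate (and optionally repair).
    PostValidation,
}

/// `json_mode` is `None` when JSON mode is off, otherwise it carries the
/// `constrained_decoding` flag from the configuration.
pub fn select_mode(json_mode: Option<bool>, backend_exposes_logits: bool) -> DecodingMode {
    match json_mode {
        None => DecodingMode::PlainText,
        Some(true) if backend_exposes_logits => DecodingMode::Constrained,
        // Constrained decoding disabled, or the backend hides its logits.
        Some(_) => DecodingMode::PostValidation,
    }
}
```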
/// JSON Mode configuration for structured output
#[derive(Debug, Clone)]
pub struct JsonModeConfig {
/// Optional JSON Schema for validation
pub schema: Option<JsonSchema>,
/// Strict mode: fail on invalid JSON (vs repair attempts)
pub strict: bool,
/// Repair mode: attempt to fix malformed JSON
pub repair: bool,
/// Grammar file for custom structured formats (GBNF-compatible)
pub grammar: Option<String>,
/// Enable constrained decoding (vs post-validation only)
pub constrained_decoding: bool,
}
impl Default for JsonModeConfig {
fn default() -> Self {
Self {
schema: None,
strict: true,
repair: false,
grammar: None,
constrained_decoding: true,
}
}
}
/// Extended generation parameters with JSON mode
#[derive(Debug, Clone)]
pub struct GenerateParams {
// Existing fields
pub max_tokens: usize,
pub temperature: f32,
pub top_p: f32,
// New JSON mode
pub json_mode: Option<JsonModeConfig>,
}
/// LLM Backend trait with JSON mode support
pub trait LlmBackend {
/// Existing text generation
fn generate(&self, prompt: &str, params: GenerateParams) -> Result<String>;
/// JSON-structured generation (convenience wrapper)
fn generate_json(
&self,
prompt: &str,
schema: Option<JsonSchema>,
params: GenerateParams
) -> Result<serde_json::Value> {
let mut json_params = params; // `params` is already owned, so no clone is needed
json_params.json_mode = Some(JsonModeConfig {
schema,
..Default::default()
});
let output = self.generate(prompt, json_params)?;
serde_json::from_str(&output)
.map_err(|e| Error::msg(format!("Invalid JSON output: {}", e)))
}
/// Streaming generation with JSON mode
fn generate_stream(
&self,
prompt: &str,
params: GenerateParams
) -> impl Stream<Item = Result<String>>;
}

use schemars::schema::RootSchema;
use serde_json::Value;
/// JSON Schema for validation
#[derive(Debug, Clone)]
pub struct JsonSchema {
/// JSON Schema specification (Draft 7 or 2020-12)
pub schema: RootSchema,
}
impl JsonSchema {
/// Create from JSON Schema string
pub fn from_str(schema_json: &str) -> Result<Self> {
let schema: RootSchema = serde_json::from_str(schema_json)?;
Ok(Self { schema })
}
/// Create from Pydantic-style Rust struct
pub fn from_type<T: schemars::JsonSchema>() -> Self {
let schema = schemars::schema_for!(T);
Self { schema }
}
/// Validate JSON value against schema
pub fn validate(&self, value: &Value) -> Result<()> {
let validator = jsonschema::validator_for(&serde_json::to_value(&self.schema)?)?;
validator.validate(value)
.map_err(|e| Error::msg(format!("Schema validation failed: {}", e)))
}
}

/// Token-level JSON constraint enforcer
pub struct JsonConstraintDecoder {
/// Current state in JSON grammar (object, array, key, value, etc.)
state: JsonState,
/// Stack of open structures (brackets, braces)
structure_stack: Vec<StructureType>,
/// Expected schema at current position
schema_context: Option<SchemaNode>,
}
#[derive(Debug, Clone, Copy, PartialEq)]
enum JsonState {
Start,
ObjectStart,
ObjectKey,
ObjectColon,
ObjectValue,
ArrayStart,
ArrayValue,
String,
Number,
Boolean,
Null,
End,
}
#[derive(Debug, Clone, Copy, PartialEq)]
enum StructureType {
Object,
Array,
}
impl JsonConstraintDecoder {
/// Apply logit bias based on current state
pub fn apply_constraints(&mut self, logits: &mut [f32], vocab: &Vocabulary) -> Result<()> {
match self.state {
JsonState::Start => {
// Only allow '{' or '['
self.mask_except(logits, vocab, &["{", "["])?;
}
JsonState::ObjectStart => {
// Allow '"' for key or '}' for empty object
self.mask_except(logits, vocab, &["\"", "}"])?;
}
JsonState::ObjectKey => {
// Must be string token (continue string or close with ")
self.allow_string_tokens(logits, vocab)?;
}
JsonState::ObjectColon => {
// Must be ':'
self.mask_except(logits, vocab, &[":"])?;
}
JsonState::ObjectValue => {
// Allow any valid JSON value start
self.allow_value_start(logits, vocab)?;
}
JsonState::ArrayValue => {
// Allow any valid JSON value start or ']' to close
self.allow_value_start(logits, vocab)?;
self.allow_token(logits, vocab, "]")?;
}
// ... other states
_ => {}
}
Ok(())
}
/// Update state based on generated token
pub fn update_state(&mut self, token: &str) -> Result<()> {
match (self.state, token) {
(JsonState::Start, "{") => {
self.structure_stack.push(StructureType::Object);
self.state = JsonState::ObjectStart;
}
(JsonState::Start, "[") => {
self.structure_stack.push(StructureType::Array);
self.state = JsonState::ArrayStart;
}
(JsonState::ObjectStart, "\"") => {
self.state = JsonState::ObjectKey;
}
(JsonState::ObjectKey, "\"") => {
self.state = JsonState::ObjectColon;
}
// ... state transitions
_ => return Err(Error::msg("Invalid JSON token sequence"))
}
Ok(())
}
/// Check if generation is complete
pub fn is_complete(&self) -> bool {
self.state == JsonState::End && self.structure_stack.is_empty()
}
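    fn mask_except(&self, logits: &mut [f32], vocab: &Vocabulary, allowed: &[&str]) -> Result<()> {
        // Resolve allowed tokens to vocabulary IDs; tokens missing from the vocabulary are skipped
        let allowed_ids: std::collections::HashSet<usize> = allowed
            .iter()
            .filter_map(|token| vocab.token_to_id(token))
            .collect();
        if allowed_ids.is_empty() {
            // Masking everything would leave no token to sample
            return Err(Error::msg("No allowed token exists in the vocabulary"));
        }
        // Set disallowed logits to -inf, keeping the model's own scores for allowed tokens
        for (id, logit) in logits.iter_mut().enumerate() {
            if !allowed_ids.contains(&id) {
                *logit = f32::NEG_INFINITY;
            }
        }
        Ok(())
    }
}

impl JsonConstraintDecoder {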
/// Apply schema constraints at current position
fn apply_schema_constraints(&mut self, logits: &mut [f32], vocab: &Vocabulary) -> Result<()> {
if let Some(schema) = &self.schema_context {
match schema {
SchemaNode::String => {
// Only allow string tokens
self.allow_string_tokens(logits, vocab)?;
}
SchemaNode::Integer => {
// Only allow numeric tokens (no decimal point)
self.allow_integer_tokens(logits, vocab)?;
}
SchemaNode::Boolean => {
// Only allow 'true' or 'false'
self.mask_except(logits, vocab, &["true", "false"])?;
}
SchemaNode::Enum(values) => {
// Only allow tokens from enum values
let allowed: Vec<&str> = values.iter().map(|s| s.as_str()).collect();
self.mask_except(logits, vocab, &allowed)?;
}
SchemaNode::Object(props) => {
// Only allow property names from schema
let allowed: Vec<&str> = props.keys().map(|s| s.as_str()).collect();
self.allow_tokens(logits, vocab, &allowed)?;
}
// ... other schema types
}
}
Ok(())
}
}

/// GBNF (llama.cpp) compatible grammar
#[derive(Debug, Clone)]
pub struct Grammar {
/// Grammar rules in GBNF format
rules: HashMap<String, GrammarRule>,
/// Start rule name
start: String,
}
#[derive(Debug, Clone)]
enum GrammarRule {
/// Terminal: exact string match
Terminal(String),
/// Reference to another rule
Reference(String),
/// Sequence: rules in order
Sequence(Vec<GrammarRule>),
/// Choice: one of multiple rules
Choice(Vec<GrammarRule>),
/// Optional: zero or one
Optional(Box<GrammarRule>),
/// Repeat: zero or more
Repeat(Box<GrammarRule>),
}
impl Grammar {
/// Parse GBNF grammar string
pub fn from_gbnf(grammar_str: &str) -> Result<Self> {
// Parse GBNF format (similar to llama.cpp)
// Example:
// root ::= object
// object ::= "{" ws members ws "}"
// members ::= pair (ws "," ws pair)*
// pair ::= string ws ":" ws value
// ...
todo!("GBNF parser implementation")
}
/// Create JSON grammar
pub fn json() -> Self {
// Built-in JSON grammar
todo!("Built-in JSON grammar")
}
/// Apply grammar constraints to logits
    pub fn apply_constraints(
        &self,
        current_state: &GrammarState,
        logits: &mut [f32],
        vocab: &Vocabulary,
    ) -> Result<()> {
        // Determine valid next tokens based on grammar state
        let valid_tokens = self.get_valid_tokens(current_state)?;
        // Resolve valid tokens to vocabulary IDs
        let valid_ids: std::collections::HashSet<usize> = valid_tokens
            .iter()
            .filter_map(|token| vocab.token_to_id(token))
            .collect();
        // Mask invalid tokens, keeping the model's own logits for valid ones
        for (id, logit) in logits.iter_mut().enumerate() {
            if !valid_ids.contains(&id) {
                *logit = f32::NEG_INFINITY;
            }
        }
        Ok(())
    }
}

/// JSON repair and validation (for backends without logit access)
pub struct JsonValidator {
schema: Option<JsonSchema>,
strict: bool,
repair: bool,
}
impl JsonValidator {
/// Validate and optionally repair JSON output
    pub fn validate(&self, output: &str) -> Result<String> {
        // Attempt to parse JSON
        match serde_json::from_str::<Value>(output) {
            Ok(value) => {
                // Valid JSON, check schema
                if let Some(schema) = &self.schema {
                    schema.validate(&value)?;
                }
                Ok(output.to_string())
            }
            Err(_) if self.repair => {
                // Attempt repair, then hold the repaired value to the same schema
                let repaired = self.repair_json(output)?;
                if let Some(schema) = &self.schema {
                    let value: Value = serde_json::from_str(&repaired)?;
                    schema.validate(&value)?;
                }
                Ok(repaired)
            }
            Err(e) if self.strict => {
                Err(Error::msg(format!("Invalid JSON: {}", e)))
            }
            Err(_) => {
                // Non-strict mode: return as-is; the caller receives unvalidated text
                Ok(output.to_string())
            }
        }
    }
    fn repair_json(&self, output: &str) -> Result<String> {
        // Repairs implemented here:
        // 1. Remove markdown code fences
        // 2. Add missing closing braces/brackets in nesting order
        // Not yet handled: trailing commas, unescaped quotes
        let mut repaired = output
            .trim()
            .trim_start_matches("```json")
            .trim_start_matches("```")
            .trim_end_matches("```")
            .trim()
            .to_string();
        // Track unclosed structures, skipping braces and brackets inside strings
        let mut closers: Vec<char> = Vec::new();
        let (mut in_string, mut escaped) = (false, false);
        for c in repaired.chars() {
            if in_string {
                if escaped {
                    escaped = false;
                } else if c == '\\' {
                    escaped = true;
                } else if c == '"' {
                    in_string = false;
                }
                continue;
            }
            match c {
                '"' => in_string = true,
                '{' => closers.push('}'),
                '[' => closers.push(']'),
                '}' | ']' => {
                    closers.pop();
                }
                _ => {}
            }
        }
        // Append missing closers, innermost first
        while let Some(closer) = closers.pop() {
            repaired.push(closer);
        }
        // Validate repaired JSON
        serde_json::from_str::<Value>(&repaired)
            .map(|_| repaired)
            .map_err(|e| Error::msg(format!("Repair failed: {}", e)))
    }
}

Effort: 2-3 days
- Implement `JsonModeConfig` and `JsonSchema` types
- Add `json_mode` field to `GenerateParams`
- Implement post-generation validation with `JsonValidator`
- Add `generate_json` convenience method
- Tests for validation and repair
Deliverables:
- Post-validation JSON mode working with all backends
- Schema validation with JSON Schema Draft 7
- Basic repair for common issues
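
One of the "common issues" is output that stops before all structures are closed. The sketch below shows one way to append the missing closers in the correct nesting order while ignoring braces that appear inside strings. It is a std-only illustration: `close_open_structures` is a hypothetical name, and the function neither closes unterminated strings nor fixes trailing commas.

```rust
/// Append the closing braces/brackets that `input` is missing, innermost
/// first. Braces and brackets inside string literals are ignored.
pub fn close_open_structures(input: &str) -> String {
    let mut closers: Vec<char> = Vec::new();
    let mut in_string = false;
    let mut escaped = false;
    for c in input.chars() {
        if in_string {
            if escaped {
                escaped = false; // character after a backslash is literal
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_string = false;
            }
            continue;
        }
        match c {
            '"' => in_string = true,
            '{' => closers.push('}'),
            '[' => closers.push(']'),
            '}' | ']' => {
                closers.pop();
            }
            _ => {}
        }
    }
    let mut repaired = input.to_string();
    // Pop from the stack so the most recently opened structure closes first.
    while let Some(closer) = closers.pop() {
        repaired.push(closer);
    }
    repaired
}
```

Counting opening and closing characters separately is not enough here: for `{"a": [1, 2` the bracket must be closed before the brace, which is why a stack is used.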
Effort: 5-7 days
- Implement `JsonConstraintDecoder` state machine
- Integrate with Candle backend logit processing
- Add schema-aware constraints
- Streaming support for JSON mode
- Benchmark performance overhead
Deliverables:
- Constrained decoding for Candle backend
- 99%+ valid JSON success rate
- <10% latency overhead
- Streaming JSON generation
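
For streaming, the decoder state has to survive chunk boundaries, because a chunk can end in the middle of a string or a nested structure. The following sketch tracks only the structural part of that state. It is an assumption-laden illustration: `StructureTracker` is a hypothetical type, it works on characters instead of tokenizer tokens, and it does not check keys, colons, commas, or scalar values.

```rust
/// Tracks open JSON structures across streamed chunks.
#[derive(Debug, Default)]
pub struct StructureTracker {
    /// Closers still expected, innermost last.
    stack: Vec<char>,
    in_string: bool,
    escaped: bool,
    /// True once the first '{' or '[' has been seen.
    started: bool,
}

impl StructureTracker {
    pub fn new() -> Self {
        Self::default()
    }

    /// Consume one streamed chunk; returns an error on a mismatched closer.
    pub fn feed(&mut self, chunk: &str) -> Result<(), String> {
        for c in chunk.chars() {
            if self.in_string {
                if self.escaped {
                    self.escaped = false;
                } else if c == '\\' {
                    self.escaped = true;
                } else if c == '"' {
                    self.in_string = false;
                }
                continue;
            }
            match c {
                '"' => self.in_string = true,
                '{' => {
                    self.stack.push('}');
                    self.started = true;
                }
                '[' => {
                    self.stack.push(']');
                    self.started = true;
                }
                '}' | ']' => {
                    if self.stack.pop() != Some(c) {
                        return Err(format!("unexpected '{}'", c));
                    }
                }
                _ => {}
            }
        }
        Ok(())
    }

    /// True once a top-level structure has been opened and fully closed.
    pub fn is_complete(&self) -> bool {
        self.started && self.stack.is_empty() && !self.in_string
    }
}
```

A streaming loop could call `feed` on each chunk and stop generation as soon as `is_complete` returns true.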
Effort: 7-10 days
- Implement GBNF grammar parser
- Build grammar state machine
- Create built-in grammars (JSON, JSONL, CSV, XML)
- Custom grammar API
- Grammar compilation and optimization
Deliverables:
- GBNF-compatible grammar system
- Built-in grammars for common formats
- Custom grammar support
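
To make the rule types concrete, here is a small recognizer over the same six rule shapes as the proposed `GrammarRule` enum. It is a sketch under assumptions: `Rule`, `match_rule`, and `accepts` are hypothetical names, it checks a complete string instead of constraining tokens incrementally, and it does not terminate on left-recursive grammars.

```rust
use std::collections::{BTreeSet, HashMap};

/// Mirrors the rule shapes of the proposed `GrammarRule` enum.
#[derive(Debug, Clone)]
pub enum Rule {
    Terminal(String),
    Reference(String),
    Sequence(Vec<Rule>),
    Choice(Vec<Rule>),
    Optional(Box<Rule>),
    Repeat(Box<Rule>),
}

/// Return every byte offset at which `rule` can end when started at `pos`.
pub fn match_rule(
    rule: &Rule,
    rules: &HashMap<String, Rule>,
    input: &str,
    pos: usize,
) -> BTreeSet<usize> {
    let mut ends = BTreeSet::new();
    match rule {
        Rule::Terminal(text) => {
            if input[pos..].starts_with(text.as_str()) {
                ends.insert(pos + text.len());
            }
        }
        Rule::Reference(name) => {
            // Unknown rule names simply match nothing.
            if let Some(target) = rules.get(name) {
                ends = match_rule(target, rules, input, pos);
            }
        }
        Rule::Sequence(parts) => {
            ends.insert(pos);
            for part in parts {
                let mut next = BTreeSet::new();
                for &p in &ends {
                    next.extend(match_rule(part, rules, input, p));
                }
                ends = next;
            }
        }
        Rule::Choice(options) => {
            for option in options {
                ends.extend(match_rule(option, rules, input, pos));
            }
        }
        Rule::Optional(inner) => {
            ends.insert(pos);
            ends.extend(match_rule(inner, rules, input, pos));
        }
        Rule::Repeat(inner) => {
            // Zero or more: explore newly reached offsets until none are new.
            ends.insert(pos);
            let mut frontier = vec![pos];
            while let Some(p) = frontier.pop() {
                for q in match_rule(inner, rules, input, p) {
                    if ends.insert(q) {
                        frontier.push(q);
                    }
                }
            }
        }
    }
    ends
}

/// True if the rule named `start` matches the whole input.
pub fn accepts(start: &str, rules: &HashMap<String, Rule>, input: &str) -> bool {
    let root = Rule::Reference(start.to_string());
    match_rule(&root, rules, input, 0).contains(&input.len())
}
```

A constrained decoder would apply the same rule semantics incrementally, keeping the set of reachable grammar positions after each token and masking tokens that lead to an empty set.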
Effort: 3-5 days
- Integrate with mistral-rs backend (ADR-008)
- Framework adapters (LangChain, CrewAI)
- Performance optimization (caching valid tokens)
- Documentation and examples
Deliverables:
- Framework integration examples
- Optimized constraint checking
- Comprehensive documentation
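
The "caching valid tokens" optimization could be as simple as memoizing the allowed token IDs per decoder state, so the vocabulary is scanned once per state instead of once per generated token. The sketch below assumes a tiny three-state machine and a vocabulary given as a slice of strings; `State`, `ValidTokenCache`, and `allowed_in` are hypothetical names.

```rust
use std::collections::HashMap;

/// Reduced set of decoder states, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum State {
    Start,
    ObjectStart,
    ObjectColon,
}

/// Which single-character tokens are legal in each state.
fn allowed_in(state: State, token: &str) -> bool {
    match state {
        State::Start => token == "{" || token == "[",
        State::ObjectStart => token == "\"" || token == "}",
        State::ObjectColon => token == ":",
    }
}

/// Memoizes the valid token IDs for each state.
pub struct ValidTokenCache {
    cache: HashMap<State, Vec<usize>>,
    /// Number of full vocabulary scans performed so far.
    pub misses: usize,
}

impl ValidTokenCache {
    pub fn new() -> Self {
        ValidTokenCache { cache: HashMap::new(), misses: 0 }
    }

    /// Return the cached IDs for `state`, scanning `vocab` on first use.
    pub fn valid_ids(&mut self, state: State, vocab: &[&str]) -> &[usize] {
        if !self.cache.contains_key(&state) {
            self.misses += 1;
            let mut ids = Vec::new();
            for (id, token) in vocab.iter().enumerate() {
                if allowed_in(state, token) {
                    ids.push(id);
                }
            }
            self.cache.insert(state, ids);
        }
        &self.cache[&state]
    }
}
```

This only works for states whose valid set does not depend on generated content; string and schema-dependent states would need a finer cache key.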
| Mode | Overhead | Notes |
|---|---|---|
| No JSON mode | 0% | Baseline |
| Post-validation only | <1% | Validation after generation |
| Constrained decoding | 5-10% | Per-token logit masking |
| Grammar-based | 8-12% | Complex grammar state machine |
| Component | Memory | Notes |
|---|---|---|
| JSON state machine | ~1KB | Negligible |
| Schema tree | 10-100KB | Depends on schema complexity |
| Grammar rules | 50-500KB | GBNF grammar compilation |
| Valid token cache | 100-500KB | Per-state valid token sets |
| Method | Valid JSON Rate | Schema Conformance |
|---|---|---|
| Prompt engineering only | 85-95% | 70-85% |
| Post-validation + repair | 95-98% | 85-95% |
| Constrained decoding | 99.9%+ | 99%+ |
- Production reliability: 99%+ success rate enables reliable agentic workflows
- Framework compatibility: Direct integration with LangChain, CrewAI, Claude Flow
- Developer experience: Simple API eliminates retry loops and error handling
- Streaming support: JSON mode works with streaming generation
- Future extensibility: Grammar support enables custom structured formats
- Performance overhead: 5-10% latency increase for constrained decoding
- Implementation complexity: State machine and grammar parsing add code complexity
- Backend limitations: Not all backends support logit access (fallback to post-validation)
- Token vocabulary dependency: Constraint effectiveness depends on tokenizer granularity
- Optional feature: JSON mode is opt-in via `GenerateParams`
- Graceful degradation: Falls back to post-validation for unsupported backends
- Schema flexibility: Supports JSON Schema, Pydantic, and custom grammars
| Risk | Mitigation |
|---|---|
| High latency overhead | Cache valid token sets per state; optimize state transitions |
| Complex grammar bugs | Extensive test suite with fuzzing; start with simple JSON grammar |
| Tokenizer edge cases | Handle subword tokens; fallback to character-level constraints |
| Schema complexity | Limit schema depth; provide performance warnings for complex schemas |
- Rejected: 85-95% success rate insufficient for production
- Consideration: Still useful as complementary technique
- Rejected: Requires separate models; doesn't generalize to custom schemas
- Consideration: Could offer as optimization for common cases
- Rejected: Adds network latency; doesn't prevent generation failures
- Consideration: Could integrate as async validation for auditing
- ADR-001: Ruvector Core Architecture (HNSW, Graph Store)
- ADR-002: RuvLLM Integration with Ruvector
- ADR-007: Security Review & Technical Debt
- ADR-008: mistral-rs Integration for Production-Scale LLM Serving
- JSON Schema Draft 7 (primary support)
- JSON Schema 2020-12 (future)
- Pydantic model compatibility
- GBNF (llama.cpp) compatibility
- EBNF subset for custom grammars
- Regex-based constraints (limited support)
- LangChain StructuredOutputParser
- CrewAI tool schemas
- Claude Flow structured outputs
- AutoGen function calling
- Unit tests for state machine transitions
- Integration tests with sample schemas
- Fuzzing for grammar parser
- Benchmark suite for performance
- Framework integration tests
- JSON mode API guide
- Schema definition tutorial
- Grammar syntax reference
- Framework integration examples
- Performance optimization guide
- llama.cpp GBNF: https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md
- Outlines Library: https://github.com/outlines-dev/outlines - Structured text generation
- Guidance Library: https://github.com/guidance-ai/guidance - Constrained generation
- JSON Schema: https://json-schema.org/specification
- LangChain StructuredOutput: https://python.langchain.com/docs/modules/model_io/output_parsers/structured
- OpenAI JSON Mode: https://platform.openai.com/docs/guides/structured-outputs
- Anthropic Tool Use: https://docs.anthropic.com/en/docs/build-with-claude/tool-use
| Component | Status | Effort | Notes |
|---|---|---|---|
| JsonModeConfig types | Pending | 0.5 days | Basic config structures |
| JsonSchema validation | Pending | 1 day | JSON Schema Draft 7 support |
| Post-validation mode | Pending | 1 day | Fallback for all backends |
| JSON repair | Pending | 1 day | Common malformation fixes |
| JsonConstraintDecoder | Pending | 3 days | State machine for JSON grammar |
| Schema-aware constraints | Pending | 2 days | Schema-driven logit masking |
| Streaming JSON | Pending | 2 days | Stream-compatible constraints |
| GBNF parser | Pending | 5 days | Grammar definition language |
| Grammar state machine | Pending | 3 days | Generic grammar constraints |
| Built-in grammars | Pending | 2 days | JSON, JSONL, CSV, XML |
| Candle integration | Pending | 2 days | Wire to Candle backend |
| mistral-rs integration | Pending | 2 days | Wire to mistral-rs backend |
| Framework adapters | Pending | 3 days | LangChain, CrewAI examples |
| Performance optimization | Pending | 2 days | Token caching, fast paths |
| Documentation | Pending | 3 days | API guide, examples, tutorials |
Total Effort: ~30-35 days (1 developer)

Phased Delivery: 4-6 weeks
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-01-20 | Ruvector Architecture Team | Initial proposal |