Building My First Rust Project, rusty-llm-jury: Fixing Biased AI Judges with Rust
After years of building ML platforms in Python, I decided to dive into Rust. Not with toy examples or tutorials, but by porting a real project: rusty-llm-jury, a CLI tool for correcting bias in LLM judge evaluations.
The Original Project and Problem
The project builds on judgy, a Python library created by Shreya Shankar for her AI evals course. The problem it solves is fundamental to AI evaluation: when you use an LLM as a judge to evaluate other AI systems, how do you account for the judge's own biases and errors?
Here's the scenario - you're building a chatbot and want to evaluate its responses. You could have humans rate thousands of outputs, but that's expensive and slow. So you use GPT-4 as a judge. But GPT-4 itself has biases - maybe it's overly generous with creative responses, or tends to penalize certain writing styles. How do you estimate what the "true" performance would be with perfect judgment?
The Rogan-Gladen correction method addresses this. Originally developed for medical diagnostics, it works like this:
- Take a small sample of data and get both human labels (ground truth) and LLM judge predictions
- Calculate the judge's True Positive Rate (TPR) and True Negative Rate (TNR)
- Use these metrics to correct the bias in your larger, unlabeled dataset: θ̂ = (p_obs + TNR - 1) / (TPR + TNR - 1)
- Bootstrap sampling provides confidence intervals around your estimate
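The correction step can be sketched in a few lines of Rust. This is a standalone illustration, not judgy's actual API; I've added the degenerate-judge guard and the clamping to [0, 1] that any robust implementation needs:

```rust
/// Rogan-Gladen correction: recover the estimated true pass rate from the
/// observed pass rate and the judge's measured TPR/TNR.
/// Returns None when TPR + TNR <= 1, i.e. the judge is no better than chance.
fn rogan_gladen(p_obs: f64, tpr: f64, tnr: f64) -> Option<f64> {
    let denom = tpr + tnr - 1.0;
    if denom <= 0.0 {
        return None;
    }
    // theta_hat = (p_obs + TNR - 1) / (TPR + TNR - 1), clamped to [0, 1]
    Some(((p_obs + tnr - 1.0) / denom).clamp(0.0, 1.0))
}

fn main() {
    // A judge with TPR = 0.9 and TNR = 0.8 reports a 70% observed pass rate;
    // the corrected estimate is 0.5 / 0.7 ≈ 0.714.
    let theta = rogan_gladen(0.7, 0.9, 0.8).unwrap();
    println!("corrected estimate: {theta:.3}");
}
```

Note that the correction can move the estimate in either direction: a lenient judge (low TNR) inflates the observed pass rate, so the corrected value drops; a harsh judge (low TPR) deflates it, so the corrected value rises.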
The math is straightforward, but implementing it robustly - handling edge cases, file I/O, random sampling, and building a clean interface - made it perfect for learning Rust's strengths.
Why This Project for Learning Rust?
I chose to port judgy to Rust for several reasons, but the most important was project simplicity. The original Python implementation is clean and focused - it does one thing well without unnecessary complexity. This simplicity made it ideal for learning Rust fundamentals without getting lost in domain complexity.
The project hits the sweet spot for Rust learning:
- Clear, bounded scope: Statistical bias correction with a well-defined algorithm
- Real-world relevance: Solves an actual problem in AI evaluation pipelines
- Multiple Rust concepts: Memory management, error handling, CLI building, file I/O, numerical computing
- Manageable size: Large enough to be meaningful, small enough to understand completely
- Concrete deliverable: A working CLI tool you can actually use
Unlike toy problems or overly complex systems, this project let me focus on how to write Rust rather than what to build. The Python reference implementation provided a clear specification - I knew exactly what the Rust version should do, so I could concentrate on learning idiomatic Rust patterns.
Performance wasn't the primary motivation (though the bootstrap sampling did become dramatically faster). The goal was to understand Rust's approach to memory safety, error handling, and systems design through a practical lens.
First Contact with Ownership
Coming from Python's "everything is a reference" world, Rust's ownership system was jarring. This function signature tells the whole story:
```rust
pub fn estimate_success_rate(
    test_labels: &[u8],        // Borrowing, not owning
    test_preds: &[u8],         // These won't be copied
    unlabeled_preds: &[u8],    // Memory stays where it is
    bootstrap_iterations: usize,
    confidence_level: f64,
) -> Result<EstimationResult>  // This we own and return
```
In Python, I'd pass lists around without thinking. In Rust, that & means "I'm borrowing this data, I won't take ownership" - and the compiler guarantees the reference can't outlive the data it points to. No hidden copying, no garbage collector surprises.
The learning curve was steep. I spent hours fighting borrow checker errors that would have been non-issues in Python. But once it clicked, the confidence was real - if it compiles, memory is handled correctly.
Error Handling Changes Everything
Python's try/except felt loose after Rust's Result<T, E>. Every function that can fail must declare it:

```rust
use thiserror::Error;

#[derive(Error, Debug)]
pub enum JudgyError {
    #[error("Input validation error: {0}")]
    InputValidation(String),

    #[error("Judge accuracy too low: TPR + TNR = {tpr_plus_tnr:.3} <= 1")]
    JudgeAccuracyTooLow { tpr_plus_tnr: f64 },

    #[error("Bootstrap error: {0}")]
    Bootstrap(String),
}
```
The ? operator makes error propagation clean, but you can't ignore failures. Every error path must be handled. This felt restrictive at first - Python's "happy path" coding was so much faster to write. But the reliability gain is substantial. No more silent failures hiding in production.
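To make that propagation concrete, here's a dependency-free sketch in the same spirit (plain std instead of thiserror, and the helper names check_inputs, check_judge, and estimate are invented for illustration):

```rust
// Minimal sketch of `?`-based propagation, mirroring the error enum above
// but using plain std so it runs without dependencies.
#[derive(Debug, PartialEq)]
enum JudgyError {
    InputValidation(String),
    JudgeAccuracyTooLow { tpr_plus_tnr: f64 },
}

fn check_inputs(labels: &[u8], preds: &[u8]) -> Result<(), JudgyError> {
    if labels.len() != preds.len() {
        return Err(JudgyError::InputValidation("length mismatch".into()));
    }
    Ok(())
}

fn check_judge(tpr: f64, tnr: f64) -> Result<(), JudgyError> {
    if tpr + tnr <= 1.0 {
        return Err(JudgyError::JudgeAccuracyTooLow { tpr_plus_tnr: tpr + tnr });
    }
    Ok(())
}

// Each `?` returns early with the error; the happy path reads straight through.
fn estimate(labels: &[u8], preds: &[u8], tpr: f64, tnr: f64) -> Result<f64, JudgyError> {
    check_inputs(labels, preds)?;
    check_judge(tpr, tnr)?;
    let p_obs = preds.iter().filter(|&&p| p == 1).count() as f64 / preds.len() as f64;
    Ok((p_obs + tnr - 1.0) / (tpr + tnr - 1.0))
}

fn main() {
    // Mismatched lengths propagate out as Err, not as a hidden exception.
    assert!(estimate(&[1, 0], &[1], 0.9, 0.9).is_err());
    let theta = estimate(&[1, 0, 1, 0], &[1, 0, 1, 1], 0.9, 0.9).unwrap();
    println!("{theta:.4}"); // 0.8125
}
```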
Structure Emerges Naturally
Rust's module system guided me toward clean architecture:
```text
src/
├── lib.rs              // Public API
├── main.rs             // CLI entry point
├── bias_correction.rs  // Core algorithms
├── cli.rs              // Command parsing
├── synthetic.rs        // Data generation
├── utils.rs            // File I/O helpers
└── error.rs            // Error types
```
Each module has clear boundaries. The pub keyword controls what's exposed. Unlike Python where everything is potentially public, Rust makes you explicitly design your interfaces.
The project's simplicity meant I could focus on understanding these architectural patterns rather than managing complex domain logic. Each module had a single responsibility, making it easy to see how Rust's ownership and borrowing rules apply in practice.
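The visibility rules are easy to demonstrate in a single file (inline modules here; the real project uses the separate files above, and raw_pass_rate is a toy function, not project code):

```rust
// Only `pub` items escape a module; everything else stays internal.
mod bias_correction {
    /// Public: part of the crate's API surface.
    pub fn raw_pass_rate(preds: &[u8]) -> f64 {
        count_passes(preds) as f64 / preds.len() as f64
    }

    // Private helper: callers outside this module can't even name it.
    fn count_passes(preds: &[u8]) -> usize {
        preds.iter().filter(|&&p| p == 1).count()
    }
}

fn main() {
    println!("{}", bias_correction::raw_pass_rate(&[1, 1, 0, 1])); // 0.75
    // bias_correction::count_passes(&[1]); // compile error: function is private
}
```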
The Type System Saves You
This struct definition prevented runtime bugs:
```rust
pub struct EstimationResult {
    pub theta_hat: f64,
    pub lower_bound: f64,
    pub upper_bound: f64,
    pub confidence_level: f64,
    pub bootstrap_iterations: usize,
    pub judge_metrics: JudgeMetrics,
    pub raw_pass_rate: f64,
}
```
No more accidentally passing a string where a number belongs. No forgetting to check if a value is None. The compiler catches these at build time. Coming from dynamic typing, this felt constraining at first. But production debugging became much easier.
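A toy illustration of that last point (not project code): returning Option instead of a NaN or a sentinel value means the empty case can't be forgotten.

```rust
// A mean that can't silently divide by zero: the empty case lives in the type.
fn mean(xs: &[f64]) -> Option<f64> {
    if xs.is_empty() {
        None
    } else {
        Some(xs.iter().sum::<f64>() / xs.len() as f64)
    }
}

fn main() {
    // The compiler forces both branches to be handled before the value is used.
    match mean(&[0.7, 0.8, 0.9]) {
        Some(m) => println!("mean pass rate: {m:.2}"),
        None => println!("no data"),
    }
}
```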
CLI Development Surprises
Building command-line interfaces in Rust is pleasant. The clap crate with derive macros handles the heavy lifting:
```rust
use std::path::PathBuf;

use clap::Parser;

#[derive(Parser)]
pub struct EstimateArgs {
    #[arg(long, conflicts_with = "test_labels_file")]
    pub test_labels: Option<String>,

    #[arg(long, conflicts_with = "test_labels")]
    pub test_labels_file: Option<PathBuf>,
}
```
That conflicts_with attribute automatically prevents users from specifying both string data and file input. In Python, I'd write manual validation for this. Rust's macro system generates correct parsing code.
Performance Without Trying
The bootstrap sampling that took noticeable time in the original Python implementation became instant in Rust. But more importantly, performance was predictable. No GC pauses, no interpreter overhead, no surprises.
Random number generation with the rand crate felt natural:
```rust
use rand::prelude::*;

let mut rng = StdRng::from_entropy();
// Bootstrap resampling draws *with* replacement: pick test_size indices,
// repeats allowed (choose_multiple would sample without replacement).
let bootstrap_indices: Vec<usize> = (0..test_size)
    .map(|_| *indices.choose(&mut rng).expect("indices is non-empty"))
    .collect();
```
Iterator chains like this compile to efficient loops. Zero-cost abstractions work in practice.
Testing Integration
Rust's built-in testing was straightforward:
```rust
use approx::assert_relative_eq;

#[test]
fn test_judge_metrics_perfect_judge() {
    let test_labels = vec![1, 1, 1, 0, 0, 0];
    let test_preds = vec![1, 1, 1, 0, 0, 0];

    let metrics = JudgeMetrics::from_test_data(&test_labels, &test_preds).unwrap();
    assert_relative_eq!(metrics.tpr, 1.0);
    assert_relative_eq!(metrics.tnr, 1.0);
}
```
cargo test runs everything - unit tests, integration tests, doc tests. No separate test runner configuration. Test-driven development felt natural.
Ecosystem Quality
The crate ecosystem was solid. Each library follows Rust conventions:
- `clap` for CLI parsing
- `serde` for JSON serialization
- `thiserror` for error handling
- `rand` for random numbers
- `csv` for file parsing
They compose well together. No fighting mismatched interfaces or competing philosophies.
Tooling Excellence
The development experience rivals any language I've used:
- `cargo` handles everything: dependencies, building, testing, docs
- `rustfmt` enforces consistent style
- `clippy` catches subtle bugs and suggests improvements
- `rust-analyzer` provides excellent IDE support
My Makefile shows how these integrate:
```make
check: fmt-check clippy test
	@echo "✅ All checks passed!"
```
The Struggles
The borrow checker fought me constantly at first. Simple Python patterns required complete rethinking:
```rust
// This doesn't compile when process_data returns a value that keeps
// borrowing `data` mutably - two live mutable borrows at once:
// let result1 = process_data(&mut data);
// let result2 = process_data(&mut data);

// The first borrow has to end before the second begins:
let result1 = process_data(&mut data);
drop(result1); // explicitly end the first borrow
let result2 = process_data(&mut data);
```
Lifetime annotations confused me for weeks. Compilation took longer than Python's instant feedback loop. But each compile error taught me something.
What I Learned
Building this project taught me that Rust isn't just for systems programming. It's excellent for data processing, CLI tools, and production AI infrastructure. The safety guarantees and performance characteristics make it ideal for reliable evaluation pipelines.
The ownership system, once internalized, changes how you think about data flow. Explicit error handling makes code more robust. The type system catches entire classes of bugs before they reach production.
For AI engineers considering Rust: pick a real project that solves an actual problem, but keep it simple. The learning curve is steep, but the confidence you gain from the compiler's guarantees is worth the investment. Tools like rusty-llm-jury show how Rust can enhance the reliability of AI evaluation workflows without overwhelming complexity.
The project's simplicity was its greatest strength as a learning vehicle. I could focus on mastering Rust's unique concepts without getting bogged down in complex business logic. Every line of code taught me something about the language.
Every compile error was a lesson. Every successful build felt earned. That's Rust - challenging, but ultimately empowering.
Check out the complete rusty-llm-jury source to see these concepts in practice.