Running evaluations
This guide covers the fundamental usage patterns of KARMA for medical AI evaluation.
Evaluate Specific Datasets
```bash
# Single dataset
karma eval --model Qwen/Qwen3-0.6B --datasets openlifescienceai/pubmedqa

# Multiple datasets
karma eval --model Qwen/Qwen3-0.6B --datasets "openlifescienceai/pubmedqa,openlifescienceai/medmcqa,openlifescienceai/medqa"
```
Save Results
```bash
# Save to JSON file
karma eval --model Qwen/Qwen3-0.6B --output results.json

# Save to custom path
karma eval --model Qwen/Qwen3-0.6B --output /path/to/results.json
```
Working with Different Models
Built-in Models
KARMA includes several pre-configured models:
```bash
# Qwen models
karma eval --model Qwen/Qwen3-0.6B
karma eval --model Qwen/Qwen3-0.6B --model-path "Qwen/Qwen3-1.7B"

# MedGemma models
karma eval --model medgemma --model-path "google/medgemma-4b-it"
```
Custom Model Parameters
```bash
# Adjust generation parameters
karma eval --model Qwen/Qwen3-0.6B \
  --model-args '{"temperature":0.5,"max_tokens":512,"top_p":0.9}'

# Disable thinking mode (for Qwen)
karma eval --model Qwen/Qwen3-0.6B \
  --model-args '{"enable_thinking":false}'
```
Dataset Configuration
Dataset-Specific Arguments
Some datasets require additional configuration:
```bash
# Translation datasets with language pairs
karma eval --model Qwen/Qwen3-0.6B \
  --datasets "ai4bharat/IN22-Conv" \
  --dataset-args "ai4bharat/IN22-Conv:source_language=en,target_language=hi"

# Datasets with specific splits
karma eval --model Qwen/Qwen3-0.6B --datasets "openlifescienceai/medmcqa" \
  --dataset-args "openlifescienceai/medmcqa:split=validation"
```
Performance Optimization
Batch Processing
```bash
# Adjust batch size for your hardware
karma eval --model Qwen/Qwen3-0.6B --batch-size 8

# Smaller batch for limited memory
karma eval --model Qwen/Qwen3-0.6B --batch-size 2

# Larger batch for high-end hardware
karma eval --model Qwen/Qwen3-0.6B --batch-size 16
```
Caching
KARMA uses intelligent caching to speed up repeated evaluations:
```bash
# Use cache (default)
karma eval --model Qwen/Qwen3-0.6B --cache

# Force fresh evaluation
karma eval --model Qwen/Qwen3-0.6B --no-cache

# Refresh cache
karma eval --model Qwen/Qwen3-0.6B --refresh-cache
```
Understanding Results
Result Format
KARMA outputs comprehensive evaluation results:
```json
{
  "model": "qwen",
  "model_path": "Qwen/Qwen3-0.6B",
  "results": {
    "openlifescienceai/pubmedqa": {
      "metrics": {
        "exact_match": 0.745,
        "accuracy": 0.745
      },
      "num_examples": 1000,
      "runtime_seconds": 45.2,
      "cache_hit_rate": 0.8
    },
    "openlifescienceai/medmcqa": {
      "metrics": {
        "exact_match": 0.623,
        "accuracy": 0.623
      },
      "num_examples": 4183,
      "runtime_seconds": 120.5,
      "cache_hit_rate": 0.2
    }
  },
  "total_runtime": 165.7,
  "timestamp": "2025-01-15T10:30:00Z"
}
```
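If you post-process saved results, the per-dataset metrics can be pulled out with a few lines of Python. This is a minimal sketch assuming the result schema shown above; `summarize_results` is an illustrative helper, not part of KARMA:

```python
import json


def summarize_results(path):
    """Return {dataset: accuracy} from a saved KARMA results file."""
    with open(path) as f:
        data = json.load(f)
    return {
        dataset: entry["metrics"]["accuracy"]
        for dataset, entry in data["results"].items()
    }


# Example: print one line per dataset
# for name, acc in summarize_results("results.json").items():
#     print(f"{name}: {acc:.3f}")
```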
Common Workflows
Model Comparison
```bash
# Compare different model sizes
karma eval --model Qwen/Qwen3-0.6B --output qwen_0.6b.json
karma eval --model "Qwen/Qwen3-1.7B" --output qwen_1.7b.json

# Compare different models
karma eval --model Qwen/Qwen3-0.6B --output qwen_results.json
karma eval --model "google/medgemma-4b-it" --output medgemma_results.json
```
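Once both runs are saved, the two JSON files can be diffed programmatically. A minimal sketch, assuming the result format documented earlier; `compare_runs` is a hypothetical helper, not a KARMA API:

```python
import json


def compare_runs(path_a, path_b, metric="accuracy"):
    """Return (dataset, a, b, delta) rows for datasets present in both result files."""
    def load(path):
        with open(path) as f:
            return json.load(f)["results"]

    a_res, b_res = load(path_a), load(path_b)
    rows = []
    for dataset in sorted(set(a_res) & set(b_res)):  # only datasets both runs cover
        a = a_res[dataset]["metrics"][metric]
        b = b_res[dataset]["metrics"][metric]
        rows.append((dataset, a, b, b - a))
    return rows
```

A positive delta means the second run scored higher on that dataset.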
Dataset-Specific Evaluation
```bash
# Focus on specific medical domains

# Text-based QA
karma eval --model Qwen/Qwen3-0.6B \
  --datasets "openlifescienceai/pubmedqa,openlifescienceai/medmcqa,openlifescienceai/medqa"

# Vision-language tasks
karma eval --model Qwen/Qwen3-0.6B \
  --datasets "mdwiratathya/SLAKE-vqa-english,flaviagiammarino/vqa-rad"
```
Parameter Tuning
```bash
# Test different temperature settings
karma eval --model Qwen/Qwen3-0.6B \
  --model-args '{"temperature":0.1}' --output temp_0.1.json

karma eval --model Qwen/Qwen3-0.6B \
  --model-args '{"temperature":0.7}' --output temp_0.7.json

karma eval --model Qwen/Qwen3-0.6B \
  --model-args '{"temperature":1.0}' --output temp_1.0.json
```
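After the sweep finishes, you can scan the saved files to find the best-scoring setting. A minimal sketch, assuming the result format documented earlier; `best_sweep_file` is an illustrative helper, not a KARMA API:

```python
import glob
import json


def best_sweep_file(pattern, dataset, metric="accuracy"):
    """Return (path, score) of the sweep output with the highest score on `dataset`."""
    best = None
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            score = json.load(f)["results"][dataset]["metrics"][metric]
        if best is None or score > best[1]:
            best = (path, score)
    return best


# Example:
# best_sweep_file("temp_*.json", "openlifescienceai/pubmedqa")
```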