
# karma eval

The karma eval command is the core of KARMA, used to evaluate models on healthcare datasets.

```sh
karma eval [OPTIONS]
```

This command evaluates a specified model across one or more healthcare datasets, with support for dataset-specific arguments and rich console output.

| Option | Description |
| --- | --- |
| `--model TEXT` | Model name from registry (e.g., `Qwen/Qwen3-0.6B`, `google/medgemma-4b-it`) [required] |

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `--model-path TEXT` | TEXT | - | Model path (local path or HuggingFace model ID). If not provided, uses the path from model metadata |
| `--datasets TEXT` | TEXT | all | Comma-separated dataset names (default: evaluate on all datasets) |
| `--dataset-args TEXT` | TEXT | - | Dataset arguments in the format `dataset:key=val,key2=val2;dataset2:key=val` |
| `--processor-args TEXT` | TEXT | - | Processor arguments in the format `dataset.processor:key=val,key2=val2;dataset2.processor:key=val` |
| `--metric-args TEXT` | TEXT | - | Metric arguments in the format `metric_name:key=val,key2=val2;metric2:key=val` |
| `--batch-size INTEGER` | INTEGER | 8 | Batch size for evaluation (range 1-128) |
| `--cache / --no-cache` | FLAG | enabled | Enable or disable caching for evaluation |
| `--output TEXT` | TEXT | results.json | Output file path |
| `--format` | `table\|json` | table | Results display format |
| `--save-format` | `json\|yaml\|csv` | json | Results save format |
| `--progress / --no-progress` | FLAG | enabled | Show progress bars during evaluation |
| `--interactive` | FLAG | false | Interactively prompt for missing dataset, processor, and metric arguments |
| `--dry-run` | FLAG | false | Validate arguments and show what would be evaluated without running |
| `--model-config TEXT` | TEXT | - | Path to a model configuration file (JSON/YAML) with model-specific parameters |
| `--model-args TEXT` | TEXT | - | Model parameter overrides as a JSON string (e.g., `'{"temperature": 0.7, "top_p": 0.9}'`) |
| `--max-samples TEXT` | TEXT | - | Maximum number of samples to use for evaluation (helpful for testing) |
| `--verbose` | FLAG | false | Enable verbose output |
| `--refresh-cache` | FLAG | false | Skip cache lookup and force regeneration of all results |
Evaluate a model on a single dataset:

```sh
karma eval --model "Qwen/Qwen3-0.6B" --datasets "openlifescienceai/pubmedqa"
```
Evaluate on multiple datasets by passing a comma-separated list:

```sh
karma eval --model "Qwen/Qwen3-0.6B" --datasets "openlifescienceai/pubmedqa,openlifescienceai/medmcqa"
```
Pass dataset-specific arguments with `--dataset-args`:

```sh
karma eval --model "ai4bharat/indic-conformer-600m-multilingual" \
  --datasets "ai4bharat/IN22-Conv" \
  --dataset-args "ai4bharat/IN22-Conv:source_language=en,target_language=hi"
```
Pass processor-specific arguments with `--processor-args`:

```sh
karma eval --model "ai4bharat/indic-conformer-600m-multilingual" \
  --datasets "ai4bharat/IN22-Conv" \
  --processor-args "ai4bharat/IN22-Conv.devnagari_transliterator:source_script=en,target_script=hi"
```
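The same `scope:key=val,key2=val2;scope2:key=val` syntax is shared by `--dataset-args`, `--processor-args`, and `--metric-args`. As a rough illustration of how such a string breaks down, here is a minimal Python sketch (a hypothetical helper for explanation only, not KARMA's actual parser):

```python
def parse_scoped_args(spec: str) -> dict:
    """Parse 'scope:key=val,key2=val2;scope2:key=val' into nested dicts.

    Illustrative only: real CLI parsing may handle escaping and
    validation differently.
    """
    result = {}
    for block in spec.split(";"):
        scope, _, pairs = block.partition(":")
        # Each pair is 'key=val'; split on the first '=' only so values
        # containing '=' survive intact.
        result[scope] = dict(pair.split("=", 1) for pair in pairs.split(",") if pair)
    return result

parse_scoped_args("ai4bharat/IN22-Conv:source_language=en,target_language=hi")
# {'ai4bharat/IN22-Conv': {'source_language': 'en', 'target_language': 'hi'}}
```

Semicolons separate scopes (datasets, processors, or metrics), and commas separate key/value pairs within one scope.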
Configure metrics with `--metric-args` (here, rubric-based evaluation):

```sh
karma eval --model "Qwen/Qwen3-0.6B" \
  --datasets "Tonic/Health-Bench-Eval-OSS-2025-07" \
  --metric-args "rubric_evaluation:provider_to_use=openai,model_id=gpt-4o-mini,batch_size=5"
```
Load model parameters from a configuration file:

```sh
karma eval --model "Qwen/Qwen3-0.6B" \
  --datasets "openlifescienceai/pubmedqa" \
  --model-config "config/qwen_medical.json"
```
Override model parameters inline as a JSON string:

```sh
karma eval --model "Qwen/Qwen3-0.6B" \
  --datasets "openlifescienceai/pubmedqa" \
  --model-args '{"temperature": 0.3, "max_tokens": 1024, "enable_thinking": true}'
```
Limit the sample count for a quick test run, with verbose output:

```sh
karma eval --model "Qwen/Qwen3-0.6B" \
  --datasets "openlifescienceai/pubmedqa" \
  --max-samples 10 --verbose
```
Prompt interactively for any missing dataset, processor, and metric arguments:

```sh
karma eval --model "Qwen/Qwen3-0.6B" --interactive
```
Validate arguments without running the evaluation:

```sh
karma eval --model "Qwen/Qwen3-0.6B" \
  --datasets "openlifescienceai/pubmedqa" \
  --dry-run --model-args '{"temperature": 0.5}'
```
Skip the cache and force regeneration of all results:

```sh
karma eval --model "Qwen/Qwen3-0.6B" \
  --datasets "openlifescienceai/pubmedqa" \
  --refresh-cache
```

Model parameters are applied in the following priority order (highest to lowest):

  1. CLI --model-args - Highest priority
  2. Config file (--model-config) - Overrides metadata defaults
  3. Model metadata defaults - From registry
  4. CLI --model-path - Sets model path if metadata doesn’t provide one
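Conceptually, this precedence amounts to a layered dictionary merge in which later (higher-priority) layers override earlier ones. A small illustrative sketch, not KARMA's internal code:

```python
def resolve_model_params(metadata_defaults: dict, config_file: dict, cli_args: dict) -> dict:
    """Merge parameter layers; later layers win on key conflicts."""
    merged: dict = {}
    for layer in (metadata_defaults, config_file, cli_args):  # lowest -> highest priority
        merged.update(layer)
    return merged

params = resolve_model_params(
    {"temperature": 1.0, "top_p": 0.9},  # model metadata defaults (registry)
    {"temperature": 0.7},                # --model-config file
    {"temperature": 0.3},                # --model-args on the CLI
)
# temperature resolves to 0.3 (CLI wins); top_p stays 0.9 (only metadata sets it)
```

Keys not set by a higher-priority source fall through to the layer below, which is why metadata defaults still apply when a config file or CLI override only touches some parameters.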
Example model configuration file in JSON:

```json
{
  "temperature": 0.7,
  "max_tokens": 2048,
  "top_p": 0.9,
  "enable_thinking": true
}
```

The same configuration as YAML:

```yaml
temperature: 0.7
max_tokens: 2048
top_p: 0.9
enable_thinking: true
```
List available models:

```sh
karma list models
```

List available datasets:

```sh
karma list datasets
```
A common pitfall: the value of `--model-args` must be valid JSON, so keys and string values need double quotes inside the shell's single quotes:

```sh
# Wrong: unquoted keys are not valid JSON
--model-args '{temperature: 0.7}'

# Correct
--model-args '{"temperature": 0.7}'
```
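The distinction is easy to check with any JSON parser; Python's `json` module, for example, accepts the double-quoted form and rejects the unquoted one:

```python
import json

# Valid JSON: key and string values are double-quoted.
ok = json.loads('{"temperature": 0.7}')
assert ok == {"temperature": 0.7}

# Invalid JSON: the unquoted key raises a JSONDecodeError.
try:
    json.loads("{temperature: 0.7}")
except json.JSONDecodeError:
    print("rejected")  # this branch runs
```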