KARMA is designed for researchers, developers, and healthcare organizations who need reliable evaluation of medical AI systems.
Extensible
Bring your own model, dataset, or even metric. Integrates with Hugging Face and also supports local evaluation.
Fast & Efficient
Process thousands of medical examples efficiently with intelligent caching and batch processing.
Multi-Modal Ready
Supports text, image, and audio evaluation across multiple datasets.
Model Agnostic
Works with any model (Qwen, MedGemma, Bedrock-SDK, OpenAI-SDK, or your own custom architecture) through a unified interface.
Get started with KARMA in minutes:
# Install KARMA
pip install karma-medeval
# Run your first evaluation
karma eval --model "Qwen/Qwen3-0.6B" --datasets openlifescienceai/pubmedqa --max-samples 3
$ karma eval --model "Qwen/Qwen3-0.6B" --datasets openlifescienceai/pubmedqa --max-samples 3
{
  "openlifescienceai/pubmedqa": {
    "metrics": {
      "exact_match": {
        "score": 0.3333333333333333,
        "evaluation_time": 0.9702351093292236,
        "num_samples": 3
      }
    },
    "task_type": "mcqa",
    "status": "completed",
    "dataset_args": {},
    "evaluation_time": 7.378399848937988
  },
  "_summary": {
    "model": "Qwen/Qwen3-0.6B",
    "model_path": "Qwen/Qwen3-0.6B",
    "total_datasets": 1,
    "successful_datasets": 1,
    "total_evaluation_time": 7.380354166030884,
    "timestamp": "2025-07-22 18:43:07"
  }
}
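Because the result is plain JSON, it can be post-processed with any JSON library. The following is a minimal Python sketch, assuming the output of the sample run has already been captured as a string (how you capture it is left open here and is not documented on this page). The `summarize` helper is hypothetical and not part of KARMA; it relies only on the keys visible in the sample output.

```python
import json

# Result text copied from the sample run (abbreviated to the fields used here).
raw = """
{
  "openlifescienceai/pubmedqa": {
    "metrics": {
      "exact_match": {
        "score": 0.3333333333333333,
        "evaluation_time": 0.9702351093292236,
        "num_samples": 3
      }
    },
    "task_type": "mcqa",
    "status": "completed"
  },
  "_summary": {
    "model": "Qwen/Qwen3-0.6B",
    "total_datasets": 1,
    "successful_datasets": 1
  }
}
"""


def summarize(results: dict) -> list[tuple[str, str, float, int]]:
    """Hypothetical helper: return (dataset, metric, score, num_samples) rows.

    The "_summary" entry describes the whole run rather than a dataset,
    so it is skipped.
    """
    rows = []
    for dataset, entry in results.items():
        if dataset == "_summary":
            continue
        for metric, values in entry["metrics"].items():
            rows.append((dataset, metric, values["score"], values["num_samples"]))
    return rows


results = json.loads(raw)
rows = summarize(results)
for dataset, metric, score, n in rows:
    print(f"{dataset}  {metric}={score:.3f}  (n={n})")
```

With only three samples the score moves in steps of one third, so treat a run like this as a smoke test of the setup, not as a measurement of model quality.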
Installation
Multiple installation methods with uv, pip, or development setup.
Basic Usage
Learn the CLI commands and start evaluating your first model.
Add Your Own
Extend KARMA with custom models, datasets, and evaluation metrics.
Supported Resources
Complete list of available models, datasets, and metrics.
Ready to evaluate your medical AI models? Get started with installation →