Skip to content

Add dataset

You can create custom datasets by inheriting from BaseMultimodalDataset and implementing the format_item method to return a properly formatted DataLoaderIterable:

from karma.eval_datasets.base_dataset import BaseMultimodalDataset
from karma.registries.dataset_registry import register_dataset
from karma.data_models.dataloader_iterable import DataLoaderIterable

Here we will use the register_dataset decorator to register and make the dataset discoverable to the CLI. This decorator also has information about the metric to use and any arguments that can be configured.

@register_dataset(
"my_medical_dataset",
metrics=["exact_match", "accuracy"],
task_type="mcqa",
required_args=["split"],
optional_args=["subset"],
default_args={"split": "test"}
)
class MyMedicalDataset(BaseMultimodalDataset):
"""Custom medical dataset."""
def __init__(self, split: str = "test", **kwargs):
self.split = split
super().__init__(**kwargs)
def load_data(self):
# Load your dataset
return your_dataset_loader(split=self.split)
def format_item(self, item):
"""Format each item into DataLoaderIterable format."""
# Example for text-based dataset
return DataLoaderIterable(
input=f"Question: {item['question']}\nChoices: {item['choices']}",
expected_output=item['answer'],
other_args={"question_id": item['id']}
)

In the class, we implement the format_item method to specify how the output will be like through the DataLoaderIterable See (DataLoaderIterable)[user-guide/datasets/data-loader-iterable] for more information.

For datasets that combine multiple modalities:

def format_item(self, item):
"""Format multi-modal item."""
return DataLoaderIterable(
input=f"Question: {item['question']}",
images=[item['image_bytes']], # List of image bytes
audio=item.get('audio_bytes'), # Optional audio
expected_output=item['answer'],
other_args={
"question_type": item['type'],
"difficulty": item['difficulty']
}
)

For datasets with multi-turn conversations:

from karma.data_models.dataloader_iterable import Conversation, ConversationTurn
def format_item(self, item):
"""Format conversation item."""
conversation_turns = []
for turn in item['conversation']:
conversation_turns.append(
ConversationTurn(
content=turn['content'],
role=turn['role'] # 'user' or 'assistant'
)
)
return DataLoaderIterable(
conversation=Conversation(conversation_turns=conversation_turns),
system_prompt=item.get('system_prompt', ''),
expected_output=item['expected_response']
)

The DataLoaderIterable format ensures that all datasets work seamlessly with any model type, whether it’s text-only, multi-modal, or conversation-based. Models receive the appropriate data fields and can process them according to their capabilities.

This guide will walk you through how to plug that dataset into KARMA’s evaluation pipeline. Let’s say we are trying to integrate an MCQA dataset.

  1. Organize Your Dataset Ensure your dataset is structured correctly.
    Each row should ideally include:

    • A question
    • A list of options (optional)
    • The correct answer
    • Optionally: metadata like category, generic name, or citation
  2. Set Up a Custom Dataset Class KARMA supports registering your own datasets using a decorator.

    @register_dataset(
    dataset_name="mcqa-local",
    split="test",
    metrics=["exact_match"],
    task_type="mcqa",
    )
    class LocalDataset(BaseMultimodalDataset):
    ...

    This decorator registers your dataset with KARMA for evaluations.

  3. Load your Dataset In your Dataset class, load your dataset file.
    You can use any format supported by pandas, such as CSV or Parquet.

    def __init__(self, ...):
    self.data_path = <path_to_your_dataset>
    if not os.path.exists(self.data_path):
    raise FileNotFoundError(f"Dataset file not found: {self.data_path}")
    self.df = pd.read_parquet(self.data_path)
    ...
  4. Implement the format_item Method Each row in your dataset will be converted into an input-output pair for the model.

    def format_item(self, sample: Dict[str, Any]) -> DataLoaderIterable:
    input_text = self._format_question(sample["data"])
    correct_answer = sample["data"]["ground_truth"]
    prompt = self.confinement_instructions.replace("<QUESTION>", input_text)
    dataloader_item = DataLoaderIterable(
    input=prompt, expected_output=correct_answer
    )
    dataloader_item.conversation = None
    return dataloader_item
  5. Iterate Over the Dataset Implement __iter__() to yield formatted examples.

    def __iter__(self) -> Generator[Dict[str, Any], None, None]:
    if self.dataset is None:
    self.dataset = list(self.load_eval_dataset())
    for idx, sample in enumerate(self.dataset):
    if self.max_samples is not None and idx >= self.max_samples:
    break
    item = self.format_item(sample)
    yield item
  6. Handle Model Output Extract the model’s predictions.

    def extract_prediction(self, response: str) -> Tuple[str, bool]:
    answer, success = "", False
    if "Final Answer:" in response:
    answer = response.split("Final Answer:")[1].strip()
    if answer.startswith("(") and answer.endswith(")"):
    answer = answer[1:-1]
    success = True
    return answer, success
  7. Yield Examples for Evaluation Read from your DataFrame and return structured examples.

    def load_eval_dataset(self, ...):
    for _, row in self.df.iterrows():
    prediction = None
    parsed_output = row.get("model_output_parsed", None)
    if isinstance(parsed_output, dict):
    prediction = parsed_output.get("prediction", None)
    yield {
    "id": row["index"],
    "data": {
    "question": row["question"],
    "options": row["options"],
    "ground_truth": row["ground_truth"],
    },
    "prediction": prediction,
    "metadata": {
    "generic_name": row.get("generic_name", None),
    "category": row.get("category", None),
    "citation": row.get("citation", None),
    },
    }