Sanity benchmark

To verify that dataset loading, model invocation, and metric calculation are implemented correctly, we ran the model through our pipeline and reproduced its published numbers.
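
A sanity check of this kind can be sketched as a per-dataset comparison between the reported score and the score our pipeline produces. The snippet below is a minimal illustration, not the evaluation code used here: the `Comparison` class, the `compare_scores` helper, the dataset names, the 1.0-point tolerance, and all numeric values are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Comparison:
    """One dataset's reported score next to our reproduced score."""
    dataset: str
    reported: float
    reproduced: float
    tolerance: float

    @property
    def delta(self) -> float:
        # Positive when our reproduction scores higher than the reported value.
        return self.reproduced - self.reported

    @property
    def reproduced_ok(self) -> bool:
        # Treat the result as reproduced if it lies within the tolerance.
        return abs(self.delta) <= self.tolerance


def compare_scores(reported, reproduced, tolerance=1.0):
    """Compare scores per dataset, in alphabetical order.

    Datasets that appear in `reported` but not in `reproduced` are skipped.
    The default tolerance of 1.0 points is an assumed threshold.
    """
    return [
        Comparison(name, reported[name], reproduced[name], tolerance)
        for name in sorted(reported)
        if name in reproduced
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only, NOT results from any report.
    reported = {"dataset_a": 50.0, "dataset_b": 70.0}
    reproduced = {"dataset_a": 50.4, "dataset_b": 66.5}
    for c in compare_scores(reported, reproduced):
        status = "OK" if c.reproduced_ok else "MISMATCH"
        print(f"{c.dataset}: reported={c.reported:.1f} "
              f"reproduced={c.reproduced:.1f} delta={c.delta:+.1f} {status}")
```

A dataset flagged as a mismatch points to a likely difference in data loading, prompting, or metric computation, which is what the sanity benchmark is meant to surface.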

MedGemma-4B Reproduction

For MedGemma, we were able to reproduce, on most datasets, the results reported in its technical report and Hugging Face README.