DermaSynth: Rich Synthetic Image-Text Pairs Using Open Access Dermatology Datasets
Journal:
arXiv
Published Date:
Jan 31, 2025
Abstract
A major barrier to developing vision large language models (LLMs) in
dermatology is the lack of large image--text pairs dataset. We introduce
DermaSynth, a dataset comprising of 92,020 synthetic image--text pairs curated
from 45,205 images (13,568 clinical and 35,561 dermatoscopic) for
dermatology-related clinical tasks. Leveraging state-of-the-art LLMs, using
Gemini 2.0, we used clinically related prompts and self-instruct method to
generate diverse and rich synthetic texts. Metadata of the datasets were
incorporated into the input prompts by targeting to reduce potential
hallucinations. The resulting dataset builds upon open access dermatological
image repositories (DERM12345, BCN20000, PAD-UFES-20, SCIN, and HIBA) that have
permissive CC-BY-4.0 licenses. We also fine-tuned a preliminary
Llama-3.2-11B-Vision-Instruct model, DermatoLlama 1.0, on 5,000 samples. We
anticipate this dataset to support and accelerate AI research in dermatology.
Data and code underlying this work are accessible at
https://github.com/abdurrahimyilmaz/DermaSynth.