

27 August 2025

Butterflies and conservation: largest AI dataset now released


The vast treasure trove of 540,000 images of 185 Austrian butterfly and moth species was collected over several years by thousands of volunteers. Using this resource, Innsbruck researcher Friederike Barkmann trained an AI model for biodiversity studies on European supercomputers. The data are now available to scientists worldwide for further research.


What makes this dataset special?

 

  • With 540,000 photos, it is the largest dataset of butterflies and moths worldwide.
  • All photos, models and scripts are now publicly accessible.


As part of the Schmetterlinge Österreichs project, more than 25,000 volunteers took hundreds of thousands of photos of Austrian butterflies and moths between 2016 and 2023 and made them available to the project. Using these images, Friederike Barkmann, ecologist and data scientist at the University of Innsbruck, trained an AI model to identify individual species automatically – saving both time and money. From the outset, it was clear that the AI model should also be shared with other researchers to help improve biodiversity research elsewhere.


What does this have to do with supercomputers?


Training an AI model on a dataset of 540,000 images requires enormous computing power – and this is precisely what supercomputing (also called high-performance computing, or HPC) provides. Friederike Barkmann first trained her model on the Innsbruck HPC system LEO5. When training on this machine became too time-consuming, HPC expert Andreas Lindner from EuroCC Austria helped her parallelise the computational tasks, that is, link multiple processors so that they share the workload, which significantly accelerates the process. After this step, Barkmann switched to LEONARDO, one of Europe’s largest supercomputers. This reduced training times by 90 percent. In the end, the AI model correctly identified 97 percent of butterfly species. Final fine-tuning was carried out on the LUMI supercomputer.
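
To make the idea of sharing a workload concrete, here is a minimal Python sketch of how a set of images could be divided among several workers, together with the arithmetic behind a 90 percent reduction in run time. It is an illustration only, not the project's actual training code: the function names and the choice of eight workers are assumptions made for this example.

```python
def shard_indices(num_items: int, num_workers: int) -> list[list[int]]:
    """Split item indices round-robin so each worker gets a near-equal share."""
    return [list(range(rank, num_items, num_workers)) for rank in range(num_workers)]


def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional reduction in run time into a speedup factor."""
    return 1.0 / (1.0 - reduction)


# Hypothetical setup: 540,000 images shared among 8 workers (assumed number).
shards = shard_indices(num_items=540_000, num_workers=8)
print([len(shard) for shard in shards])

# A 90 percent shorter run time corresponds to roughly a tenfold speedup.
print(round(speedup_from_reduction(0.90), 2))
```

In this sketch each worker receives 67,500 images, and no image is assigned twice. Real training frameworks also have to synchronise the model between workers after each step, which this example leaves out.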


Why butterflies and moths in particular?


Butterflies and moths are important indicators of biodiversity. Knowing the habitat and density of species provides valuable insights into climate change and global biodiversity.


What can the data be used for now?


The open dataset provides researchers worldwide with a foundation for training and testing their own AI models for species identification. This strengthens a wide range of research areas, including climate change studies.


Links


The dataset is available on figshare; the scripts for model training are available on GitHub.

The related data paper is published in Scientific Data (Springer Nature).

Project: Schmetterlinge Österreichs

Detailed EuroCC blog post on Friederike Barkmann’s project