Classifying Elephants

Published December 27, 2025

For my Computer Vision lectures I have used an elephants dataset. In this article we use four AI models, with very different architectures, to classify elephants.

My students often fail to understand that AI models are, compared to humans, very bad at generalizing. Generalization is something humans excel at: show an image of a real elephant to a three-year-old child and they will instantly recognize a drawing of an elephant as well.

Traditional model architectures, like ResNet or a Vision Transformer, trained on ImageNet, are quite bad at recognizing drawings of elephants.

More modern models, trained on very large web-scale datasets, are much better at recognizing drawings of elephants.

Models

The following models were used:

  1. ResNet
  2. Vision Transformer
  3. CLIP
  4. Florence2

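To make the comparison concrete, below is a minimal sketch of how the two ImageNet-trained classifiers can be run over the dataset with the Hugging Face `transformers` pipeline. The checkpoints (`microsoft/resnet-50`, `google/vit-base-patch16-224`) and the `train` split name are assumptions for illustration; the actual experiments are in the Colab notebooks linked at the end of this article.

```python
# Illustrative sketch: run ImageNet-trained classifiers over the elephant images
# and count how many get an elephant-related top-1 label.
from datasets import load_dataset
from transformers import pipeline

# Assumes the dataset has a "train" split with an "image" column.
dataset = load_dataset("MichielBontenbal/elephants", split="train")

# Assumed checkpoints; the article does not specify which ones were used.
for checkpoint in ["microsoft/resnet-50", "google/vit-base-patch16-224"]:
    classifier = pipeline("image-classification", model=checkpoint)
    hits = 0
    for example in dataset:
        top1 = classifier(example["image"], top_k=1)[0]["label"]
        # ImageNet elephant classes: "African elephant", "Indian elephant", "tusker"
        if "elephant" in top1.lower() or "tusker" in top1.lower():
            hits += 1
    print(f"{checkpoint}: {hits} / {len(dataset)} classified as elephant")
```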
Please see the results and a comparison of the models below.

| Model | Classified as elephant | Dataset / size | Model size | Remarks |
|---|---|---|---|---|
| ResNet (2015) | 5 / 15 | ImageNet, 1.4 M images | ? | |
| ViT (2020) | 5 / 15 | ImageNet, 1.4 M images | 346 MB | |
| CLIP (2022) | 8 / 15 | 400 M images | ? | Dataset not published |
| Florence2 (2024) | 13 / 15 | 129 M images | 1.5 GB | Highly curated dataset, ±5B annotations |
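
For CLIP, which is not tied to the 1,000 ImageNet classes, a "classified as elephant" score can be obtained with zero-shot classification against free-text labels. The sketch below is illustrative only; the candidate labels and the checkpoint `openai/clip-vit-base-patch32` are assumptions, not necessarily what produced the numbers above.

```python
# Illustrative sketch: zero-shot classification with CLIP using free-text labels.
from datasets import load_dataset
from transformers import pipeline

dataset = load_dataset("MichielBontenbal/elephants", split="train")

clip = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch32")
# Example candidate labels; the choice of labels affects the result.
candidate_labels = ["an elephant", "a rhinoceros", "a hippopotamus", "a dog", "a horse"]

hits = 0
for example in dataset:
    predictions = clip(example["image"], candidate_labels=candidate_labels)
    # Predictions come back sorted by score, highest first.
    if predictions[0]["label"] == "an elephant":
        hits += 1
print(f"CLIP: {hits} / {len(dataset)} classified as elephant")
```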

Why the newer models, CLIP and Florence2, generalize better than the older ones needs further analysis and critical evaluation. A plausible factor is that their web-scale training data includes drawings and illustrations, whereas ImageNet consists almost entirely of photographs.
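
One way to start that analysis is to probe Florence2 directly, for example by generating captions and checking whether they mention an elephant. The sketch below follows the public `microsoft/Florence-2-base` usage pattern; the checkpoint, the `<CAPTION>` task prompt and the keyword check are assumptions, not the protocol behind the 13 / 15 score above.

```python
# Illustrative sketch: caption each image with Florence-2 and check for "elephant".
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "microsoft/Florence-2-base"  # assumed checkpoint
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)

dataset = load_dataset("MichielBontenbal/elephants", split="train")
task = "<CAPTION>"  # Florence-2 selects its task via special prompt tokens

hits = 0
for example in dataset:
    image = example["image"].convert("RGB")
    inputs = processor(text=task, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=64,
        num_beams=3,
    )
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    caption = processor.post_process_generation(
        text, task=task, image_size=(image.width, image.height)
    )[task]
    if "elephant" in caption.lower():
        hits += 1
print(f"Florence2: {hits} / {len(dataset)} captions mention an elephant")
```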

Links

Colabs: https://drive.google.com/drive/folders/1rKMTRmqcLBpwHoXoTAfq0bjF7tR9QSrV

Dataset: https://huggingface.co/datasets/MichielBontenbal/elephants
