Core ML Models

Build intelligence into your apps using machine learning models from the research community designed for Core ML.

Models are in Core ML format and can be integrated into Xcode projects. You can select from different versions of each model to optimize for size and architecture.
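
For example, a downloaded package can be compiled and loaded at runtime with the Core ML framework (adding the file to an Xcode project and using the class Xcode generates works just as well). A minimal Swift sketch, with a placeholder file URL standing in for wherever the app stored the download:

    import CoreML

    // Compile the downloaded .mlpackage (or .mlmodel) into an .mlmodelc bundle,
    // then load it with a configuration that picks the compute units to use.
    func loadModel(from packageURL: URL) throws -> MLModel {
        let configuration = MLModelConfiguration()
        configuration.computeUnits = .all   // CPU, GPU, and Neural Engine as available

        let compiledURL = try MLModel.compileModel(at: packageURL)
        return try MLModel(contentsOf: compiledURL, configuration: configuration)
    }

    // Usage (placeholder path):
    // let model = try loadModel(from: URL(fileURLWithPath: "/path/to/FastViTT8F16.mlpackage"))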

FastViT

Image Classification

A Fast Hybrid Vision Transformer architecture trained to classify the dominant object in a camera frame or image.


Model Info

Summary

FastViT is a general-purpose, hybrid vision transformer model, trained on the ImageNet dataset, that provides a state-of-the-art accuracy/latency trade-off.

The model's high performance, low latency, and robustness against out-of-distribution samples result from three novel architectural strategies:

  • Structural reparameterization
  • Linear training-time overparameterization
  • Use of large kernel convolutions

FastViT consistently outperforms competing robust architectures on mobile and desktop GPU platforms across a wide range of computer vision tasks such as image classification, object detection, semantic segmentation, and 3D mesh regression.
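
As a sketch of how the classification variants can be used from Swift, the snippet below runs a frame through Vision, assuming the package has been added to an Xcode project so that a class named FastViTT8F16 is generated (the class name follows the file name and is an assumption here):

    import CoreML
    import Vision

    // Classify the dominant object in a CGImage using the Vision framework.
    func classifyDominantObject(in image: CGImage) throws {
        // FastViTT8F16 is the class Xcode is expected to generate from
        // FastViTT8F16.mlpackage (assumed name).
        let visionModel = try VNCoreMLModel(for: FastViTT8F16(configuration: MLModelConfiguration()).model)

        let request = VNCoreMLRequest(model: visionModel) { request, _ in
            // Classifier models come back as ranked class labels.
            guard let results = request.results as? [VNClassificationObservation],
                  let top = results.first else { return }
            print("Top label: \(top.identifier) (\(top.confidence))")
        }
        request.imageCropAndScaleOption = .centerCrop

        try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])
    }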

Use Cases

Image classification, object detection, semantic segmentation, 3D mesh regression

Variants

    Model Name | Size
    FastViTMA36F16.mlpackage | 88.3MB
    FastViTT8F16.mlpackage | 8.2MB
    FastViTMA36F16Headless.mlpackage | 85.8MB
    FastViTT8F16Headless.mlpackage | 6.5MB

    Variant | Parameters | Size (MB) | Weight Precision | Activation Precision
    T8 | 3.6M | 7.8 | Float16 | Float16
    MA36 | 42.7M | 84 | Float16 | Float16

Inference Time

    Variant | Device | OS | Inference Time (ms) | Compute Unit
    T8 F16 | iPhone 16 Pro | 18.3 | 0.52 | All
    T8 F16 | iPhone 15 Pro Max | 17.6 | 0.67 | All
    T8 F16 | iPhone 15 Plus | 17.6 | 0.73 | All
    T8 F16 | iPhone 14 Plus | 17.6 | 0.82 | All
    T8 F16 | iPhone 13 Pro Max | 17.6 | 0.83 | All
    T8 F16 | MacBook Pro M3 Max | 14.4 | 0.62 | All
    MA36 F16 | iPhone 16 Pro | 18.3 | 2.78 | All
    MA36 F16 | iPhone 15 Pro Max | 17.6 | 3.33 | All
    MA36 F16 | iPhone 15 Plus | 17.6 | 3.47 | All
    MA36 F16 | iPhone 14 Plus | 17.6 | 4.56 | All
    MA36 F16 | iPhone 13 Pro Max | 17.6 | 4.47 | All
    MA36 F16 | MacBook Pro M2 Max | 15.0 | 2.94 | All
    MA36 F16 | MacBook Pro M1 Max | 15.0 | 4 | All
    MA36 F16 | iPad Pro 5th Gen | 17.5 | 3.35 | All

Depth Anything V2

Depth Estimation

The Depth Anything model performs monocular depth estimation.


Model Info

Summary

Depth Anything v2 is a foundation model for monocular depth estimation. It maintains the strengths and rectifies the weaknesses of the original Depth Anything by refining the powerful data curation engine and teacher-student pipeline.

To train a teacher model, Depth Anything v2 uses purely synthetic, computer-generated images. This avoids problems created by using real images, whose noisy annotations and low resolution can limit monocular depth-estimation performance. The teacher model then predicts depth for unlabeled real images, and only this new, pseudo-labeled data is used to train a student model, which helps avoid a distribution shift between synthetic and real images.

On the depth estimation task, Depth Anything v2 outperforms v1, especially in robustness, inference speed, and the handling of image depth properties such as fine-grained details, transparent objects, reflections, and complex scenes. Its refined data curation approach yields competitive performance on standard datasets (including KITTI, NYU-D, Sintel, ETH3D, and DIODE) and a more than 9% accuracy improvement over v1 and other community models on the new DA-2K evaluation set built for depth estimation.

Depth Anything v2 is offered at a range of model scales and inference efficiencies to support a wide variety of applications, and it generalizes well when fine-tuned for downstream tasks. It can be used in any application requiring depth estimation, such as 3D reconstruction, navigation, autonomous driving, and image or video generation.
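
A minimal Swift sketch of requesting a depth map through Vision, assuming the package is added to Xcode as a generated class named DepthAnythingV2SmallF16 and that the depth map is exposed as the model's only output (both assumptions; check the package's output description in Xcode):

    import CoreML
    import CoreVideo
    import Vision

    // Estimate depth for a CGImage and inspect the returned depth map.
    func estimateDepth(for image: CGImage) throws {
        let visionModel = try VNCoreMLModel(
            for: DepthAnythingV2SmallF16(configuration: MLModelConfiguration()).model)  // assumed class name

        let request = VNCoreMLRequest(model: visionModel) { request, _ in
            // An image-typed output arrives as a pixel buffer; a tensor-typed
            // output arrives as an MLMultiArray-backed feature value.
            if let depth = (request.results?.first as? VNPixelBufferObservation)?.pixelBuffer {
                print("Depth map: \(CVPixelBufferGetWidth(depth)) x \(CVPixelBufferGetHeight(depth))")
            } else if let array = (request.results?.first as? VNCoreMLFeatureValueObservation)?
                        .featureValue.multiArrayValue {
                print("Depth array shape: \(array.shape)")
            }
        }
        request.imageCropAndScaleOption = .scaleFill

        try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])
    }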

Use Cases

Depth estimation, semantic segmentation

Variants

    Model Name | Size
    DepthAnythingV2SmallF16.mlpackage | 49.8MB
    DepthAnythingV2SmallF16P6.mlpackage | 19MB

    Variant | Parameters | Size (MB) | Weight Precision | Activation Precision
    F32 | 24.8M | 99.2 | Float32 | Float32
    F16 | 24.8M | 49.8 | Float16 | Float16

Inference Time

    Variant | Device | OS | Inference Time (ms) | Compute Unit
    Small F16 | iPhone 16 Pro | 18.3 | 26.21 | All
    Small F16 | iPhone 15 Pro Max | 17.4 | 33.90 | All
    Small F16 | MacBook Pro M1 Max | 15.0 | 33.48 | All
    Small F16 | MacBook Pro M1 Max | 15.0 | 32.78 | GPU

DETR Resnet50 Semantic Segmentation

Semantic Segmentation

The DEtection TRansformer (DETR) model, trained for object detection and panoptic segmentation, configured to return semantic segmentation masks.


Model Info

Summary

The DETR model is an encoder/decoder transformer with a convolutional backbone trained on the COCO 2017 dataset. It blends a set of proven ML strategies to detect and classify objects in images more elegantly than standard object detectors can, while matching their performance.

The model is trained with a loss function that performs bipartite matching between predicted and ground-truth objects. At inference time, DETR applies self-attention to an image globally to predict all objects at once. Thanks to global attention, the model outperforms standard object detectors on large objects but underperforms on small objects. Despite this limitation, DETR demonstrates accuracy and run-time performance on par with other highly optimized architectures when evaluated on the challenging COCO dataset.

DETR can be easily reproduced in any framework that provides standard CNN and transformer classes. It also generalizes readily to more complex tasks, such as panoptic segmentation, which only requires a simple segmentation head trained on top of a pre-trained DETR.

DETR avoids clunky surrogate tasks and hand-designed components that traditional architectures require to achieve acceptable performance and instead provides a conceptually simple, easily reproducible approach that streamlines the object detection pipeline.
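
A hedged Swift sketch of reading the segmentation output through Vision, assuming an Xcode-generated class named DETRResnet50SemanticSegmentationF16 and that the mask is returned as a multi-array of per-pixel class indices (output name and layout should be confirmed against the package description):

    import CoreML
    import Vision

    // Run semantic segmentation and inspect the per-pixel class mask.
    func segment(_ image: CGImage) throws {
        let visionModel = try VNCoreMLModel(
            for: DETRResnet50SemanticSegmentationF16(configuration: MLModelConfiguration()).model)  // assumed class name

        let request = VNCoreMLRequest(model: visionModel) { request, _ in
            // Non-classifier outputs are surfaced as feature-value observations.
            guard let mask = (request.results?.first as? VNCoreMLFeatureValueObservation)?
                    .featureValue.multiArrayValue else { return }
            // Assumed to be a [height, width] array of class indices.
            print("Segmentation mask shape: \(mask.shape)")
        }

        try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])
    }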

Use Cases

Object detection, panoptic segmentation

Variants

    Model Name | Size
    DETRResnet50SemanticSegmentationF16.mlpackage | 85.5MB
    DETRResnet50SemanticSegmentationF16P8.mlpackage | 43.1MB

    Variant | Parameters | Size (MB) | Weight Precision | Activation Precision
    F32 | 43M | 171 | Float32 | Float32
    F16 | 43M | 86 | Float16 | Float16

Inference Time

    Variant | Device | OS | Inference Time (ms) | Compute Unit
    F16 | iPhone 16 Pro | 18.3 | 34.32 | All
    F16 | iPhone 15 Pro Max | 17.6 | 39 | All
    F16 | iPhone 15 Plus | 17.6 | 43 | All
    F16 | iPhone 14 Plus | 17.6 | 50 | All
    F16 | iPhone 14 | 17.5 | 51 | All
    F16 | iPhone 13 Pro Max | 17.6 | 51 | All
    F16 | MacBook Pro M1 Max | 15.0 | 117 | All
    F16 | MacBook Pro M1 Max | 15.0 | 43 | GPU
    F16P8 | iPhone 16 Pro | 18.3 | 32.23 | All
    F16P8 | iPhone 15 Plus | 18.0 | 40.73 | All
    F16P8 | iPhone 13 Pro Max | 17.6 | 51.53 | All
    F16P8 | MacBook Pro M1 Max | 15.0 | 36.52 | All
    F16P8 | MacBook Pro M1 Max | 15.0 | 33.14 | GPU
    F16P8 | iPad Pro 5th Generation | 18.0 | 62.49 | All
    F16P8 | iPad Pro 4th Generation | 18.0 | 1224 | All

BERT-SQuAD

Question Answering

Find answers to questions about paragraphs of text.


Model Info

Summary

BERT (Bidirectional Encoder Representations from Transformers) is a language representation model that uses fine-tuning-based approaches to apply pre-trained representations to downstream NLP tasks. In the case of BERT-SQuAD, the downstream NLP task is context-based Question Answering.

BERT's multilayer, bidirectional transformer encoder architecture is used across both pre-training and fine-tuning steps. BERT-SQuAD adapts it for extracting precise answers given a question and a related context from the Stanford Question Answering Dataset (SQuAD).

BERT is pre-trained on the BooksCorpus and English Wikipedia text passages using two unsupervised pre-training tasks. It uses a masked language model task to pre-train a deep, bidirectional self-attention transformer and a next-sentence prediction task to jointly pre-train text-pair representations that are conditioned on both left and right context in all layers.

For fine-tuning, BERT-SQuAD is initialized with the parameters obtained during pre-training. Then, all of the parameters are fine-tuned using labeled data from the Stanford Question Answering Dataset.

In general, fine-tuning BERT to your specific NLP task is straightforward and inexpensive: all token-level and sentence-level task-specific models in the BERT paper were formed by incorporating BERT with just one additional output layer.
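
As a sketch of what a single prediction looks like in Swift: the app is responsible for WordPiece-tokenizing the question and context, and for mapping the predicted answer span back to the original text. The input names below ("wordIDs", "wordTypes") are assumptions to verify against the model description in Xcode.

    import CoreML

    // Run one question-answering prediction on pre-tokenized input.
    func predictAnswerSpan(wordIDs: MLMultiArray, wordTypes: MLMultiArray, model: MLModel) throws {
        let input = try MLDictionaryFeatureProvider(dictionary: [
            "wordIDs": MLFeatureValue(multiArray: wordIDs),     // assumed input name
            "wordTypes": MLFeatureValue(multiArray: wordTypes)  // assumed input name
        ])
        let output = try model.prediction(from: input)
        // The outputs are expected to score candidate start and end positions of
        // the answer within the context tokens.
        print("Output features: \(output.featureNames)")
    }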

Use Cases

Question answering

Variants

    Model Name | Size
    BERTSQUADFP16.mlmodel | 217.8MB

DeepLabv3

Image Segmentation

Segment the pixels of a camera frame or image into a predefined set of classes.


Model Info

Summary

DeepLabv3 is a multi-module deep learning architecture for semantic image segmentation, a task that involves assigning a category label to each pixel in an image. Much like its previous versions, DeepLabv3 uses atrous (dilated) convolution for this task, which explicitly adjusts the filter's field-of-view and controls the resolution of the feature responses computed by Deep Convolutional Neural Networks (DCNNs).

DeepLabv3’s modules employ atrous convolution in parallel to capture multi-scale context by adopting multiple atrous rates. This technique helps to more accurately segment objects at varying scales. The system also uses an augmented version of Atrous Spatial Pyramid Pooling, a module that probes convolutional features at multiple scales with image-level features. This technique encodes global context and further boosts performance.

The DeepLabv3 system significantly improves performance over previous DeepLab versions, without needing to use DenseCRF post-processing to improve the localization of object boundaries. More generally, every version of the DeepLab system addresses the two other classic problems that come from using DCNNs for semantic image segmentation: reduced feature resolution and the existence of objects at multiple scales.
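
To make the output concrete, the sketch below tallies how many pixels fall into each class, assuming the prediction has already produced a two-dimensional multi-array of class indices (for example via a Vision request like the ones above); that layout is an assumption to verify against the model description.

    import CoreML

    // Count how many pixels were assigned to each class in a [height, width]
    // MLMultiArray of class indices.
    func classHistogram(from mask: MLMultiArray) -> [Int: Int] {
        var counts: [Int: Int] = [:]
        let height = mask.shape[0].intValue
        let width = mask.shape[1].intValue
        for y in 0..<height {
            for x in 0..<width {
                let label = mask[[NSNumber(value: y), NSNumber(value: x)]].intValue
                counts[label, default: 0] += 1
            }
        }
        return counts
    }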

Use Cases

Semantic segmentation

Variants

    Model Name | Size
    DeepLabV3.mlmodel | 8.6MB
    DeepLabV3FP16.mlmodel | 4.3MB
    DeepLabV3Int8LUT.mlmodel | 2.3MB

MNIST

Drawing Classification

Classify a single handwritten digit (supports digits 0-9).


Model Info

Summary

The Modified National Institute of Standards and Technology (MNIST) dataset is a collection of handwritten digits from zero to nine, designed for training and testing image classification systems. It includes 60,000 training images and 10,000 testing images, all of which are grayscale and 28×ばつ28 pixels in size.

The MNIST Classifier model in this gallery was trained using the same technique as Turi Create's Drawing Classifier toolkit: it is a convolutional neural network (CNN) consisting of three convolutions with Rectified Linear Unit (ReLU) activations, followed by max pooling, and two fully connected layers at the end.
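
A small Swift sketch of preparing a user's drawing for this model: Core ML can convert a CGImage to match the model's 28 x 28 grayscale input constraint. The input name "image" is an assumption; the real name appears in the model description in Xcode.

    import CoreML
    import CoreGraphics

    enum DigitInputError: Error { case missingImageInput }

    // Build a model input directly from a CGImage of the drawing.
    func digitInput(for drawing: CGImage, model: MLModel) throws -> MLFeatureProvider {
        // "image" is an assumed input name.
        guard let constraint = model.modelDescription
                .inputDescriptionsByName["image"]?.imageConstraint else {
            throw DigitInputError.missingImageInput
        }
        // Core ML scales and converts the drawing to satisfy the constraint.
        let value = try MLFeatureValue(cgImage: drawing, constraint: constraint, options: nil)
        return try MLDictionaryFeatureProvider(dictionary: ["image": value])
    }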

Use Cases

Digit classification, digit recognition, image classification

Variants

    Model Name | Size
    MNISTClassifier.mlmodel | 395KB

MobileNetV2

Image Classification

The MobileNetv2 architecture trained to classify the dominant object in a camera frame or image.


Model Info

Summary

MobileNetV2 is a neural network architecture for image classification, designed specifically to reduce computation and memory use on resource-constrained systems such as mobile phones.

The network uses a sequence of convolution, batch normalization, and a Rectified Linear Unit (ReLU) as bookends to its novel module, which is made up of a series of residual bottleneck layers. It follows this sequence with adaptive average pooling and a linear classifier head. The new module takes in a low-dimensional compressed feature map of the image, expands it to a high dimension with point-wise convolution, filters it with a depth-wise convolution, and projects the resulting features back to a low-dimensional representation with a linear convolution. This new architecture helps the network strike an optimal balance between accuracy and performance.

MobileNetV2 improves Top-1 accuracy over MobileNetV1 on the ImageNet classification task, with fewer parameters and less CPU usage. On the COCO object detection task, MobileNetV2 also achieves Mean Average Precision (mAP) competitive with YOLOv2 and MobileNetV1, with significantly fewer parameters and lower computational complexity.
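
As an alternative to routing through Vision, the Xcode-generated interface can be called directly. The sketch below assumes a generated class named MobileNetV2 whose input is an image named image and whose output includes a classLabel string; these names match the usual convention for this bundled classifier but should be confirmed in Xcode.

    import CoreML
    import CoreVideo

    // Direct prediction through the Xcode-generated class (assumed names).
    func classify(_ pixelBuffer: CVPixelBuffer) throws {
        let configuration = MLModelConfiguration()
        configuration.computeUnits = .all   // or .cpuAndGPU / .cpuOnly
        let model = try MobileNetV2(configuration: configuration)

        let output = try model.prediction(image: pixelBuffer)
        print("Predicted label: \(output.classLabel)")
    }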

Use Cases

Object classification, object detection, semantic segmentation

Variants

    Model Name | Size
    MobileNetV2.mlmodel | 24.7MB
    MobileNetV2FP16.mlmodel | 12.4MB
    MobileNetV2Int8LUT.mlmodel | 6.3MB

ResNet-50

Image Classification

A Residual Neural Network trained to classify the dominant object in a camera frame or image.


Model Info

Summary

ResNet-50 is a residual learning framework that accelerates the training of deep neural networks. It is pre-trained on the ImageNet dataset, which contains 1000 diverse object categories, to learn general features that are useful for a range of image classification tasks.

Deep neural networks are difficult to train with traditional architectures because training slows as networks get deeper, and many more training epochs are needed to minimize the loss. ResNet-50 addresses this challenge by framing it as a residual learning problem. It groups layers of the network into modular collections called residual blocks and uses links called shortcut connections to funnel data both through and around these blocks. Each layer within a residual block passes its output forward to the next layer as usual, while the shortcut connection adds the block's original input to the block's output before passing the combination on as input to the next block.

The residual blocks and shortcut connections in the ResNet-50 architecture accelerate training and minimize complexity. A network built with this architecture won first place in the image classification task at the 2015 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). It also achieved 83.8% mean average precision (mAP) on the PASCAL VOC 2012 test set, ten points higher than the previous state-of-the-art result in 2015.
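
Classifier models like this one typically expose a label-to-probability dictionary alongside the top label; a short sketch of ranking the five most probable classes from such a dictionary (the property name classLabelProbs is an assumption to check against the generated interface):

    // Rank the five most probable labels from a classifier's
    // label-to-probability dictionary (e.g. output.classLabelProbs).
    func topFive(from classLabelProbs: [String: Double]) -> [(label: String, probability: Double)] {
        return classLabelProbs
            .sorted { 0ドル.value > 1ドル.value }
            .prefix(5)
            .map { (label: 0ドル.key, probability: 0ドル.value) }
    }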

Use Cases

Object detection, semantic segmentation, image recognition, image localization

Variants

    Model Name | Size
    Resnet50.mlmodel | 102.6MB
    Resnet50FP16.mlmodel | 51.3MB
    Resnet50Int8LUT.mlmodel | 25.8MB
    Resnet50Headless.mlmodel | 94.4MB

Updatable Drawing Classifier

Drawing Classification

A drawing classifier that learns to recognize new drawings, based on a K-Nearest Neighbors (KNN) model.


Model Info

Summary

The Updatable Drawing Classifier is a model that can be used to train a simple drawing or sketch classifier based on user examples. The model is a pipeline composed of a drawing-embedding model used as a feature extractor, and a nearest-neighbor classifier operating on the embeddings.

The embedding model takes in a 28x28 grayscale image and outputs a 128-dimensional float vector. The nearest-neighbor classifier takes that vector as input and predicts a label for it based on three nearest neighbors, along with a probability for that label.
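
Personalization happens on device with an MLUpdateTask. The sketch below is a minimal outline: it assumes the model's training inputs are an image feature named "drawing" and a string named "label" (assumed names; the real ones are listed in the model's update description), and it leaves persisting the updated model to disk out of scope.

    import CoreML

    // Update the compiled classifier in place with user-provided examples.
    func personalize(compiledModelURL: URL,
                     examples: [(drawing: MLFeatureValue, label: String)]) throws {
        // One feature provider per training example; feature names are assumptions.
        let providers: [MLFeatureProvider] = try examples.map {
            try MLDictionaryFeatureProvider(dictionary: [
                "drawing": 0ドル.drawing,
                "label": MLFeatureValue(string: 0ドル.label)
            ])
        }

        let task = try MLUpdateTask(forModelAt: compiledModelURL,
                                    trainingData: MLArrayBatchProvider(array: providers),
                                    configuration: nil,
                                    completionHandler: { context in
            // context.model is the updated in-memory model, ready for predictions;
            // writing it back to disk is omitted here.
            print("Update finished in state: \(context.task.state.rawValue)")
        })
        task.resume()
    }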

Use Cases

Drawing classification

Variants

    Model Name | Size
    UpdatableDrawingClassifier.mlmodel | 382KB

YOLOv3

Object Detection

Locate and classify 80 different types of objects present in a camera frame or image.


Model Info

Summary

YOLOv3 is the third iteration in the You Only Look Once (YOLO) series of object detection algorithms. It provides real-time object detection by using a single neural network to predict the bounding boxes and class probabilities of objects in an image or video.

The original YOLO algorithm introduced a new approach to object detection: it employed batch normalization (BN) and leaky ReLU activations and united the feature extraction and object localization steps, along with the localization and classification heads, in a single network. This single-stage architecture accelerated inference time and set a new state of the art for object detection models when it was first published.

YOLOv3 uses a new feature extractor called DarkNet-53. Inspired by ResNet and Feature Pyramid Networks (FPN), this 53-layer convolutional backbone uses techniques from those architectures, such as skip connections and residual blocks, and feeds three prediction heads that process each image at different spatial compressions. Together, these techniques help the model detect objects of varying scales in an image and maintain performance over a wide range of input resolutions.
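
A Swift sketch of reading detections through Vision, assuming the model is added to Xcode as a generated class named YOLOv3Tiny (assumed name); object detectors with the appropriate metadata surface their results as recognized-object observations with bounding boxes and labels.

    import CoreML
    import Vision

    // Detect and label objects in a CGImage.
    func detectObjects(in image: CGImage) throws {
        let visionModel = try VNCoreMLModel(
            for: YOLOv3Tiny(configuration: MLModelConfiguration()).model)  // assumed class name

        let request = VNCoreMLRequest(model: visionModel) { request, _ in
            guard let detections = request.results as? [VNRecognizedObjectObservation] else { return }
            for detection in detections {
                // boundingBox is in normalized coordinates with the origin at the bottom left.
                let bestLabel = detection.labels.first?.identifier ?? "unknown"
                print("\(bestLabel) at \(detection.boundingBox)")
            }
        }
        request.imageCropAndScaleOption = .scaleFill

        try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])
    }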

Use Cases

Multiple object detection, object localization

Variants

    Model Name | Size
    YOLOv3.mlmodel | 248.4MB
    YOLOv3FP16.mlmodel | 124.2MB
    YOLOv3Int8LUT.mlmodel | 62.2MB
    YOLOv3Tiny.mlmodel | 35.4MB
    YOLOv3TinyFP16.mlmodel | 17.7MB
    YOLOv3TinyInt8LUT.mlmodel | 8.9MB
