Core ML Models

Build intelligence into your apps using machine learning models from the research community designed for Core ML.

Models are in Core ML format and can be integrated into Xcode projects. You can select from different versions of each model to optimize for size and architecture.
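
For example, a downloaded package can be compiled and loaded at runtime with the Core ML framework (adding the file to an Xcode project and using the class Xcode generates works just as well). A minimal Swift sketch, with a placeholder file URL standing in for wherever the app stored the download:

    import CoreML

    // Compile the downloaded .mlpackage (or .mlmodel) into an .mlmodelc bundle,
    // then load it with a configuration that picks the compute units to use.
    func loadModel(from packageURL: URL) throws -> MLModel {
        let configuration = MLModelConfiguration()
        configuration.computeUnits = .all   // CPU, GPU, and Neural Engine as available

        let compiledURL = try MLModel.compileModel(at: packageURL)
        return try MLModel(contentsOf: compiledURL, configuration: configuration)
    }

    // Usage (placeholder path):
    // let model = try loadModel(from: URL(fileURLWithPath: "/path/to/FastViTT8F16.mlpackage"))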

FastViT

Image Classification

A Fast Hybrid Vision Transformer architecture trained to classify the dominant object in a camera frame or image.


Model Info

Summary

FastViT is a general-purpose, hybrid vision transformer model, trained on the ImageNet dataset, that provides a state-of-the-art accuracy/latency trade-off.

The model's high performance, low latency, and robustness against out-of-distribution samples result from three novel architectural strategies:

  • Structural reparameterization
  • Linear training-time overparameterization
  • Use of large kernel convolutions

FastViT consistently outperforms competing robust architectures on mobile and desktop GPU platforms across a wide range of computer vision tasks such as image classification, object detection, semantic segmentation, and 3D mesh regression.
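
As a sketch of how the classification variants can be used from Swift, the snippet below runs a frame through Vision, assuming the package has been added to an Xcode project so that a class named FastViTT8F16 is generated (the class name follows the file name and is an assumption here):

    import CoreML
    import Vision

    // Classify the dominant object in a CGImage using the Vision framework.
    func classifyDominantObject(in image: CGImage) throws {
        // FastViTT8F16 is the class Xcode is expected to generate from
        // FastViTT8F16.mlpackage (assumed name).
        let visionModel = try VNCoreMLModel(for: FastViTT8F16(configuration: MLModelConfiguration()).model)

        let request = VNCoreMLRequest(model: visionModel) { request, _ in
            // Classifier models come back as ranked class labels.
            guard let results = request.results as? [VNClassificationObservation],
                  let top = results.first else { return }
            print("Top label: \(top.identifier) (\(top.confidence))")
        }
        request.imageCropAndScaleOption = .centerCrop

        try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])
    }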

Use Cases

Image classification, object detection, semantic segmentation, 3D mesh regression

Variants

    Model Name | Size
    FastViTMA36F16.mlpackage | 88.3MB
    FastViTT8F16.mlpackage | 8.2MB
    FastViTMA36F16Headless.mlpackage | 85.8MB
    FastViTT8F16Headless.mlpackage | 6.5MB

    Variant | Parameters | Size (MB) | Weight Precision | Activation Precision
    T8 | 3.6M | 7.8 | Float16 | Float16
    MA36 | 42.7M | 84 | Float16 | Float16

Inference Time

    Variant | Device | OS | Inference Time (ms) | Compute Unit
    T8 F16 | iPhone 16 Pro | 18.3 | 0.52 | All
    T8 F16 | iPhone 15 Pro Max | 17.6 | 0.67 | All
    T8 F16 | iPhone 15 Plus | 17.6 | 0.73 | All
    T8 F16 | iPhone 14 Plus | 17.6 | 0.82 | All
    T8 F16 | iPhone 13 Pro Max | 17.6 | 0.83 | All
    T8 F16 | MacBook Pro M3 Max | 14.4 | 0.62 | All
    MA36 F16 | iPhone 16 Pro | 18.3 | 2.78 | All
    MA36 F16 | iPhone 15 Pro Max | 17.6 | 3.33 | All
    MA36 F16 | iPhone 15 Plus | 17.6 | 3.47 | All
    MA36 F16 | iPhone 14 Plus | 17.6 | 4.56 | All
    MA36 F16 | iPhone 13 Pro Max | 17.6 | 4.47 | All
    MA36 F16 | MacBook Pro M2 Max | 15.0 | 2.94 | All
    MA36 F16 | MacBook Pro M1 Max | 15.0 | 4 | All
    MA36 F16 | iPad Pro 5th Gen | 17.5 | 3.35 | All

Depth Anything V2

Depth Estimation

The Depth Anything model performs monocular depth estimation.


Model Info

Summary

Depth Anything v2 is a foundation model for monocular depth estimation. It maintains the strengths and rectifies the weaknesses of the original Depth Anything by refining the powerful data curation engine and teacher-student pipeline.

To train a teacher model, Depth Anything v2 uses purely synthetic, computer-generated images. This avoids problems created by using real images, whose noisy annotations and low resolution can limit monocular depth-estimation performance. The teacher model then predicts depth for unlabeled real images, and only this new, pseudo-labeled data is used to train a student model, which helps avoid a distribution shift between synthetic and real images.

On the depth estimation task, Depth Anything v2 outperforms v1, especially in robustness, inference speed, and the handling of image depth properties such as fine-grained details, transparent objects, reflections, and complex scenes. Its refined data curation approach yields competitive performance on standard datasets (including KITTI, NYU-D, Sintel, ETH3D, and DIODE) and a more than 9% accuracy improvement over v1 and other community models on the new DA-2K evaluation set built for depth estimation.

Depth Anything v2 is offered at a range of model scales and inference efficiencies to support a wide variety of applications, and it generalizes well when fine-tuned for downstream tasks. It can be used in any application requiring depth estimation, such as 3D reconstruction, navigation, autonomous driving, and image or video generation.
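
A minimal Swift sketch of requesting a depth map through Vision, assuming the package is added to Xcode as a generated class named DepthAnythingV2SmallF16 and that the depth map is exposed as the model's only output (both assumptions; check the package's output description in Xcode):

    import CoreML
    import CoreVideo
    import Vision

    // Estimate depth for a CGImage and inspect the returned depth map.
    func estimateDepth(for image: CGImage) throws {
        let visionModel = try VNCoreMLModel(
            for: DepthAnythingV2SmallF16(configuration: MLModelConfiguration()).model)  // assumed class name

        let request = VNCoreMLRequest(model: visionModel) { request, _ in
            // An image-typed output arrives as a pixel buffer; a tensor-typed
            // output arrives as an MLMultiArray-backed feature value.
            if let depth = (request.results?.first as? VNPixelBufferObservation)?.pixelBuffer {
                print("Depth map: \(CVPixelBufferGetWidth(depth)) x \(CVPixelBufferGetHeight(depth))")
            } else if let array = (request.results?.first as? VNCoreMLFeatureValueObservation)?
                        .featureValue.multiArrayValue {
                print("Depth array shape: \(array.shape)")
            }
        }
        request.imageCropAndScaleOption = .scaleFill

        try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])
    }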

Use Cases

Depth estimation, semantic segmentation

Variants

    Model Name | Size
    DepthAnythingV2SmallF16.mlpackage | 49.8MB
    DepthAnythingV2SmallF16P6.mlpackage | 19MB

    Variant | Parameters | Size (MB) | Weight Precision | Activation Precision
    F32 | 24.8M | 99.2 | Float32 | Float32
    F16 | 24.8M | 49.8 | Float16 | Float16

Inference Time

    Variant | Device | OS | Inference Time (ms) | Compute Unit
    Small F16 | iPhone 16 Pro | 18.3 | 26.21 | All
    Small F16 | iPhone 15 Pro Max | 17.4 | 33.90 | All
    Small F16 | MacBook Pro M1 Max | 15.0 | 33.48 | All
    Small F16 | MacBook Pro M1 Max | 15.0 | 32.78 | GPU

DETR Resnet50 Semantic Segmentation

Semantic Segmentation

The DEtection TRansformer (DETR) model, trained for object detection and panoptic segmentation, configured to return semantic segmentation masks.


Model Info

Summary

The DETR model is an encoder/decoder transformer with a convolutional backbone trained on the COCO 2017 dataset. It blends a set of proven ML strategies to detect and classify objects in images more elegantly than standard object detectors can, while matching their performance.

The model is trained with a loss function that performs bipartite matching between predicted and ground-truth objects. At inference time, DETR applies self-attention to an image globally to predict all objects at once. Thanks to global attention, the model outperforms standard object detectors on large objects but underperforms on small objects. Despite this limitation, DETR demonstrates accuracy and run-time performance on par with other highly optimized architectures when evaluated on the challenging COCO dataset.

DETR can be easily reproduced in any framework that provides standard CNN and transformer classes. It also generalizes readily to more complex tasks, such as panoptic segmentation, which only requires a simple segmentation head trained on top of a pre-trained DETR.

DETR avoids clunky surrogate tasks and hand-designed components that traditional architectures require to achieve acceptable performance and instead provides a conceptually simple, easily reproducible approach that streamlines the object detection pipeline.
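
A hedged Swift sketch of reading the segmentation output through Vision, assuming an Xcode-generated class named DETRResnet50SemanticSegmentationF16 and that the mask is returned as a multi-array of per-pixel class indices (output name and layout should be confirmed against the package description):

    import CoreML
    import Vision

    // Run semantic segmentation and inspect the per-pixel class mask.
    func segment(_ image: CGImage) throws {
        let visionModel = try VNCoreMLModel(
            for: DETRResnet50SemanticSegmentationF16(configuration: MLModelConfiguration()).model)  // assumed class name

        let request = VNCoreMLRequest(model: visionModel) { request, _ in
            // Non-classifier outputs are surfaced as feature-value observations.
            guard let mask = (request.results?.first as? VNCoreMLFeatureValueObservation)?
                    .featureValue.multiArrayValue else { return }
            // Assumed to be a [height, width] array of class indices.
            print("Segmentation mask shape: \(mask.shape)")
        }

        try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])
    }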

Use Cases

Object detection, panoptic segmentation

Variants

    Model Name | Size
    DETRResnet50SemanticSegmentationF16.mlpackage | 85.5MB
    DETRResnet50SemanticSegmentationF16P8.mlpackage | 43.1MB

    Variant | Parameters | Size (MB) | Weight Precision | Activation Precision
    F32 | 43M | 171 | Float32 | Float32
    F16 | 43M | 86 | Float16 | Float16

Inference Time

    Variant | Device | OS | Inference Time (ms) | Compute Unit
    F16 | iPhone 16 Pro | 18.3 | 34.32 | All
    F16 | iPhone 15 Pro Max | 17.6 | 39 | All
    F16 | iPhone 15 Plus | 17.6 | 43 | All
    F16 | iPhone 14 Plus | 17.6 | 50 | All
    F16 | iPhone 14 | 17.5 | 51 | All
    F16 | iPhone 13 Pro Max | 17.6 | 51 | All
    F16 | MacBook Pro M1 Max | 15.0 | 117 | All
    F16 | MacBook Pro M1 Max | 15.0 | 43 | GPU
    F16P8 | iPhone 16 Pro | 18.3 | 32.23 | All
    F16P8 | iPhone 15 Plus | 18.0 | 40.73 | All
    F16P8 | iPhone 13 Pro Max | 17.6 | 51.53 | All
    F16P8 | MacBook Pro M1 Max | 15.0 | 36.52 | All
    F16P8 | MacBook Pro M1 Max | 15.0 | 33.14 | GPU
    F16P8 | iPad Pro 5th Generation | 18.0 | 62.49 | All
    F16P8 | iPad Pro 4th Generation | 18.0 | 1224 | All

BERT-SQuAD

Question Answering

Find answers to questions about paragraphs of text.


Model Info

Summary

BERT (Bidirectional Encoder Representations from Transformers) is a language representation model that uses fine-tuning-based approaches to apply pre-trained representations to downstream NLP tasks. In the case of BERT-SQuAD, the downstream NLP task is context-based Question Answering.

BERT's multilayer, bidirectional transformer encoder architecture is used across both pre-training and fine-tuning steps. BERT-SQuAD adapts it for extracting precise answers given a question and a related context from the Stanford Question Answering Dataset (SQuAD).

BERT is pre-trained on the BooksCorpus and English Wikipedia text passages using two unsupervised pre-training tasks. It uses a masked language model task to pre-train a deep, bidirectional self-attention transformer and a next-sentence prediction task to jointly pre-train text-pair representations that are conditioned on both left and right context in all layers.

For fine-tuning, BERT-SQuAD is initialized with the parameters obtained during pre-training. Then, all of the parameters are fine-tuned using labeled data from the Stanford Question Answering Dataset.

In general, fine-tuning BERT to your specific NLP task is straightforward and inexpensive: all token-level and sentence-level task-specific models in the BERT paper were formed by incorporating BERT with just one additional output layer.
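
As a sketch of what a single prediction looks like in Swift: the app is responsible for WordPiece-tokenizing the question and context, and for mapping the predicted answer span back to the original text. The input names below ("wordIDs", "wordTypes") are assumptions to verify against the model description in Xcode.

    import CoreML

    // Run one question-answering prediction on pre-tokenized input.
    func predictAnswerSpan(wordIDs: MLMultiArray, wordTypes: MLMultiArray, model: MLModel) throws {
        let input = try MLDictionaryFeatureProvider(dictionary: [
            "wordIDs": MLFeatureValue(multiArray: wordIDs),     // assumed input name
            "wordTypes": MLFeatureValue(multiArray: wordTypes)  // assumed input name
        ])
        let output = try model.prediction(from: input)
        // The outputs are expected to score candidate start and end positions of
        // the answer within the context tokens.
        print("Output features: \(output.featureNames)")
    }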

Use Cases

Question answering

Variants

    Model Name | Size
    BERTSQUADFP16.mlmodel | 217.8MB

DeepLabv3

Image Segmentation

Segment the pixels of a camera frame or image into a predefined set of classes.


Model Info

Summary

DeepLabv3 is a multi-module deep learning architecture for semantic image segmentation, a task that involves assigning a category label to each pixel in an image. Much like its previous versions, DeepLabv3 uses atrous (dilated) convolution for this task, which explicitly adjusts the filter's field-of-view and controls the resolution of the feature responses computed by Deep Convolutional Neural Networks (DCNNs).

DeepLabv3’s modules employ atrous convolution in parallel to capture multi-scale context by adopting multiple atrous rates. This technique helps to more accurately segment objects at varying scales. The system also uses an augmented version of Atrous Spatial Pyramid Pooling, a module that probes convolutional features at multiple scales with image-level features. This technique encodes global context and further boosts performance.

The DeepLabv3 system significantly improves performance over previous DeepLab versions, without needing to use DenseCRF post-processing to improve the localization of object boundaries. More generally, every version of the DeepLab system addresses the two other classic problems that come from using DCNNs for semantic image segmentation: reduced feature resolution and the existence of objects at multiple scales.
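
To make the output concrete, the sketch below tallies how many pixels fall into each class, assuming the prediction has already produced a two-dimensional multi-array of class indices (for example via a Vision request like the ones above); that layout is an assumption to verify against the model description.

    import CoreML

    // Count how many pixels were assigned to each class in a [height, width]
    // MLMultiArray of class indices.
    func classHistogram(from mask: MLMultiArray) -> [Int: Int] {
        var counts: [Int: Int] = [:]
        let height = mask.shape[0].intValue
        let width = mask.shape[1].intValue
        for y in 0..<height {
            for x in 0..<width {
                let label = mask[[NSNumber(value: y), NSNumber(value: x)]].intValue
                counts[label, default: 0] += 1
            }
        }
        return counts
    }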

Use Cases

Semantic segmentation

Variants

    Model Name | Size
    DeepLabV3.mlmodel | 8.6MB
    DeepLabV3FP16.mlmodel | 4.3MB
    DeepLabV3Int8LUT.mlmodel | 2.3MB

MNIST

Drawing Classification

Classify a single handwritten digit (supports digits 0-9).


Model Info

Summary

The Modified National Institute of Standards and Technology (MNIST) dataset is a collection of handwritten digits from zero to nine, designed for training and testing image classification systems. It includes 60,000 training images and 10,000 testing images, all of which are grayscale and 28×ばつ28 pixels in size.

The MNIST Classifier model in this gallery was trained using the same technique as Turi Create's Drawing Classifier toolkit: it is a convolutional neural network (CNN) consisting of three convolutions with Rectified Linear Unit (ReLU) activations, followed by max pooling, and two fully connected layers at the end.
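
A small Swift sketch of preparing a user's drawing for this model: Core ML can convert a CGImage to match the model's 28 x 28 grayscale input constraint. The input name "image" is an assumption; the real name appears in the model description in Xcode.

    import CoreML
    import CoreGraphics

    enum DigitInputError: Error { case missingImageInput }

    // Build a model input directly from a CGImage of the drawing.
    func digitInput(for drawing: CGImage, model: MLModel) throws -> MLFeatureProvider {
        // "image" is an assumed input name.
        guard let constraint = model.modelDescription
                .inputDescriptionsByName["image"]?.imageConstraint else {
            throw DigitInputError.missingImageInput
        }
        // Core ML scales and converts the drawing to satisfy the constraint.
        let value = try MLFeatureValue(cgImage: drawing, constraint: constraint, options: nil)
        return try MLDictionaryFeatureProvider(dictionary: ["image": value])
    }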

Use Cases

Digit classification, digit recognition, image classification

Variants

    Model Name | Size
    MNISTClassifier.mlmodel | 395KB

MobileNetV2

Image Classification

The MobileNetv2 architecture trained to classify the dominant object in a camera frame or image.


Model Info

Summary

MobileNetV2 is a neural network architecture for image classification, designed specifically to reduce computation and memory use on resource-constrained systems such as mobile phones.

The network uses a sequence of convolution, batch normalization, and a Rectified Linear Unit (ReLU) as bookends to its novel module, which is made up of a series of residual bottleneck layers. It follows this sequence with adaptive average pooling and a linear classifier head. The new module takes in a low-dimensional compressed feature map of the image, expands it to a high dimension with point-wise convolution, filters it with a depth-wise convolution, and projects the resulting features back to a low-dimensional representation with a linear convolution. This new architecture helps the network strike an optimal balance between accuracy and performance.

MobileNetV2 improves Top-1 accuracy over MobileNetV1 on the ImageNet classification task, with fewer parameters and less CPU usage. On the COCO object detection task, MobileNetV2 also achieves Mean Average Precision (mAP) competitive with YOLOv2 and MobileNetV1, with significantly fewer parameters and lower computational complexity.
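
As an alternative to routing through Vision, the Xcode-generated interface can be called directly. The sketch below assumes a generated class named MobileNetV2 whose input is an image named image and whose output includes a classLabel string; these names match the usual convention for this bundled classifier but should be confirmed in Xcode.

    import CoreML
    import CoreVideo

    // Direct prediction through the Xcode-generated class (assumed names).
    func classify(_ pixelBuffer: CVPixelBuffer) throws {
        let configuration = MLModelConfiguration()
        configuration.computeUnits = .all   // or .cpuAndGPU / .cpuOnly
        let model = try MobileNetV2(configuration: configuration)

        let output = try model.prediction(image: pixelBuffer)
        print("Predicted label: \(output.classLabel)")
    }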

Use Cases

Object classification, object detection, semantic segmentation

Variants

    Model Name | Size
    MobileNetV2.mlmodel | 24.7MB
    MobileNetV2FP16.mlmodel | 12.4MB
    MobileNetV2Int8LUT.mlmodel | 6.3MB

ResNet-50

Image Classification

A Residual Neural Network trained to classify the dominant object in a camera frame or image.


Model Info

Summary

ResNet-50 is a residual learning framework that accelerates the training of deep neural networks. It is pre-trained on the ImageNet dataset, which contains 1000 diverse object categories, to learn general features that are useful for a range of image classification tasks.

Deep neural networks are difficult to train with traditional architectures because training slows as networks get deeper, and many more training epochs are needed to minimize the loss. ResNet-50 addresses this challenge by framing it as a residual learning problem. It groups layers of the network into modular collections called residual blocks and uses links called shortcut connections to funnel data both through and around these blocks. Each layer within a residual block passes its output forward to the next layer as usual, while the shortcut connection adds the block's original input to the block's output before passing the combination on as input to the next block.

The residual blocks and shortcut connections in the ResNet-50 architecture accelerate training and minimize complexity. A network built with this architecture won first place in the image classification task at the 2015 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). It also achieved 83.8% mean average precision (mAP) on the PASCAL VOC 2012 test set, ten points higher than the previous state-of-the-art result in 2015.
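
Classifier models like this one typically expose a label-to-probability dictionary alongside the top label; a short sketch of ranking the five most probable classes from such a dictionary (the property name classLabelProbs is an assumption to check against the generated interface):

    // Rank the five most probable labels from a classifier's
    // label-to-probability dictionary (e.g. output.classLabelProbs).
    func topFive(from classLabelProbs: [String: Double]) -> [(label: String, probability: Double)] {
        return classLabelProbs
            .sorted { 0ドル.value > 1ドル.value }
            .prefix(5)
            .map { (label: 0ドル.key, probability: 0ドル.value) }
    }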

Use Cases

Object detection, semantic segmentation, image recognition, image localization

Variants

    Model Name | Size
    Resnet50.mlmodel | 102.6MB
    Resnet50FP16.mlmodel | 51.3MB
    Resnet50Int8LUT.mlmodel | 25.8MB
    Resnet50Headless.mlmodel | 94.4MB

Updatable Drawing Classifier

Drawing Classification

A drawing classifier that learns to recognize new drawings, based on a K-Nearest Neighbors (KNN) model.


Model Info

Summary

The Updatable Drawing Classifier is a model that can be used to train a simple drawing or sketch classifier based on user examples. The model is a pipeline composed of a drawing-embedding model used as a feature extractor, and a nearest-neighbor classifier operating on the embeddings.

The embedding model takes in a 28x28 grayscale image and outputs a 128-dimensional float vector. The nearest-neighbor classifier takes that vector as input and predicts a label for it based on three nearest neighbors, along with a probability for that label.
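
Personalization happens on device with an MLUpdateTask. The sketch below is a minimal outline: it assumes the model's training inputs are an image feature named "drawing" and a string named "label" (assumed names; the real ones are listed in the model's update description), and it leaves persisting the updated model to disk out of scope.

    import CoreML

    // Update the compiled classifier in place with user-provided examples.
    func personalize(compiledModelURL: URL,
                     examples: [(drawing: MLFeatureValue, label: String)]) throws {
        // One feature provider per training example; feature names are assumptions.
        let providers: [MLFeatureProvider] = try examples.map {
            try MLDictionaryFeatureProvider(dictionary: [
                "drawing": 0ドル.drawing,
                "label": MLFeatureValue(string: 0ドル.label)
            ])
        }

        let task = try MLUpdateTask(forModelAt: compiledModelURL,
                                    trainingData: MLArrayBatchProvider(array: providers),
                                    configuration: nil,
                                    completionHandler: { context in
            // context.model is the updated in-memory model, ready for predictions;
            // writing it back to disk is omitted here.
            print("Update finished in state: \(context.task.state.rawValue)")
        })
        task.resume()
    }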

Use Cases

Drawing classification

Variants

    Model Name | Size
    UpdatableDrawingClassifier.mlmodel | 382KB

YOLOv3

Object Detection

Locate and classify 80 different types of objects present in a camera frame or image.


Model Info

Summary

YOLOv3 is the third iteration in the You Only Look Once (YOLO) series of object detection algorithms. It provides real-time object detection by using a single neural network to predict the bounding boxes and class probabilities of objects in an image or video.

The original YOLO algorithm introduced a new approach to object detection: it employed batch normalization (BN) and leaky ReLU activations and united the feature extraction and object localization steps, along with the localization and classification heads, in a single network. This single-stage architecture accelerated inference time and set a new state of the art for object detection models when it was first published.

YOLOv3 uses a new feature extractor called DarkNet-53. Inspired by ResNet and Feature Pyramid Networks (FPN), this 53-layer convolutional backbone uses techniques from those architectures, such as skip connections and residual blocks, and feeds three prediction heads that process each image at different spatial compressions. Together, these techniques help the model detect objects of varying scales in an image and maintain performance over a wide range of input resolutions.
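
A Swift sketch of reading detections through Vision, assuming the model is added to Xcode as a generated class named YOLOv3Tiny (assumed name); object detectors with the appropriate metadata surface their results as recognized-object observations with bounding boxes and labels.

    import CoreML
    import Vision

    // Detect and label objects in a CGImage.
    func detectObjects(in image: CGImage) throws {
        let visionModel = try VNCoreMLModel(
            for: YOLOv3Tiny(configuration: MLModelConfiguration()).model)  // assumed class name

        let request = VNCoreMLRequest(model: visionModel) { request, _ in
            guard let detections = request.results as? [VNRecognizedObjectObservation] else { return }
            for detection in detections {
                // boundingBox is in normalized coordinates with the origin at the bottom left.
                let bestLabel = detection.labels.first?.identifier ?? "unknown"
                print("\(bestLabel) at \(detection.boundingBox)")
            }
        }
        request.imageCropAndScaleOption = .scaleFill

        try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])
    }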

Use Cases

Multiple object detection, object localization

Variants

    Model Name | Size
    YOLOv3.mlmodel | 248.4MB
    YOLOv3FP16.mlmodel | 124.2MB
    YOLOv3Int8LUT.mlmodel | 62.2MB
    YOLOv3Tiny.mlmodel | 35.4MB
    YOLOv3TinyFP16.mlmodel | 17.7MB
    YOLOv3TinyInt8LUT.mlmodel | 8.9MB
