TorchVision Object Detection Finetuning Tutorial#
Created On: Dec 14, 2023 | Last Updated: Sep 05, 2025 | Last Verified: Nov 05, 2024
For this tutorial, we will be finetuning a pre-trained Mask R-CNN model on the Penn-Fudan Database for Pedestrian Detection and Segmentation. It contains 170 images with 345 instances of pedestrians, and we will use it to illustrate how to use the new features in torchvision in order to train an object detection and instance segmentation model on a custom dataset.
Note
This tutorial works only with torchvision version >=0.16 or nightly. If you’re using torchvision<=0.15, please follow this tutorial instead.
Defining the Dataset#
The reference scripts for training object detection, instance segmentation and person keypoint detection allow for easily adding new custom datasets. The dataset should inherit from the standard torch.utils.data.Dataset class and implement __len__ and __getitem__.
The only specific requirement is that the dataset __getitem__ should return a tuple:

- image: a torchvision.tv_tensors.Image of shape [3, H, W], a pure tensor, or a PIL Image of size (H, W)
- target: a dict containing the following fields
  - boxes, torchvision.tv_tensors.BoundingBoxes of shape [N, 4]: the coordinates of the N bounding boxes in [x0, y0, x1, y1] format, ranging from 0 to W and 0 to H
  - labels, integer torch.Tensor of shape [N]: the label for each bounding box. 0 always represents the background class.
  - image_id, int: an image identifier. It should be unique among all the images in the dataset, and is used during evaluation
  - area, float torch.Tensor of shape [N]: the area of the bounding box. This is used during evaluation with the COCO metric, to separate the metric scores between small, medium and large boxes.
  - iscrowd, uint8 torch.Tensor of shape [N]: instances with iscrowd=True will be ignored during evaluation.
  - (optionally) masks, torchvision.tv_tensors.Mask of shape [N, H, W]: the segmentation masks for each one of the objects
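To make the expected structure concrete, here is a minimal sketch of a single sample that satisfies this contract. The image size, box coordinates and values are made up for illustration; the PennFudan dataset class written below builds its targets in exactly this shape.

import torch
from torchvision import tv_tensors

# a dummy 3 x H x W image (H=240, W=320) with one hypothetical box
image = tv_tensors.Image(torch.zeros(3, 240, 320, dtype=torch.uint8))
target = {
    "boxes": tv_tensors.BoundingBoxes(
        torch.tensor([[10.0, 20.0, 110.0, 220.0]]),  # [x0, y0, x1, y1]
        format="XYXY",
        canvas_size=(240, 320),                      # (H, W)
    ),
    "labels": torch.tensor([1]),                     # 1 = object class; 0 is reserved for background
    "image_id": 0,
    "area": torch.tensor([(110.0 - 10.0) * (220.0 - 20.0)]),
    "iscrowd": torch.zeros((1,), dtype=torch.uint8),
    # "masks": tv_tensors.Mask(...) is optional and only needed for segmentation
}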
If your dataset is compliant with the above requirements, it will work with both the training and evaluation code from the reference scripts. The evaluation code uses scripts from pycocotools, which can be installed with pip install pycocotools.
Note
For Windows, please install pycocotools from gautamchitnis with the following command:
pip install git+https://github.com/gautamchitnis/cocoapi.git@cocodataset-master#subdirectory=PythonAPI
One note on the labels. The model considers class 0 as background. If your dataset does not contain the background class, you should not have 0 in your labels. For example, assuming you have just two classes, cat and dog, you can define 1 (not 0) to represent cats and 2 to represent dogs. So, for instance, if one of the images has both classes, your labels tensor should look like [1, 2].
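As a small concrete illustration of that convention (the class indices here are arbitrary):

import torch

labels = torch.tensor([1, 2], dtype=torch.int64)  # 1 = cat, 2 = dog; 0 stays reserved for background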
Additionally, if you want to use aspect ratio grouping during training (so that each batch only contains images with similar aspect ratios), then it is recommended to also implement a get_height_and_width method, which returns the height and the width of the image. If this method is not provided, we query all elements of the dataset via __getitem__, which loads the image in memory and is slower than if a custom method is provided.
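As a sketch only (assuming, as in the reference scripts, that the method returns the height first and the width second), such a method for the PennFudan dataset defined below could read just the image header via PIL instead of decoding the full image:

import os
from PIL import Image

# A possible method to add to the dataset class defined in the next section.
# PIL opens files lazily, so .size reads the header without decoding the pixels.
def get_height_and_width(self, idx):
    img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
    width, height = Image.open(img_path).size
    return height, width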
Writing a custom dataset for PennFudan#
Let’s write a dataset for the PennFudan dataset. First, let’s download the dataset and extract the zip file:
wget https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip -P data
cd data && unzip PennFudanPed.zip
We have the following folder structure:
PennFudanPed/
  PedMasks/
    FudanPed00001_mask.png
    FudanPed00002_mask.png
    FudanPed00003_mask.png
    FudanPed00004_mask.png
    ...
  PNGImages/
    FudanPed00001.png
    FudanPed00002.png
    FudanPed00003.png
    FudanPed00004.png
Here is one example of an image with its corresponding segmentation mask:
import matplotlib.pyplot as plt
from torchvision.io import read_image

image = read_image("data/PennFudanPed/PNGImages/FudanPed00046.png")
mask = read_image("data/PennFudanPed/PedMasks/FudanPed00046_mask.png")

plt.figure(figsize=(16, 8))
plt.subplot(121)
plt.title("Image")
plt.imshow(image.permute(1, 2, 0))
plt.subplot(122)
plt.title("Mask")
plt.imshow(mask.permute(1, 2, 0))
<matplotlib.image.AxesImage object at 0x7f29667822c0>
So each image has a corresponding segmentation mask, where each color corresponds to a different instance.
Let’s write a torch.utils.data.Dataset class for this dataset. In the code below, we wrap images, bounding boxes and masks into torchvision.tv_tensors.TVTensor classes so that we can apply torchvision’s built-in transformations (the new Transforms API) for the given object detection and segmentation task. Namely, image tensors will be wrapped by torchvision.tv_tensors.Image, bounding boxes into torchvision.tv_tensors.BoundingBoxes and masks into torchvision.tv_tensors.Mask. Since torchvision.tv_tensors.TVTensor classes are torch.Tensor subclasses, wrapped objects are also tensors and inherit the plain torch.Tensor API. For more information about torchvision tv_tensors see this documentation.
import os
import torch

from torchvision.io import read_image
from torchvision.ops.boxes import masks_to_boxes
from torchvision import tv_tensors
from torchvision.transforms.v2 import functional as F


class PennFudanDataset(torch.utils.data.Dataset):
    def __init__(self, root, transforms):
        self.root = root
        self.transforms = transforms
        # load all image files, sorting them to
        # ensure that they are aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))

    def __getitem__(self, idx):
        # load images and masks
        img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
        mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        img = read_image(img_path)
        mask = read_image(mask_path)
        # instances are encoded as different colors
        obj_ids = torch.unique(mask)
        # first id is the background, so remove it
        obj_ids = obj_ids[1:]
        num_objs = len(obj_ids)

        # split the color-encoded mask into a set
        # of binary masks
        masks = (mask == obj_ids[:, None, None]).to(dtype=torch.uint8)

        # get bounding box coordinates for each mask
        boxes = masks_to_boxes(masks)

        # there is only one class
        labels = torch.ones((num_objs,), dtype=torch.int64)

        image_id = idx
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)

        # Wrap sample and targets into torchvision tv_tensors:
        img = tv_tensors.Image(img)

        target = {}
        target["boxes"] = tv_tensors.BoundingBoxes(boxes, format="XYXY", canvas_size=F.get_size(img))
        target["masks"] = tv_tensors.Mask(masks)
        target["labels"] = labels
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd

        if self.transforms is not None:
            img, target = self.transforms(img, target)

        return img, target

    def __len__(self):
        return len(self.imgs)
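As a quick sanity check (assuming the dataset was extracted to data/PennFudanPed as shown above), you can instantiate the dataset without transforms and inspect one sample:

dataset = PennFudanDataset('data/PennFudanPed', transforms=None)
img, target = dataset[0]
print(img.shape)              # torch.Size([3, H, W])
print(sorted(target.keys()))  # ['area', 'boxes', 'image_id', 'iscrowd', 'labels', 'masks']
print(target["boxes"].shape, target["masks"].shape)  # [N, 4] and [N, H, W]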
That’s all for the dataset. Now let’s define a model that can perform predictions on this dataset.
Defining your model#
In this tutorial, we will be using Mask R-CNN, which is built on top of Faster R-CNN. Faster R-CNN is a model that predicts both bounding boxes and class scores for potential objects in the image.

[Figure: ../_static/img/tv_tutorial/tv_image03.png — Faster R-CNN]

Mask R-CNN adds an extra branch to Faster R-CNN, which also predicts segmentation masks for each instance.

[Figure: ../_static/img/tv_tutorial/tv_image04.png — Mask R-CNN]

There are two common situations where one might want to modify one of the available models in the TorchVision Model Zoo. The first is when we want to start from a pre-trained model and just finetune the last layer. The other is when we want to replace the backbone of the model with a different one (for faster predictions, for example).
Let’s see how we would do one or the other in the following sections.
1 - Finetuning from a pretrained model#
Let’s suppose that you want to start from a model pre-trained on COCO and want to finetune it for your particular classes. Here is a possible way of doing it:
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# load a model pre-trained on COCO
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# replace the classifier with a new one, that has
# num_classes which is user-defined
num_classes = 2  # 1 class (person) + background
# get number of input features for the classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features
# replace the pre-trained head with a new one
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
Downloading: "https://download.pytorch.org/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth" to /var/lib/ci-user/.cache/torch/hub/checkpoints/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth 0%| | 0.00/160M [00:00<?, ?B/s] 26%|██▌ | 41.1M/160M [00:00<00:00, 431MB/s] 53%|█████▎ | 84.0M/160M [00:00<00:00, 441MB/s] 79%|███████▉ | 127M/160M [00:00<00:00, 444MB/s] 100%|██████████| 160M/160M [00:00<00:00, 444MB/s]
2 - Modifying the model to add a different backbone#
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

# load a pre-trained model for classification and return
# only the features
backbone = torchvision.models.mobilenet_v2(weights="DEFAULT").features
# ``FasterRCNN`` needs to know the number of
# output channels in a backbone. For mobilenet_v2, it's 1280
# so we need to add it here
backbone.out_channels = 1280

# let's make the RPN generate 5 x 3 anchors per spatial
# location, with 5 different sizes and 3 different aspect
# ratios. We have a Tuple[Tuple[int]] because each feature
# map could potentially have different sizes and
# aspect ratios
anchor_generator = AnchorGenerator(
    sizes=((32, 64, 128, 256, 512),),
    aspect_ratios=((0.5, 1.0, 2.0),)
)

# let's define what are the feature maps that we will
# use to perform the region of interest cropping, as well as
# the size of the crop after rescaling.
# if your backbone returns a Tensor, featmap_names is expected to
# be [0]. More generally, the backbone should return an
# ``OrderedDict[Tensor]``, and in ``featmap_names`` you can choose which
# feature maps to use.
roi_pooler = torchvision.ops.MultiScaleRoIAlign(
    featmap_names=['0'],
    output_size=7,
    sampling_ratio=2
)

# put the pieces together inside a Faster-RCNN model
model = FasterRCNN(
    backbone,
    num_classes=2,
    rpn_anchor_generator=anchor_generator,
    box_roi_pool=roi_pooler
)
Downloading: "https://download.pytorch.org/models/mobilenet_v2-7ebf99e0.pth" to /var/lib/ci-user/.cache/torch/hub/checkpoints/mobilenet_v2-7ebf99e0.pth 0%| | 0.00/13.6M [00:00<?, ?B/s] 100%|██████████| 13.6M/13.6M [00:00<00:00, 363MB/s]
Object detection and instance segmentation model for PennFudan Dataset#
In our case, we want to finetune from a pre-trained model, given that our dataset is very small, so we will be following approach number 1.
Here we want to also compute the instance segmentation masks, so we will be using Mask R-CNN:
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor


def get_model_instance_segmentation(num_classes):
    # load an instance segmentation model pre-trained on COCO
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

    # get number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    # now get the number of input features for the mask classifier
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    # and replace the mask predictor with a new one
    model.roi_heads.mask_predictor = MaskRCNNPredictor(
        in_features_mask,
        hidden_layer,
        num_classes
    )

    return model
That’s it. This makes the model ready to be trained and evaluated on your custom dataset.
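For example, here is an optional quick check (not part of the training script) that builds the model for our two classes and confirms that the replaced heads now predict two classes; cls_score is the same attribute accessed above, and mask_fcn_logits is the final layer of torchvision's MaskRCNNPredictor:

model = get_model_instance_segmentation(num_classes=2)
print(model.roi_heads.box_predictor.cls_score.out_features)         # 2 (background + person)
print(model.roi_heads.mask_predictor.mask_fcn_logits.out_channels)  # 2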
Putting everything together#
In references/detection/, we have a number of helper functions to simplify training and evaluating detection models. Here, we will use references/detection/engine.py and references/detection/utils.py. Just download everything under references/detection to your folder and use them here. On Linux, if you have wget, you can download them using the commands below:
os.system("wget https://raw.githubusercontent.com/pytorch/vision/main/references/detection/engine.py") os.system("wget https://raw.githubusercontent.com/pytorch/vision/main/references/detection/utils.py") os.system("wget https://raw.githubusercontent.com/pytorch/vision/main/references/detection/coco_utils.py") os.system("wget https://raw.githubusercontent.com/pytorch/vision/main/references/detection/coco_eval.py") os.system("wget https://raw.githubusercontent.com/pytorch/vision/main/references/detection/transforms.py")
Since v0.15.0, torchvision provides the new Transforms API for easily writing data augmentation pipelines for object detection and segmentation tasks.
Let’s write some helper functions for data augmentation / transformation:
from torchvision.transforms import v2 as T


def get_transform(train):
    transforms = []
    if train:
        transforms.append(T.RandomHorizontalFlip(0.5))
    transforms.append(T.ToDtype(torch.float, scale=True))
    transforms.append(T.ToPureTensor())
    return T.Compose(transforms)
Testing forward() method (Optional)#
Before iterating over the dataset, it’s good to see what the model expects during training and inference time on sample data.
import utils

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
dataset = PennFudanDataset('data/PennFudanPed', get_transform(train=True))
data_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=2,
    shuffle=True,
    collate_fn=utils.collate_fn
)

# For Training
images, targets = next(iter(data_loader))
images = list(image for image in images)
targets = [{k: v for k, v in t.items()} for t in targets]
output = model(images, targets)  # Returns losses and detections
print(output)

# For inference
model.eval()
x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
predictions = model(x)  # Returns predictions
print(predictions[0])
{'loss_classifier': tensor(0.1095, grad_fn=<NllLossBackward0>), 'loss_box_reg': tensor(0.0877, grad_fn=<DivBackward0>), 'loss_objectness': tensor(0.0248, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'loss_rpn_box_reg': tensor(0.0054, grad_fn=<DivBackward0>)} {'boxes': tensor([], size=(0, 4), grad_fn=<StackBackward0>), 'labels': tensor([], dtype=torch.int64), 'scores': tensor([], grad_fn=<IndexBackward0>)}
We want to be able to train our model on an accelerator such as CUDA, MPS, MTIA, or XPU. Let’s now write the main function which performs the training and the validation:
from engine import train_one_epoch, evaluate

# train on the accelerator or on the CPU, if an accelerator is not available
device = torch.accelerator.current_accelerator() if torch.accelerator.is_available() else torch.device('cpu')

# our dataset has two classes only - background and person
num_classes = 2
# use our dataset and defined transformations
dataset = PennFudanDataset('data/PennFudanPed', get_transform(train=True))
dataset_test = PennFudanDataset('data/PennFudanPed', get_transform(train=False))

# split the dataset in train and test set
indices = torch.randperm(len(dataset)).tolist()
dataset = torch.utils.data.Subset(dataset, indices[:-50])
dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])

# define training and validation data loaders
data_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=2,
    shuffle=True,
    collate_fn=utils.collate_fn
)

data_loader_test = torch.utils.data.DataLoader(
    dataset_test,
    batch_size=1,
    shuffle=False,
    collate_fn=utils.collate_fn
)

# get the model using our helper function
model = get_model_instance_segmentation(num_classes)

# move model to the right device
model.to(device)

# construct an optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(
    params,
    lr=0.005,
    momentum=0.9,
    weight_decay=0.0005
)

# and a learning rate scheduler
lr_scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer,
    step_size=3,
    gamma=0.1
)

# let's train it just for 2 epochs
num_epochs = 2

for epoch in range(num_epochs):
    # train for one epoch, printing every 10 iterations
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
    # update the learning rate
    lr_scheduler.step()
    # evaluate on the test dataset
    evaluate(model, data_loader_test, device=device)

print("That's it!")
Downloading: "https://download.pytorch.org/models/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth" to /var/lib/ci-user/.cache/torch/hub/checkpoints/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth 0%| | 0.00/170M [00:00<?, ?B/s] 24%|██▍ | 41.2M/170M [00:00<00:00, 432MB/s] 50%|████▉ | 84.2M/170M [00:00<00:00, 443MB/s] 75%|███████▍ | 127M/170M [00:00<00:00, 446MB/s] 100%|██████████| 170M/170M [00:00<00:00, 446MB/s] /var/lib/workspace/intermediate_source/engine.py:30: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead. Epoch: [0] [ 0/60] eta: 0:00:40 lr: 0.000090 loss: 5.0280 (5.0280) loss_classifier: 0.5262 (0.5262) loss_box_reg: 0.3641 (0.3641) loss_mask: 4.1229 (4.1229) loss_objectness: 0.0113 (0.0113) loss_rpn_box_reg: 0.0035 (0.0035) time: 0.6791 data: 0.0201 max mem: 2264 Epoch: [0] [10/60] eta: 0:00:12 lr: 0.000936 loss: 1.9412 (2.6945) loss_classifier: 0.3960 (0.3694) loss_box_reg: 0.3132 (0.2784) loss_mask: 1.2489 (2.0106) loss_objectness: 0.0205 (0.0305) loss_rpn_box_reg: 0.0035 (0.0056) time: 0.2584 data: 0.0175 max mem: 2432 Epoch: [0] [20/60] eta: 0:00:09 lr: 0.001783 loss: 0.8988 (1.7432) loss_classifier: 0.1887 (0.2645) loss_box_reg: 0.2292 (0.2586) loss_mask: 0.4047 (1.1882) loss_objectness: 0.0203 (0.0255) loss_rpn_box_reg: 0.0055 (0.0064) time: 0.2167 data: 0.0170 max mem: 2434 Epoch: [0] [30/60] eta: 0:00:06 lr: 0.002629 loss: 0.6437 (1.3382) loss_classifier: 0.1098 (0.2052) loss_box_reg: 0.2099 (0.2372) loss_mask: 0.2174 (0.8681) loss_objectness: 0.0155 (0.0215) loss_rpn_box_reg: 0.0055 (0.0063) time: 0.2126 data: 0.0159 max mem: 2455 Epoch: [0] [40/60] eta: 0:00:04 lr: 0.003476 loss: 0.4007 (1.1222) loss_classifier: 0.0584 (0.1698) loss_box_reg: 0.1603 (0.2237) loss_mask: 0.1785 (0.7049) loss_objectness: 0.0057 (0.0177) loss_rpn_box_reg: 0.0038 (0.0061) time: 0.2043 data: 0.0152 max mem: 2617 Epoch: [0] [50/60] eta: 0:00:02 lr: 0.004323 loss: 0.3509 (0.9808) loss_classifier: 0.0424 (0.1462) loss_box_reg: 0.1137 (0.2100) loss_mask: 0.1857 (0.6034) loss_objectness: 0.0028 (0.0154) loss_rpn_box_reg: 0.0038 (0.0059) time: 0.2027 data: 0.0157 max mem: 2617 Epoch: [0] [59/60] eta: 0:00:00 lr: 0.005000 loss: 0.3348 (0.8915) loss_classifier: 0.0424 (0.1325) loss_box_reg: 0.1268 (0.2011) loss_mask: 0.1686 (0.5386) loss_objectness: 0.0028 (0.0135) loss_rpn_box_reg: 0.0039 (0.0058) time: 0.2051 data: 0.0151 max mem: 2617 Epoch: [0] Total time: 0:00:13 (0.2173 s / it) creating index... index created! Test: [ 0/50] eta: 0:00:06 model_time: 0.0967 (0.0967) evaluator_time: 0.0142 (0.0142) time: 0.1230 data: 0.0115 max mem: 2617 Test: [49/50] eta: 0:00:00 model_time: 0.0438 (0.0612) evaluator_time: 0.0042 (0.0073) time: 0.0727 data: 0.0100 max mem: 2617 Test: Total time: 0:00:03 (0.0791 s / it) Averaged stats: model_time: 0.0438 (0.0612) evaluator_time: 0.0042 (0.0073) Accumulating evaluation results... DONE (t=0.01s). Accumulating evaluation results... DONE (t=0.01s). 
IoU metric: bbox Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.693 Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.984 Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.861 Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000 Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.487 Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.705 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.269 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.751 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.752 Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000 Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.690 Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.757 IoU metric: segm Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.729 Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.984 Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.908 Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000 Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.465 Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.741 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.285 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.761 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.762 Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000 Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.720 Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.765 Epoch: [1] [ 0/60] eta: 0:00:14 lr: 0.005000 loss: 0.4866 (0.4866) loss_classifier: 0.0831 (0.0831) loss_box_reg: 0.1846 (0.1846) loss_mask: 0.1979 (0.1979) loss_objectness: 0.0067 (0.0067) loss_rpn_box_reg: 0.0144 (0.0144) time: 0.2358 data: 0.0269 max mem: 2617 Epoch: [1] [10/60] eta: 0:00:10 lr: 0.005000 loss: 0.3620 (0.3528) loss_classifier: 0.0384 (0.0446) loss_box_reg: 0.1061 (0.1167) loss_mask: 0.1792 (0.1806) loss_objectness: 0.0016 (0.0029) loss_rpn_box_reg: 0.0060 (0.0080) time: 0.2179 data: 0.0169 max mem: 2617 Epoch: [1] [20/60] eta: 0:00:08 lr: 0.005000 loss: 0.3022 (0.3171) loss_classifier: 0.0344 (0.0421) loss_box_reg: 0.0869 (0.0994) loss_mask: 0.1628 (0.1668) loss_objectness: 0.0007 (0.0021) loss_rpn_box_reg: 0.0040 (0.0067) time: 0.2094 data: 0.0151 max mem: 2617 Epoch: [1] [30/60] eta: 0:00:06 lr: 0.005000 loss: 0.2544 (0.3038) loss_classifier: 0.0344 (0.0405) loss_box_reg: 0.0614 (0.0918) loss_mask: 0.1408 (0.1630) loss_objectness: 0.0010 (0.0023) loss_rpn_box_reg: 0.0040 (0.0063) time: 0.2012 data: 0.0147 max mem: 2617 Epoch: [1] [40/60] eta: 0:00:04 lr: 0.005000 loss: 0.2443 (0.2939) loss_classifier: 0.0335 (0.0383) loss_box_reg: 0.0602 (0.0869) loss_mask: 0.1466 (0.1608) loss_objectness: 0.0008 (0.0019) loss_rpn_box_reg: 0.0050 (0.0059) time: 0.2027 data: 0.0152 max mem: 2744 Epoch: [1] [50/60] eta: 0:00:02 lr: 0.005000 loss: 0.2443 (0.2886) loss_classifier: 0.0328 (0.0379) loss_box_reg: 0.0602 (0.0852) loss_mask: 0.1474 (0.1579) loss_objectness: 0.0007 (0.0018) loss_rpn_box_reg: 0.0043 (0.0058) time: 0.2159 data: 0.0163 max mem: 2744 Epoch: [1] [59/60] eta: 0:00:00 lr: 0.005000 loss: 0.2639 (0.2840) loss_classifier: 0.0319 (0.0380) loss_box_reg: 0.0690 (0.0834) loss_mask: 
0.1397 (0.1553) loss_objectness: 0.0010 (0.0018) loss_rpn_box_reg: 0.0043 (0.0056) time: 0.2128 data: 0.0158 max mem: 2744 Epoch: [1] Total time: 0:00:12 (0.2093 s / it) creating index... index created! Test: [ 0/50] eta: 0:00:03 model_time: 0.0474 (0.0474) evaluator_time: 0.0079 (0.0079) time: 0.0671 data: 0.0114 max mem: 2744 Test: [49/50] eta: 0:00:00 model_time: 0.0392 (0.0418) evaluator_time: 0.0030 (0.0048) time: 0.0569 data: 0.0101 max mem: 2744 Test: Total time: 0:00:02 (0.0572 s / it) Averaged stats: model_time: 0.0392 (0.0418) evaluator_time: 0.0030 (0.0048) Accumulating evaluation results... DONE (t=0.01s). Accumulating evaluation results... DONE (t=0.01s). IoU metric: bbox Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.755 Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.988 Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.901 Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000 Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.585 Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.760 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.303 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.807 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.807 Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000 Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.770 Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.810 IoU metric: segm Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.738 Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.988 Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.930 Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000 Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.396 Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.751 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.290 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.774 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.776 Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000 Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.730 Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.780 That's it!
So after just one epoch of training, we already obtain a COCO-style box mAP of about 69 and a mask mAP of about 73, and both improve further after the second epoch.
But what do the predictions look like? Let’s take one image from the dataset and check.
import matplotlib.pyplot as plt

from torchvision.utils import draw_bounding_boxes, draw_segmentation_masks


image = read_image("data/PennFudanPed/PNGImages/FudanPed00046.png")
eval_transform = get_transform(train=False)

model.eval()
with torch.no_grad():
    x = eval_transform(image)
    # convert RGBA -> RGB and move to device
    x = x[:3, ...].to(device)
    predictions = model([x, ])
    pred = predictions[0]


image = (255.0 * (image - image.min()) / (image.max() - image.min())).to(torch.uint8)
image = image[:3, ...]
pred_labels = [f"pedestrian: {score:.3f}" for label, score in zip(pred["labels"], pred["scores"])]
pred_boxes = pred["boxes"].long()
output_image = draw_bounding_boxes(image, pred_boxes, pred_labels, colors="red")

masks = (pred["masks"] > 0.7).squeeze(1)
output_image = draw_segmentation_masks(output_image, masks, alpha=0.5, colors="blue")


plt.figure(figsize=(12, 12))
plt.imshow(output_image.permute(1, 2, 0))
<matplotlib.image.AxesImage object at 0x7f2964aa1180>
The results look good!
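If you only want to visualize confident detections, one optional tweak (continuing from the snippet above, with an arbitrarily chosen threshold) is to filter the predictions by score before drawing:

score_threshold = 0.8  # arbitrary value, for illustration only
keep = pred["scores"] > score_threshold

filtered_boxes = pred["boxes"][keep].long()
filtered_labels = [f"pedestrian: {s:.3f}" for s in pred["scores"][keep]]
filtered_masks = (pred["masks"][keep] > 0.7).squeeze(1)

output_image = draw_bounding_boxes(image, filtered_boxes, filtered_labels, colors="red")
output_image = draw_segmentation_masks(output_image, filtered_masks, alpha=0.5, colors="blue")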
Wrapping up#
In this tutorial, you have learned how to create your own training
pipeline for object detection models on a custom dataset. For
that, you wrote a torch.utils.data.Dataset
class that returns the
images and the ground truth boxes and segmentation masks. You also
leveraged a Mask R-CNN model pre-trained on COCO train2017 in order to
perform transfer learning on this new dataset.
For a more complete example, which includes multi-machine / multi-GPU
training, check references/detection/train.py
, which is present in
the torchvision repository.
Total running time of the script: (0 minutes 46.636 seconds)