
Inference on video without extracting images #641

Draft
cjaverliat wants to merge 24 commits into facebookresearch:main from cjaverliat:feat/generic-predictor

Conversation

cjaverliat commented May 3, 2025 (edited)

Inference on video without extracting images

This PR proposes SAM2Generic, an alternative to the original SAM2Base that exposes new APIs. Additionally, I added SAM2GenericVideoPredictor, a re-implementation of the video predictor with configurable strategies for memorizing and discarding past memories (cf. here for an example), which solves the issue of keeping everything in VRAM.
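The exact strategy interface is defined in the PR itself (see the linked example); purely as an illustration of the idea, and with hypothetical names that are not the PR's API, a sliding-window removal policy could be a small callable that decides which past frame memories to evict after each step:

# Hypothetical sketch, not the PR's actual API: a removal policy that keeps
# only the memories of the most recent `window_size` frames. The predictor
# would apply such a policy after every frame and free the VRAM held by the
# evicted entries.
from typing import List, Sequence

def sliding_window_removal(memory_frame_indices: Sequence[int],
                           current_frame_idx: int,
                           window_size: int = 7) -> List[int]:
    # Return the frame indices whose memories should be dropped.
    return [i for i in memory_frame_indices
            if current_frame_idx - i >= window_size]

# With memories for frames 0..9 and a window of 7, frames 0, 1 and 2 get evicted.
print(sliding_window_removal(list(range(10)), current_frame_idx=9))  # [0, 1, 2]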

More importantly, this gives more flexibility in how the images are provided: frames are passed as tensors instead of a video URL or pre-extracted individual images:

import cv2
import torch
from tqdm import tqdm
from sam2.sam2_generic_video_predictor import Prompt
from sam2.build_sam import build_sam2_generic_video_predictor

# Select the device used for inference
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

sam2_checkpoint = "../checkpoints/sam2.1_hiera_base_plus.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_b+.yaml"
predictor = build_sam2_generic_video_predictor(model_cfg, sam2_checkpoint, device=device)

cap = cv2.VideoCapture("./videos/bedroom.mp4")
n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
orig_hw = (height, width)

def read_frame(cap) -> torch.Tensor:
    ret, frame = cap.read()
    if not ret:
        return None
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    frame = torch.as_tensor(frame).permute(2, 0, 1).to(device)  # HWC -> CHW
    frame = frame / 255.0  # normalize to [0, 1]
    return frame

# Add a point prompt on the first frame
initial_frame = read_frame(cap)
points_coords = torch.tensor([400.0, 150.0], device=device).reshape((1, 1, 2))
points_labels = torch.tensor([1], device=device).reshape((1, 1))
prompt = Prompt(obj_id=0, points_coords=points_coords, points_labels=points_labels)
results = predictor.forward(frame=initial_frame, object_prompts=[prompt])

# Propagate through the remaining frames
for f in tqdm(range(1, n_frames)):
    frame = read_frame(cap)
    if frame is None:
        break
    results = predictor.forward(frame=frame)

    # Do something with the result, for example:
    # show_mask((results[0].best_mask_logits > 0), plt.gca(), obj_id=0)

cap.release()

The full usage example is available in the generic_video_predictor_example.ipynb notebook.
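
The show_mask call commented out above refers to the plotting helper used in the SAM2 example notebooks; outside the notebook, a minimal version along these lines could be used to overlay the predicted mask (a sketch, not the notebook's exact implementation):

import numpy as np
import matplotlib.pyplot as plt

def show_mask(mask, ax, obj_id=None):
    # Overlay a boolean (H, W) or (1, H, W) mask on a matplotlib axis
    # with a semi-transparent per-object color.
    if hasattr(mask, "cpu"):  # accept torch tensors on any device
        mask = mask.cpu().numpy()
    mask = mask.reshape(mask.shape[-2], mask.shape[-1])
    color = np.array([*plt.cm.tab10(0 if obj_id is None else obj_id % 10)[:3], 0.6])
    ax.imshow(mask[..., None] * color.reshape(1, 1, -1))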

cjaverliat added 19 commits May 3, 2025 19:29
hjj-lmx commented Aug 11, 2025 (edited)

Prompt

Why does my GPU use less than 3 GB? During inference it doesn't seem like the GPU is being used at all: utilization stays at 0 and it's very slow. A four-minute video with over seven thousand frames takes more than forty minutes to process.

