{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import cv2\n", "from pathlib import Path\n", "import numpy as np\n", "# from PIL import Image\n", "import torch\n", "from torchvision.io.video import read_video\n", "import matplotlib.pyplot as plt\n", "from torchvision.utils import draw_bounding_boxes\n", "from torchvision.transforms.functional import to_pil_image\n", "from torchvision.models.detection import retinanet_resnet50_fpn_v2, RetinaNet_ResNet50_FPN_V2_Weights\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "source = Path('../DATASETS/VIRAT_subset_0102x')\n", "videos = source.glob('*.mp4')\n", "homography = list(source.glob('*img2world.txt'))[0]\n", "H = np.loadtxt(homography, delimiter=',')\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The homography matrix helps to transform points from image space to a flat world plane. The `README_homography.txt` from VIRAT describes:\n", "\n", "> Roughly estimated 3-by-3 homographies are included for convenience. 
\n", "> Each homography H provides a mapping from image coordinate to scene-dependent world coordinate.\n", "> \n", "> [xw,yw,zw]' = H*[xi,yi,1]'\n", "> \n", "> xi: horizontal axis on image with left top corner as origin, increases right.\n", "> yi: vertical axis on image with left top corner as origin, increases downward.\n", "> \n", "> xw/zw: world x coordinate\n", "> yw/zw: world y coordiante" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# H.dot(np.array([20,300, 1]))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "video_path = list(videos)[0]\n", "video_path = Path(\"../DATASETS/VIRAT_subset_0102x/VIRAT_S_010200_00_000060_000218.mp4\")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PosixPath('../DATASETS/VIRAT_subset_0102x/VIRAT_S_010200_00_000060_000218.mp4')" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "video_path" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Suggestions from: https://stackabuse.com/retinanet-object-detection-with-pytorch-and-torchvision/" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "device(type='cuda')" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "device" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "weights = RetinaNet_ResNet50_FPN_V2_Weights.DEFAULT\n", "model = retinanet_resnet50_fpn_v2(weights=weights, score_thresh=0.35)\n", "model.to(device)\n", "# Put the model in inference mode\n", "model.eval()\n", "# Get the transforms for the model's weights\n", "preprocess = weights.transforms().to(device)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, 
"outputs": [], "source": [ "# hub.set_dir()" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "video = cv2.VideoCapture(str(video_path))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> The score_thresh argument defines the threshold at which an object is detected as an object of a class. Intuitively, it's the confidence threshold, and we won't classify an object to belong to a class if the model is less than 35% confident that it belongs to a class." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The result from a single prediction coming from `model(batch)` looks like:\n", "\n", "```python\n", "{'boxes': tensor([[5.7001e+02, 2.5786e+02, 6.3138e+02, 3.6970e+02],\n", " [5.0109e+02, 2.4508e+02, 5.5308e+02, 3.4852e+02],\n", " [3.4096e+02, 2.7015e+02, 3.6156e+02, 3.1857e+02],\n", " [5.0219e-01, 3.7588e+02, 9.7911e+01, 7.2000e+02],\n", " [3.4096e+02, 2.7015e+02, 3.6156e+02, 3.1857e+02],\n", " [8.3241e+01, 5.8410e+02, 1.7502e+02, 7.1743e+02]]),\n", " 'scores': tensor([0.8525, 0.6491, 0.5985, 0.4999, 0.3753, 0.3746]),\n", " 'labels': tensor([64, 64, 1, 64, 18, 86])}\n", "```" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Now with SORT tracking\n", "\n", "Using a sort implementation originally by Alex Bewley, but adapted by [Chris Fotache](https://github.com/cfotache/pytorch_objectdetecttrack/blob/master/README.md). make into loop\n", "%matplotlib inline\n", "\n", "\n", "import pylab as pl\n", "from IPython import display\n", "from utils.timer import Timer\n", "\n", "i=0\n", "timer = Timer()\n", "while True:\n", " timer.tic()\n", " ret, frame = video.read()\n", " i+=1\n", " \n", " if not ret:\n", " print(\"Can't receive frame (stream end?). 
Exiting ...\")\n", " break\n", "\n", " t = torch.from_numpy(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))\n", " t.shape\n", " # image = image[np.newaxis, :] \n", " t = t.permute(2, 0, 1)\n", " t.shape\n", "\n", " batch = preprocess(t)[None, :].to(device)\n", " # no_grad can be used on inference, should be slightly faster\n", " with torch.no_grad():\n", " predictions = model(batch)\n", " prediction = predictions[0] # we feed only one frame at a time\n", "\n", " mask = prediction['labels'] == 1 # if we want more than one: np.isin(prediction['labels'], [1,86])\n", "\n", " scores = prediction['scores'][mask]\n", " labels = prediction['labels'][mask]\n", " boxes = prediction['boxes'][mask]\n", " \n", " # TODO: introduce confidence thresholding and NMS suppression: https://github.com/cfotache/pytorch_objectdetecttrack/blob/master/PyTorch_Object_Tracking.ipynb\n", " # (which I _think_ we had better do after filtering)\n", " # alternatively look at Soft-NMS https://towardsdatascience.com/non-maximum-suppression-nms-93ce178e177c\n", "\n", " \n", " # dets - for track in tracks:
        # Group tracker rows by track ID (column 4 of each row)
        track_id = str(int(track[4]))
        if track_id not in tracked_instances:
            tracked_instances[track_id] = []
        tracked_instances[track_id].append(track)

 
    # labels = [weights.meta["categories"][i] for i in labels]

    if display_image:
        box = draw_bounding_boxes(t, boxes=t_boxes,
                                  labels=labels,
                                  colors="cyan",
                                  width=2,
                                  font_size=30,
                                  # font='Arial'
                                  )

        im = to_pil_image(box.detach())

        display.display(im, f"frame {i}")
    print(prediction)
    print("time for frame: ", timer.toc(), ", avg:", 1/timer.average_time, "fps")

    display.clear_output(wait=True)
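
The homography described in the VIRAT README (`[xw,yw,zw]' = H*[xi,yi,1]'`, with world coordinates `xw/zw` and `yw/zw`) could be applied to the tracked boxes roughly as sketched below. This is a minimal sketch, not part of the notebook above: the helper names `image_to_world` and `box_foot_point` are hypothetical, `H_demo` is a made-up matrix standing in for the `H` loaded with `np.loadtxt`, and using the bottom-centre of a box as the point where a person touches the ground plane is an assumption.

```python
import numpy as np


def image_to_world(H, points_xy):
    """Map N image points (xi, yi) to world-plane points (xw/zw, yw/zw)."""
    pts = np.asarray(points_xy, dtype=float).reshape(-1, 2)
    # Append 1 to each point to get homogeneous coordinates [xi, yi, 1]
    homogeneous = np.hstack([pts, np.ones((len(pts), 1))])
    # Each row becomes [xw, yw, zw] = H @ [xi, yi, 1]
    mapped = homogeneous @ np.asarray(H, dtype=float).T
    # Divide by zw to obtain the world x and y coordinates
    return mapped[:, :2] / mapped[:, 2:3]


def box_foot_point(box):
    """Bottom-centre of an (x1, y1, x2, y2, ...) box (assumed ground contact)."""
    x1, y1, x2, y2 = box[:4]
    return ((x1 + x2) / 2.0, y2)


# Made-up homography for illustration only; in the notebook this would be H
H_demo = np.array([[2.0, 0.0, 1.0],
                   [0.0, 2.0, -1.0],
                   [0.0, 0.0, 2.0]])

foot = box_foot_point([0.0, 0.0, 10.0, 20.0])
world_xy = image_to_world(H_demo, [foot])
print(world_xy)
```

Applied to every row stored in `tracked_instances`, this would turn each track into a sequence of positions on the world plane instead of pixel coordinates.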