{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import cv2\n", "from pathlib import Path\n", "import numpy as np\n", "# from PIL import Image\n", "import torch\n", "from torchvision.io.video import read_video\n", "import matplotlib.pyplot as plt\n", "from torchvision.utils import draw_bounding_boxes\n", "from torchvision.transforms.functional import to_pil_image\n", "from torchvision.models.detection import retinanet_resnet50_fpn_v2, RetinaNet_ResNet50_FPN_V2_Weights\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "source = Path('../DATASETS/VIRAT_subset_0102x')\n", "videos = source.glob('*.mp4')\n", "homography = list(source.glob('*img2world.txt'))[0]\n", "H = np.loadtxt(homography, delimiter=',')\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The homography matrix helps to transform points from image space to a flat world plane. The `README_homography.txt` from VIRAT describes:\n", "\n", "> Roughly estimated 3-by-3 homographies are included for convenience. \n", "> Each homography H provides a mapping from image coordinate to scene-dependent world coordinate.\n", "> \n", "> [xw,yw,zw]' = H*[xi,yi,1]'\n", "> \n", "> xi: horizontal axis on image with left top corner as origin, increases right.\n", "> yi: vertical axis on image with left top corner as origin, increases downward.\n", "> \n", "> xw/zw: world x coordinate\n", "> yw/zw: world y coordiante" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# H.dot(np.array([20,300, 1]))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "video_path = list(videos)[0]\n", "video_path = Path(\"../DATASETS/VIRAT_subset_0102x/VIRAT_S_010200_00_000060_000218.mp4\")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PosixPath('../DATASETS/VIRAT_subset_0102x/VIRAT_S_010200_00_000060_000218.mp4')" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "video_path" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Suggestions from: https://stackabuse.com/retinanet-object-detection-with-pytorch-and-torchvision/" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "device(type='cuda')" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "device" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "weights = RetinaNet_ResNet50_FPN_V2_Weights.DEFAULT\n", "model = retinanet_resnet50_fpn_v2(weights=weights, score_thresh=0.35)\n", "model.to(device)\n", "# Put the model in inference mode\n", "model.eval()\n", "# Get the transforms for the model's weights\n", "preprocess = weights.transforms().to(device)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "# hub.set_dir()" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "video = cv2.VideoCapture(str(video_path))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> The score_thresh argument defines the threshold at which an object is detected as an object of a class. 
Intuitively, it's the confidence threshold, and we won't classify an object to belong to a class if the model is less than 35% confident that it belongs to a class." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The result from a single prediction coming from `model(batch)` looks like:\n", "\n", "```python\n", "{'boxes': tensor([[5.7001e+02, 2.5786e+02, 6.3138e+02, 3.6970e+02],\n", " [5.0109e+02, 2.4508e+02, 5.5308e+02, 3.4852e+02],\n", " [3.4096e+02, 2.7015e+02, 3.6156e+02, 3.1857e+02],\n", " [5.0219e-01, 3.7588e+02, 9.7911e+01, 7.2000e+02],\n", " [3.4096e+02, 2.7015e+02, 3.6156e+02, 3.1857e+02],\n", " [8.3241e+01, 5.8410e+02, 1.7502e+02, 7.1743e+02]]),\n", " 'scores': tensor([0.8525, 0.6491, 0.5985, 0.4999, 0.3753, 0.3746]),\n", " 'labels': tensor([64, 64, 1, 64, 18, 86])}\n", "```" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Now with SORT tracking\n", "\n", "Using a sort implementation originally by Alex Bewley, but adapted by [Chris Fotache](https://github.com/cfotache/pytorch_objectdetecttrack/blob/master/README.md). For an example implementation, see [his notebook](https://github.com/cfotache/pytorch_objectdetecttrack/blob/master/PyTorch_Object_Tracking.ipynb).\n", "\n" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "from sort_cfotache import Sort\n", "\n", "mot_tracker = Sort()\n", "\n", "display_image = True" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "tracked_instances = {}" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "ename": "KeyboardInterrupt", "evalue": "", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[58], line 29\u001b[0m\n\u001b[1;32m 27\u001b[0m \u001b[39m# no_grad can be used on inference, should be slightly faster\u001b[39;00m\n\u001b[1;32m 28\u001b[0m \u001b[39mwith\u001b[39;00m torch\u001b[39m.\u001b[39mno_grad():\n\u001b[0;32m---> 29\u001b[0m predictions \u001b[39m=\u001b[39m model(batch)\n\u001b[1;32m 30\u001b[0m prediction \u001b[39m=\u001b[39m predictions[\u001b[39m0\u001b[39m] \u001b[39m# we feed only one frame at the once\u001b[39;00m\n\u001b[1;32m 32\u001b[0m mask \u001b[39m=\u001b[39m prediction[\u001b[39m'\u001b[39m\u001b[39mlabels\u001b[39m\u001b[39m'\u001b[39m] \u001b[39m==\u001b[39m \u001b[39m1\u001b[39m \u001b[39m# if we want more than one: np.isin(prediction['labels'], [1,86])\u001b[39;00m\n", "File \u001b[0;32m~/suspicion/trajpred/.venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1501\u001b[0m, in \u001b[0;36mModule._call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1496\u001b[0m \u001b[39m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[1;32m 1497\u001b[0m \u001b[39m# this function, and just call forward.\u001b[39;00m\n\u001b[1;32m 1498\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mnot\u001b[39;00m (\u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_backward_hooks \u001b[39mor\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_backward_pre_hooks \u001b[39mor\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_forward_hooks \u001b[39mor\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_forward_pre_hooks\n\u001b[1;32m 1499\u001b[0m \u001b[39mor\u001b[39;00m 
_global_backward_pre_hooks \u001b[39mor\u001b[39;00m _global_backward_hooks\n\u001b[1;32m 1500\u001b[0m \u001b[39mor\u001b[39;00m _global_forward_hooks \u001b[39mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[0;32m-> 1501\u001b[0m \u001b[39mreturn\u001b[39;00m forward_call(\u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n\u001b[1;32m 1502\u001b[0m \u001b[39m# Do not call functions when jit is used\u001b[39;00m\n\u001b[1;32m 1503\u001b[0m full_backward_hooks, non_full_backward_hooks \u001b[39m=\u001b[39m [], []\n", "File \u001b[0;32m~/suspicion/trajpred/.venv/lib/python3.9/site-packages/torchvision/models/detection/retinanet.py:663\u001b[0m, in \u001b[0;36mRetinaNet.forward\u001b[0;34m(self, images, targets)\u001b[0m\n\u001b[1;32m 660\u001b[0m split_anchors \u001b[39m=\u001b[39m [\u001b[39mlist\u001b[39m(a\u001b[39m.\u001b[39msplit(num_anchors_per_level)) \u001b[39mfor\u001b[39;00m a \u001b[39min\u001b[39;00m anchors]\n\u001b[1;32m 662\u001b[0m \u001b[39m# compute the detections\u001b[39;00m\n\u001b[0;32m--> 663\u001b[0m detections \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mpostprocess_detections(split_head_outputs, split_anchors, images\u001b[39m.\u001b[39;49mimage_sizes)\n\u001b[1;32m 664\u001b[0m detections \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mtransform\u001b[39m.\u001b[39mpostprocess(detections, images\u001b[39m.\u001b[39mimage_sizes, original_image_sizes)\n\u001b[1;32m 666\u001b[0m \u001b[39mif\u001b[39;00m torch\u001b[39m.\u001b[39mjit\u001b[39m.\u001b[39mis_scripting():\n", "File \u001b[0;32m~/suspicion/trajpred/.venv/lib/python3.9/site-packages/torchvision/models/detection/retinanet.py:531\u001b[0m, in \u001b[0;36mRetinaNet.postprocess_detections\u001b[0;34m(self, head_outputs, anchors, image_shapes)\u001b[0m\n\u001b[1;32m 529\u001b[0m scores_per_level \u001b[39m=\u001b[39m torch\u001b[39m.\u001b[39msigmoid(logits_per_level)\u001b[39m.\u001b[39mflatten()\n\u001b[1;32m 530\u001b[0m keep_idxs \u001b[39m=\u001b[39m scores_per_level \u001b[39m>\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mscore_thresh\n\u001b[0;32m--> 531\u001b[0m scores_per_level \u001b[39m=\u001b[39m scores_per_level[keep_idxs]\n\u001b[1;32m 532\u001b[0m topk_idxs \u001b[39m=\u001b[39m torch\u001b[39m.\u001b[39mwhere(keep_idxs)[\u001b[39m0\u001b[39m]\n\u001b[1;32m 534\u001b[0m \u001b[39m# keep only topk scoring predictions\u001b[39;00m\n", "\u001b[0;31mKeyboardInterrupt\u001b[0m: " ] } ], "source": [ "# TODO make into loop\n", "%matplotlib inline\n", "\n", "\n", "import pylab as pl\n", "from IPython import display\n", "from utils.timer import Timer\n", "\n", "i=0\n", "timer = Timer()\n", "while True:\n", " timer.tic()\n", " ret, frame = video.read()\n", " i+=1\n", " \n", " if not ret:\n", " print(\"Can't receive frame (stream end?). 
Exiting ...\")\n", " break\n", "\n", " t = torch.from_numpy(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))\n", " t.shape\n", " # image = image[np.newaxis, :] \n", " t = t.permute(2, 0, 1)\n", " t.shape\n", "\n", " batch = preprocess(t)[None, :].to(device)\n", " # no_grad can be used on inference, should be slightly faster\n", " with torch.no_grad():\n", " predictions = model(batch)\n", " prediction = predictions[0] # we feed only one frame at the once\n", "\n", " mask = prediction['labels'] == 1 # if we want more than one: np.isin(prediction['labels'], [1,86])\n", "\n", " scores = prediction['scores'][mask]\n", " labels = prediction['labels'][mask]\n", " boxes = prediction['boxes'][mask]\n", " \n", " # TODO: introduce confidence and NMS supression: https://github.com/cfotache/pytorch_objectdetecttrack/blob/master/PyTorch_Object_Tracking.ipynb\n", " # (which I _think_ we better do after filtering)\n", " # alternatively look at Soft-NMS https://towardsdatascience.com/non-maximum-suppression-nms-93ce178e177c\n", "\n", " \n", " # dets - a numpy array of detections in the format [[x1,y1,x2,y2,score],[x1,y1,x2,y2,score],...]\n", " detections = np.array([np.append(bbox, [score, label]) for bbox, score, label in zip(boxes.cpu(), scores.cpu(), labels.cpu())])\n", " # print(detections)\n", " tracks = mot_tracker.update(detections)\n", "\n", " # now convert back to boxes and labels\n", " # print(tracks)\n", " boxes = np.array([t[:4] for t in tracks])\n", " # initialize empty with the necesserary dimensions for drawing_bounding_boxes glitch\n", " t_boxes = torch.from_numpy(boxes) if len(boxes) else torch.Tensor().new_empty([0, 6])\n", " labels = [str(int(t[4])) for t in tracks]\n", " # print(t_boxes, boxes, labels)\n", "\n", "\n", " for track in tracks:\n", " # TODO add to tracked_instances\n", " track_id = str(int(track[4]))\n", " if track_id not in tracked_instances:\n", " tracked_instances[track_id] = []\n", " tracked_instances[track_id].append(track)\n", "\n", " \n", " # labels = [weights.meta[\"categories\"][i] for i in labels]\n", "\n", " if display_image:\n", " box = draw_bounding_boxes(t, boxes=t_boxes,\n", " labels=labels,\n", " colors=\"cyan\",\n", " width=2, \n", " font_size=30,\n", " # font='Arial'\n", " )\n", "\n", " im = to_pil_image(box.detach())\n", "\n", " display.display(im, f\"frame {i}\")\n", " print(prediction)\n", " print(\"time for frame: \", timer.toc(), \", avg:\", 1/timer.average_time, \"fps\")\n", "\n", " display.clear_output(wait=True)\n", "\n", " # break # for now\n", " # pl.clf()\n", " # # pl.plot(pl.randn(100))\n", " # pl.figure(figsize=(24,50))\n", " # # fig.axes[0].imshow(img)\n", " # pl.imshow(im)\n", " # display.display(pl.gcf(), f\"frame {i}\")\n", " # display.clear_output(wait=True)\n", " # time.sleep(1.0)\n", "\n", " # fig, ax = plt.subplots(figsize=(16, 12))\n", " # ax.imshow(im)\n", " # plt.show()\n", "\n" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['22', '24', '26', '27', '30', '31', '32', '33', '37'])" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([[5.30405334e+02, 5.34641296e+02, 6.03237061e+02, 7.18612122e+02,\n", " 9.42070127e-01, 1.00000000e+00],\n", " [4.61479340e+02, 5.49811340e+02, 5.34607056e+02, 7.17237122e+02,\n", " 9.26090062e-01, 1.00000000e+00],\n", " [3.38673218e+02, 2.55078461e+02, 3.57062561e+02, 2.95217896e+02,\n", 
" 6.61470771e-01, 1.00000000e+00]]),)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'17': [array([573.00909697, 551.76122438, 657.56378982, 720.05069192,\n", " 17. , 1. ]),\n", " array([570.16715738, 550.85464258, 652.59986304, 719.88004284,\n", " 17. , 1. ]),\n", " array([568.02909891, 550.10706805, 649.96206622, 720.03113806,\n", " 17. , 1. ]),\n", " array([562.49451695, 549.06638446, 644.29895964, 720.04103925,\n", " 17. , 1. ])],\n", " '13': [array([337.63475088, 255.66774475, 355.97561492, 296.69147428,\n", " 13. , 1. ]),\n", " array([337.77042983, 255.72223676, 356.05113319, 296.63698388,\n", " 13. , 1. ]),\n", " array([338.02427059, 255.89595935, 356.25536645, 296.58306741,\n", " 13. , 1. ]),\n", " array([338.1632419 , 255.82719651, 356.27227032, 296.33234513,\n", " 13. , 1. ])],\n", " '12': [array([481.57704931, 568.79192296, 570.79284909, 718.23349465,\n", " 12. , 1. ]),\n", " array([479.96268827, 569.31456975, 567.89464999, 718.91657277,\n", " 12. , 1. ]),\n", " array([478.23383288, 568.93539717, 565.05653529, 718.92571522,\n", " 12. , 1. ]),\n", " array([475.43950486, 567.4295262 , 561.46362594, 718.3620136 ,\n", " 12. , 1. ])]}" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "1135f674f58caf91385e41dd32dc418daf761a3c5d4526b1ac3bad0b893c2eb5" } } }, "nbformat": 4, "nbformat_minor": 2 }