Video Scene Detection and Classification: PySceneDetect, Places365 and Mozilla DeepSpeech Engine

Khushboo .
5 min read · Jul 17, 2020

Video is an electronic medium for capturing and presenting moving images. Today it is used for almost everything, from entertainment and marketing to broadcasting, knowledge sharing and social journalism. Scene detection can help consumers and businesses repurpose their video content.

Video scene detection, or segmentation, is a method of splitting a video into clips or scenes that are semantically or visually related. A scene can be thought of as a sequence of shots that together convey a single idea or story, where a shot is a contiguous chain of frames or images. Videos play a vital role in business: in real estate, for example, many companies use deep-learning-based video scene detection for home inspection, where distinct scenes may show a room with contorted or misaligned walls, floors and ceilings, inadequate waterproofing, or active water entry and leaks. In a recorded lecture, a scene may correspond to a specific topic that was discussed.

PySceneDetect

PySceneDetect is a command-line application and a Python library for detecting scene changes in videos and automatically splitting a video into separate clips. It is free and open-source software (FOSS) and offers several detection methods, from simple threshold-based fade in/out detection to advanced content-aware detection of fast cuts between shots.

PySceneDetect uses two main detection methods: detect-threshold (compares each frame to a set black level; useful for detecting cuts and fades to/from black) and detect-content (compares each frame to the previous one, looking for changes in content; useful for detecting fast cuts between scenes, although slower to process).
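
The same detectors are also exposed through the library's Python API. Below is a minimal sketch written against the PySceneDetect 0.5-era VideoManager/SceneManager interface; the video path and the threshold value are placeholders, not values from this project.

from scenedetect import VideoManager, SceneManager
from scenedetect.detectors import ContentDetector

# open the video and attach a content-aware detector
video_manager = VideoManager(['my_video.mp4'])
scene_manager = SceneManager()
scene_manager.add_detector(ContentDetector(threshold=30.0))

base_timecode = video_manager.get_base_timecode()
video_manager.start()
scene_manager.detect_scenes(frame_source=video_manager)

# each scene is a (start, end) pair of FrameTimecode objects
for i, (start, end) in enumerate(scene_manager.get_scene_list(base_timecode), start=1):
    print('Scene %d: %s to %s' % (i, start.get_timecode(), end.get_timecode()))
video_manager.release()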

Places365 Model

The Places365 model is a convolutional neural network (CNN) trained on the Places365 dataset, the latest subset of the Places2 database. There are two versions of Places365: Places365-Standard and Places365-Challenge. The training set of Places365-Standard has ~1.8 million images from 365 scene categories, with at most 5,000 images per category; various baseline CNNs have been trained on it. The training set of Places365-Challenge adds 6.2 million extra images on top of Places365-Standard (so ~8 million images in total), with at most 40,000 images per category.

Mozilla DeepSpeech Speech to Text Engine

(Figure: DeepSpeech architecture. Data flows from the audio input to feature computation, then through three fully connected layers.)

Mozilla DeepSpeech is an automatic speech recognition (ASR) engine that aims to make speech recognition technology and trained models openly available to developers. It is an open-source Speech-to-Text engine that uses a model trained with machine learning techniques to provide speech recognition approaching human accuracy.

DeepSpeech is composed of two main subsystems: an acoustic model and a decoder. The acoustic model is a deep neural network that receives audio features as inputs, and outputs character probabilities. The decoder uses a beam search algorithm to transform the character probabilities into textual transcripts that are then returned by the system.
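
As a concrete illustration of how the acoustic model and decoder are used together, here is a minimal sketch with the deepspeech Python package (0.7.x API). The model, scorer, and audio paths are placeholders, and the WAV file is assumed to already be 16 kHz, 16-bit mono PCM.

import wave
import numpy as np
import deepspeech

# load the acoustic model and attach the external scorer (language model)
model = deepspeech.Model('deepspeech-0.7.4-models.pbmm')
model.enableExternalScorer('deepspeech-0.7.4-models.scorer')

# read 16 kHz, 16-bit mono PCM samples from a WAV file
with wave.open('audio/scene_001.wav', 'rb') as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

# run the acoustic model and beam-search decoder to get the transcript
print(model.stt(audio))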

This blog is about building an application that detects different scenes from video frames and classifies them into house-room categories such as Mezzanine, Laundromat, Bedroom, Kitchen, Nursery, Dorm room, etc. The video is then split into smaller scene clips using FFmpeg, and Speech-to-Text is run with the Mozilla DeepSpeech engine to verify the classification output.

Available Cloud-Based Solutions:
1. AWS Rekognition (Preferred)
2. Azure Video Indexer

In this blog, I explain the steps I followed to build the application:
1. Download the video from YouTube using the Python pytube package. (Optional)
2. Process the video with PySceneDetect to perform scene detection, saving the output frames and scene information (start/end timecode, start/end frame, length, etc.) to a CSV file.
3. Remove images that have more than 90% similar content.
4. Run the Places365 classification model (ResNet-50 backbone) on these scene images.
5. Split the video into scene clips based on the information in the CSV generated by PySceneDetect.
6. Preprocess these split videos before running the Speech-to-Text engine.
7. Run the DeepSpeech engine on all these videos to extract text data.

Step 1:

>>> from pytube import YouTube
>>> YouTube('https://www.youtube.com/watch?v=cKbF74R6Vqs').streams[0].download()
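
Depending on the pytube version, streams[0] may pick a low-resolution or audio-only stream. A hedged alternative is to filter for a progressive MP4 stream explicitly (filter and first are part of pytube's stream query API):

>>> from pytube import YouTube
>>> yt = YouTube('https://www.youtube.com/watch?v=cKbF74R6Vqs')
>>> yt.streams.filter(progressive=True, file_extension='mp4').first().download()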

Step 2:

>>> scenedetect --input my_video.mp4 --output my_video_scenes --stats my_video.stats.csv detect-content list-scenes save-images

Step 3:

This function takes two image paths and returns a similarity score based on the structure of the images; depending on a threshold value you can then decide whether or not to delete one of them.

import cv2

def sift_sim(path_a, path_b):
    '''
    Use ORB features to measure image similarity
    @args:
    {str} path_a: the path to an image file
    {str} path_b: the path to an image file
    @returns:
    {float} fraction of matched features that are close, in [0, 1]
    '''
    # initialize the ORB feature detector
    orb = cv2.ORB_create()
    # load the images
    img_a = cv2.imread(path_a)
    img_b = cv2.imread(path_b)
    # find the keypoints and descriptors with ORB
    kp_a, desc_a = orb.detectAndCompute(img_a, None)
    kp_b, desc_b = orb.detectAndCompute(img_b, None)
    # if either image has no detectable features, treat the pair as dissimilar
    if desc_a is None or desc_b is None:
        return 0
    # initialize the brute-force matcher (Hamming distance for binary ORB descriptors)
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    # match.distance is lower for more similar feature pairs
    matches = bf.match(desc_a, desc_b)
    if len(matches) == 0:
        return 0
    similar_regions = [m for m in matches if m.distance < 70]
    return len(similar_regions) / len(matches)
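
A usage sketch for pruning near-duplicates: it assumes the scene images saved by PySceneDetect live in my_video_scenes/ (a placeholder path), and the 0.9 threshold mirrors the "90% similar content" rule from the step list above.

import glob
import os

image_paths = sorted(glob.glob('my_video_scenes/*.jpg'))
kept = image_paths[:1]
for path in image_paths[1:]:
    # compare each image against the last image we decided to keep
    if sift_sim(kept[-1], path) > 0.9:
        os.remove(path)      # near-duplicate: delete it
    else:
        kept.append(path)    # sufficiently different: keep it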

Step 4:

This run_placesCNN_basic.py file is available in the Places365 GitHub repository. Just change your image path and choose your model accordingly.

Run basic code to get the scene prediction from PlacesCNN:

python run_placesCNN_basic.py

RESULT ON http://places.csail.mit.edu/demo/5.jpg
0.238 -> artists_loft
0.176 -> art_gallery
0.091 -> art_studio
0.081 -> playroom
0.043 -> art_school

Or run the unified code to predict scene categories, indoor/outdoor type, scene attributes, and the class activation map together from PlacesCNN:

python run_placesCNN_unified.py

RESULT ON http://places.csail.mit.edu/demo/7.jpg
--TYPE OF ENVIRONMENT: indoor
--SCENE CATEGORIES:
0.525 -> childs_room
0.292 -> dorm_room
0.061 -> bedroom
0.036 -> bedchamber
0.023 -> youth_hostel
--SCENE ATTRIBUTES:
no horizon, man-made, enclosed area, wood, glass, glossy, reading, indoor lighting, cloth
Class activation map is saved as cam.jpg
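
For reference, the core of the basic script boils down to the sketch below. It assumes resnet50_places365.pth.tar and categories_places365.txt have been downloaded from the Places365 repository, and the image path is a placeholder for one of the saved scene frames.

import torch
from torch.nn import functional as F
from torchvision import models, transforms
from PIL import Image

# build a ResNet-50 with 365 outputs and load the Places365 weights
model = models.resnet50(num_classes=365)
checkpoint = torch.load('resnet50_places365.pth.tar', map_location='cpu')
state_dict = {k.replace('module.', ''): v for k, v in checkpoint['state_dict'].items()}
model.load_state_dict(state_dict)
model.eval()

# scene category names shipped with the repository
classes = [line.strip().split(' ')[0][3:] for line in open('categories_places365.txt')]

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = preprocess(Image.open('my_video_scenes/my_video-Scene-001-01.jpg').convert('RGB'))
with torch.no_grad():
    probs = F.softmax(model(img.unsqueeze(0)), dim=1).squeeze()
top5 = probs.topk(5)
for prob, idx in zip(top5.values, top5.indices):
    print('%.3f -> %s' % (prob.item(), classes[idx.item()]))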

Step 5:

PySceneDetect generates a CSV file in which all scenes are listed with details such as start time, end time and length. Using these details, extract the scene videos with FFmpeg, for example:

$ ffmpeg -i source-file.foo -ss 0 -t 600 first-10-min.m4v
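
To cut every scene rather than a fixed range, you can loop over the generated CSV and call FFmpeg once per row. The sketch below assumes the CSV file name, a leading timecode-list row, and 'Start Time (seconds)' / 'Length (seconds)' columns; check the layout of the file PySceneDetect actually wrote. Recent PySceneDetect versions can also do this splitting directly via the split-video command.

import csv
import subprocess

with open('my_video-Scenes.csv') as f:
    next(f)  # skip the leading timecode-list row, if present
    for i, row in enumerate(csv.DictReader(f), start=1):
        subprocess.run([
            'ffmpeg', '-i', 'my_video.mp4',
            '-ss', row['Start Time (seconds)'],
            '-t', row['Length (seconds)'],
            '-c', 'copy', 'scene_%03d.mp4' % i,
        ], check=True)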

Step 6:

Convert the video's audio track to a 16 kHz, 16-bit mono WAV file so that it can be fed to the Mozilla DeepSpeech engine:

$ ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav

Step 7:

You need to download the acoustic model (.pbmm) and scorer files; see the DeepSpeech documentation for installation and usage details.

deepspeech --model deepspeech-0.7.4-models.pbmm --scorer deepspeech-0.7.4-models.scorer --audio audio/2830-3980-0043.wav
