Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads [pdf]
Sadjad Fouladi, et al.
Abstract: We describe ExCamera, a system that can edit, transform, and encode a video, including 4K and VR material, with low latency. The system makes two major contributions. First, we designed a framework to run general-purpose parallel computations on a commercial "cloud function" service. The system starts up thousands of threads in seconds and manages inter-thread communication. Second, we implemented a video encoder intended for fine-grained parallelism, using a functional-programming style that allows computation to be split into thousands of tiny tasks without harming compression efficiency. Our design reflects a key insight: the work of video encoding can be divided into fast and slow parts, with the "slow" work done in parallel, and only "fast" work done serially.
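The fast/slow split described in the abstract can be illustrated with a toy sketch. This is not ExCamera's encoder: `slow_encode`, `fast_stitch`, and `process` are hypothetical names, the arithmetic is a placeholder for real encoding work, and a local thread pool stands in for cloud-function workers. It only shows the general shape, assuming the expensive per-chunk work is independent and only a cheap step needs the previous chunk's state.

```python
from concurrent.futures import ThreadPoolExecutor


def slow_encode(chunk):
    """Placeholder for expensive per-chunk work that needs no other chunk."""
    return sum(x * x for x in chunk)


def fast_stitch(prev_state, chunk_result):
    """Placeholder for a cheap step that depends on the previous chunk's state."""
    return prev_state + chunk_result


def process(chunks, workers=4):
    # "Slow" work: every chunk is handled independently, so it runs in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(slow_encode, chunks))

    # "Fast" work: a serial pass threads state from one chunk to the next.
    state, stitched = 0, []
    for result in partial:
        state = fast_stitch(state, result)
        stitched.append(state)
    return stitched


print(process([[1, 2], [3], [4, 5]]))  # [5, 14, 55]
```

The serial pass touches each chunk once and does very little per chunk, so in this sketch total latency is dominated by the slowest parallel task rather than by the sum of all tasks.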
Demo: You may have seen this guy walking around with a GoPro strapped to his head during the first day of NSDI'17. John, our guinea pig, is a first-year PhD student at Stanford whom we enlisted to record his interactions at the conference. We scanned all the faces that John saw using a deep-neural-network-based face recognition package (OpenFace) that we deployed onto mu, our AWS Lambda supercomputing system. Our goal was to find all of the times John saw our ExCamera collaborator, George Porter; we wanted to make sure he made it safely to Boston :). Once all the time slices containing George were identified, we quickly encoded a montage using our ExCamera system. The first video displayed below is the full 5.5-hour video John took at the conference, and the second is the montage we generated in just a few minutes on stage!
To perform the face recognition and stitch the montage together, our system performs the following steps:
Upload an image of the face of interest to S3 and apply standard image augmentation techniques to generate a training set.
Use a deep neural network (DNN) to locate the face in each augmented image in the training set and generate a 128-dimensional feature vector for it.
Train a k-nearest-neighbors (KNN) classifier on (1) the augmented-image feature vectors and (2) feature vectors from Labeled Faces in the Wild (LFW).
Run the DNN featurizer and KNN classifier in parallel across the entire video using 3000+ AWS Lambda workers to perform recognition.
Aggregate all the frames where the face of interest was recognized.
Launch ExCamera to encode the frames into a montage!
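The classification and aggregation steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used in the demo: feature vectors are 2-dimensional toy values rather than 128-dimensional OpenFace embeddings, the KNN is a plain-Python majority vote, a local thread pool stands in for the Lambda workers, and all function names (`knn_predict`, `classify_chunk`, `find_target_frames`, `to_time_slices`) are hypothetical.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def knn_predict(train, query, k=3):
    """Return the majority label among the k training vectors nearest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def classify_chunk(args):
    """Stand-in for one worker: return indices of frames labeled 'target'."""
    train, chunk = args
    return [idx for idx, vec in chunk if knn_predict(train, vec) == "target"]


def find_target_frames(train, frames, chunk_size=2, workers=4):
    """Split frames into chunks, classify the chunks in parallel, merge the hits."""
    chunks = [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(classify_chunk, [(train, c) for c in chunks])
    return sorted(idx for hits in results for idx in hits)


def to_time_slices(frame_indices):
    """Merge consecutive frame indices into inclusive (start, end) ranges."""
    slices = []
    for idx in frame_indices:
        if slices and idx == slices[-1][1] + 1:
            slices[-1] = (slices[-1][0], idx)
        else:
            slices.append((idx, idx))
    return slices


# Toy training set: "target" vectors cluster near (0, 0), "other" near (1, 1).
train = [((0.0, 0.0), "target"), ((0.1, 0.0), "target"), ((0.0, 0.1), "target"),
         ((1.0, 1.0), "other"), ((0.9, 1.0), "other"), ((1.0, 0.9), "other")]
# Toy video: (frame index, feature vector of the face found in that frame).
frames = [(0, (1.0, 1.0)), (1, (0.05, 0.05)), (2, (0.0, 0.05)),
          (3, (0.95, 0.95)), (4, (0.1, 0.1)), (5, (1.0, 0.95))]

hits = find_target_frames(train, frames)
print(hits)                  # [1, 2, 4]
print(to_time_slices(hits))  # [(1, 2), (4, 4)]
```

Each chunk is classified independently of the others, which is what lets the recognition step fan out across many workers; the resulting time slices are what would then be handed to the encoder for the montage.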