This blog post was originally published at Intel's website. It is reprinted here with the permission of Intel.
An Intel marketing team recently approached the Intel Science and Technology Center for Visual Cloud Systems with a request. They were looking for traffic video clips to use at trade shows and in demos and wondered if we could provide some from our Smart City Testbed in Pittsburgh. As we delved a little deeper into their requirements, though, the problem became more complex. They wanted a selection from a variety of points of view and locations. There were also restrictions on content licensing and on the presence of personally identifiable information (e.g., faces) in the videos. Our small Pittsburgh deployment was not going to be a sufficient source for this data. We turned to one of the largest public video datasets available – the Yahoo/Flickr 100 million (YF100M) dataset, which includes about 800,000 videos. We decided to use the combination of Scanner and the Visual Data Management System (VDMS) with Apache Spark, Pandas and Apache Arrow to help select usable videos from YF100M and to create a proof of concept around Big Visual Data Analytics.
What is Big Visual Data Analytics?
Over the last few years, Big Data has come to subsume many types of structured and unstructured, traditional and artificial intelligence (AI)-oriented data analytic approaches. Meanwhile, Video Analytics generally means traditional and AI-oriented computer vision and image analysis techniques on video data. The mashup of these two approaches is what I mean by Big Visual Data Analytics:
- Big – Almost by definition, visual data is big. A single two-hour 4K 60fps video can consume half a terabyte of disk space, depending on the encoding scheme. YF100M includes 800,000 videos, although most are short and relatively low quality. The Brown Institute dataset that I talked about previously includes 200,000 longer videos. Add the application and video metadata that accompany the content, and the full scale of a video dataset becomes apparent.
- Visual – In addition to the video content itself, other visually oriented data are often created and used in applications. This data can include individual images and transformed videos, video processing data like motion vectors, AI-related data like neural network feature vectors, derived abstractions like bounding boxes and segmentation maps, and generative data like textures and object meshes. All of this data can be complex and expensive to manage and analyze.
- Data – Combining the visual data with its metadata defines the full scope of the data problem for a big visual data scientist. There can be substantial metadata associated with the video itself – where and when it was taken, who and what is in it, what its license terms and encoding parameters are, etc. Frequently, applications also want to analyze non-visual data alongside the visual data. In a smart city case, for example, I may want to correlate traffic signal timing data with the number of vehicles a traffic camera detects on each signal cycle (see the sketch after this list).
- Analytics – It is tempting to summarize analytics as “deriving insight from visual data and its associated metadata.” While that’s not bad, there are applications that stretch the boundaries of this definition. At CVPR this year, we showed a demo that used Scanner and VDMS together to generate skeletal reconstructions and player poses from 36 synchronized video streams of a soccer game. These poses were then used to create higher quality player reconstructions in the volumetric rendering pipeline implemented by the Intel Sports Group. The poses are an “insight,” but the ultimate value was their incorporation into the pipeline. This “insight to operations” step is also evident in surveillance applications where, for example, license plate recognition can generate a law enforcement dispatch.
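To make the “Data” point above concrete, here is a minimal pandas sketch of the smart-city correlation example: joining per-cycle vehicle counts derived from a traffic camera with signal timing metadata. The column names and values are invented purely for illustration.

```python
import pandas as pd

# Hypothetical signal timing metadata (non-visual data).
signal_timing = pd.DataFrame({
    "cycle_id": [1, 2, 3, 4],
    "green_seconds": [30, 45, 30, 60],
})

# Hypothetical per-cycle vehicle counts derived from a traffic camera.
vehicle_counts = pd.DataFrame({
    "cycle_id": [1, 2, 3, 4],
    "vehicles_detected": [12, 21, 9, 28],
})

# Join the two sources on the signal cycle and look at the correlation.
merged = signal_timing.merge(vehicle_counts, on="cycle_id")
print(merged[["green_seconds", "vehicles_detected"]].corr())
```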
So, our marketing team’s request brought us a good real-world example of big visual data analytics. Here’s how we met their challenge.
The YF100M Dataset
As mentioned above, we used the YF100M core dataset from the Multimedia Commons Initiative* for the proof of concept. The dataset has 99.2 million images and approximately 800,000 videos. The videos add up to around 8,081 hours, with an average video length of 37s and a median length of 28s. YF100M is partially tagged and is released under Creative Commons terms. In absolute terms, the “out-of-the-box” physical size of the dataset is admittedly not “big” (15GB of videos and images plus a 334 MB metadata file), but it is one of the best video datasets publicly available. The individual videos are released under six different Creative Commons license types, but only one of them was acceptable for marketing purposes.
Proof of Concept Requirements and Architecture
Our task was to select a subset of the videos from the dataset according to the following criteria:
- Filter to work only with videos carrying the authorized license. This step came first to ensure that we didn’t risk accidentally including unacceptably licensed videos.
- Find videos in that subset that are tagged with the keyword “traffic”.
- Identify videos in that subset that are encoded as H.264. Our downstream processing engine cannot handle the VP9-encoded files that are present in small numbers in the dataset.
- Detect faces and license plates in that subset and apply a blurring filter to that region of the video.
- Stretch goals included identifying the point of view of the video (e.g., from above, from a street camera, from a moving vehicle). We’re still working on this…
The architecture we implemented is shown below and includes three primary subsystems: Apache Spark, Scanner and VDMS. The application logic runs on a combined Spark, Scanner and VDMS client. That application uses the APIs and data formats of the respective underlying subsystems. Each subsystem was instantiated using separate nodegroups within a common Amazon Elastic Kubernetes Service (EKS) cluster, enabling us to use different Intel® Xeon® Scalable processor-based instance types for each subsystem-specific workload. The shared datastores are fully accessible from the client and from each of the subsystems. The metadata datastore uses Apache Arrow and Parquet files to share data between subsystems. The visual and metadata shared datastores were implemented on Amazon Elastic File System (EFS). The YF100M dataset itself was accessed directly from the public YF100M S3 bucket.
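As one example of how the subsystems share metadata, here is a minimal sketch of writing and reading a candidate-video table through Apache Arrow and Parquet on the shared store. The EFS mount path, column names and values are placeholders for this sketch, not the actual layout we used.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder path on the shared metadata store (an EFS mount in our setup).
SHARED_STORE = "/mnt/efs/metadata/candidates.parquet"

# Write a candidate-video table as Parquet via Arrow.
candidates = pd.DataFrame({
    "video_id": ["a1", "c3"],
    "license": ["Attribution License", "Attribution License"],
    "tags": ["traffic,city", "traffic,highway"],
})
pq.write_table(pa.Table.from_pandas(candidates), SHARED_STORE)

# Any subsystem (a Spark job, the Scanner client, etc.) can then load the
# same table from the shared filesystem without copying data around.
reloaded = pq.read_table(SHARED_STORE).to_pandas()
```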
We implemented the Spark functionality as Python queries submitted to the cluster with spark-submit but, for simplicity, we did most of the data cleaning and filtering (e.g., selecting “traffic”-tagged videos) with pandas on the client. These functions could easily have been done in Spark, but we didn’t need its scale-out capabilities for this workflow. In a more complex analytics task (e.g., using bounding boxes and feature vectors to identify similar faces across videos), the parallelization provided by Spark would be more valuable.
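A rough sketch of that client-side filtering, assuming the video metadata has already been loaded into a pandas DataFrame. The column names, license string and rows are placeholders rather than the actual YF100M field names.

```python
import pandas as pd

# Placeholder rows standing in for the YF100M video metadata; the real
# metadata file has its own field names and layout.
metadata = pd.DataFrame({
    "video_id": ["a1", "b2", "c3"],
    "license_name": ["Attribution License",
                     "Attribution-NonCommercial License",
                     "Attribution License"],
    "user_tags": ["traffic,city", "traffic", "beach,vacation"],
})

ALLOWED_LICENSE = "Attribution License"  # placeholder for the one acceptable CC license

# Keep only videos with the authorized license that are tagged "traffic".
candidates = metadata[
    (metadata["license_name"] == ALLOWED_LICENSE)
    & metadata["user_tags"].str.contains("traffic", case=False, na=False)
]
print(candidates["video_id"].tolist())  # -> ['a1']
```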
We used ffprobe to extract encoding-related information for the candidate videos. Only H.264-encoded videos were retained in the candidate list.
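A small sketch of that codec check, assuming ffprobe is on the PATH and that candidate_paths is a placeholder list of local paths for the license- and tag-filtered videos.

```python
import json
import subprocess

def video_codec(path: str) -> str:
    """Return the codec name of the first video stream, using ffprobe."""
    out = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-select_streams", "v:0",
            "-show_entries", "stream=codec_name",
            "-of", "json", path,
        ],
        capture_output=True, check=True, text=True,
    ).stdout
    streams = json.loads(out).get("streams", [])
    return streams[0]["codec_name"] if streams else "unknown"

# Placeholder paths; keep only the H.264-encoded candidates.
candidate_paths = ["/mnt/efs/videos/a1.mp4", "/mnt/efs/videos/c3.mp4"]
kept = [p for p in candidate_paths if video_codec(p) == "h264"]
```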
We used Scanner and Intel® OpenVINO™ to find and blur faces and license plates. To make this work, we added two custom blurring operators to Scanner. We also plan to use Scanner to implement the more sophisticated point-of-view task I mentioned as a stretch goal above.
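The Scanner operator API and the OpenVINO detection models are beyond the scope of this post, but the per-frame core of what the custom blurring operators do looks roughly like this OpenCV sketch. In the real pipeline the boxes come from the OpenVINO face and license-plate models; here the frame and box are synthetic.

```python
import cv2
import numpy as np

def blur_regions(frame, boxes, kernel=(51, 51)):
    """Gaussian-blur each detected (x, y, w, h) region of a frame in place."""
    for (x, y, w, h) in boxes:
        roi = frame[y:y + h, x:x + w]
        frame[y:y + h, x:x + w] = cv2.GaussianBlur(roi, kernel, 0)
    return frame

# Toy usage on a blank frame with one hypothetical detection box.
frame = np.zeros((720, 1280, 3), dtype=np.uint8)
blurred = blur_regions(frame, [(100, 200, 80, 80)])
```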
We used VDMS to store the metadata and derived visual data for the selected videos, making it easy to use that data in subsequent tasks without rerunning the underlying compute-intensive analytics. For example, the legal team might request a more thorough blurring of faces than originally performed. VDMS allows just the face regions to be retrieved and re-blurred without rerunning the face detection processing.
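As a rough illustration of that store-once, query-later pattern, here is a minimal VDMS client sketch that records per-video properties and finds them again. The connection details, entity class, and property names are all assumptions for this example, not the actual schema we used.

```python
import vdms

# Placeholder connection details; 55555 is the default VDMS port.
db = vdms.vdms()
db.connect("localhost", 55555)

# Store per-video properties once the analytics have run.
add_query = [{
    "AddEntity": {
        "class": "VideoClip",
        "properties": {
            "video_id": "a1",
            "license": "Attribution License",
            "codec": "h264",
            "faces_blurred": 4,
        },
    }
}]
response = db.query(add_query)

# Later, look the clip back up without re-running detection.
find_query = [{
    "FindEntity": {
        "class": "VideoClip",
        "constraints": {"video_id": ["==", "a1"]},
        "results": {"list": ["license", "codec", "faces_blurred"]},
    }
}]
response = db.query(find_query)
```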
POC Results
Out of the 787,479 YF100M videos, 137,000 had appropriate licensing for our purpose. Of these, 152 were tagged as “traffic” videos. We ran ffprobe and filtered out the non-H.264 videos to get 137 candidates. We then ran the blurring filter across these videos to produce the final content for review by the marketing team that made the original request. Processing this end-to-end workflow on a three-node Kubernetes cluster took around 40 minutes, with the face and license plate blurring stage taking the vast majority of the processing time.
Conclusion
While this workflow didn’t dramatically stress a system built on modern cloud services, it does demonstrate the types of big visual data analytics applications that are commercially interesting. And it shows how these applications combine traditional data analytics tools like Spark and Pandas with visual data processing and management tools like Scanner and VDMS. The entire solution stack was built with open source software running on publicly available cloud infrastructure. That means you can replicate the system for your own applications with almost no upfront investment – aside from learning how to become a big visual data analytics architect.
Jim Blakley
Visual Cloud Innovation Senior Director, Intel