This blog post was originally published on Qualcomm’s website. It is reprinted here with the permission of Qualcomm.
Qualcomm AI Research has published a variety of datasets for research use. These datasets can be used to train models for the kinds of applications most common in mobile computing, including advanced driver assistance systems (ADAS), extended reality (XR, spanning VR and AR), smartphones, robotics, smart homes, security, industrial IoT, healthcare and assistive technology, keyword spotting, and tensor program execution times.
Whether your applications depend on recognizing gestures, speech or images, you’ll find datasets you can use for training in machine learning and artificial intelligence. Explore our published datasets.
Qualcomm Exercise Video Dataset (QEVD)
The Qualcomm Exercise Video Dataset (QEVD) explores human-AI interaction in the challenging real-world domain of fitness coaching – a task which intrinsically requires monitoring live user activity and providing timely feedback. Our dataset includes corrective feedback to address potential user mistakes and steer users toward successful workout completion.
The dataset contains 474+ hours of videos and includes the following:
- short clips (~5 seconds in length) annotated with 1M+ question-answer pairs
- short clips (~5 seconds in length) annotated with 650k+ live feedback annotations, including corrective feedback
- long-range videos (>3 minutes in length) annotated with 7.5k+ live feedback annotations, including corrective feedback
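As a rough sketch of how such annotations might be consumed, the snippet below iterates over a hypothetical JSON layout; the directory, file and field names are assumptions for illustration, not the dataset’s documented schema.

```python
import json
from pathlib import Path

# Hypothetical layout: one JSON file per short clip with "video", "qa_pairs" and
# "feedbacks" fields. The real QEVD annotation schema may differ.
annotation_dir = Path("qevd/annotations/short_clips")

for ann_file in sorted(annotation_dir.glob("*.json")):
    ann = json.loads(ann_file.read_text())
    video_path = ann["video"]                # e.g. "videos/clip_00001.mp4"
    qa_pairs = ann.get("qa_pairs", [])       # [{"question": ..., "answer": ...}, ...]
    feedbacks = ann.get("feedbacks", [])     # [{"time": ..., "text": ...}, ...] incl. corrective feedback
    print(video_path, len(qa_pairs), len(feedbacks))
```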
ClevrSkills Dataset
ClevrSkills hosts ~300,000 robot episodes/trajectories including videos (from multiple views), corresponding actions, and other annotations including text, bounding boxes, camera poses, etc., which were generated from over 33 tasks in the ClevrSkills environment suite (available here).
The dataset also provides a carefully designed curriculum of tasks that can be used to train robotics models on tasks ranging from simple pick-and-place to more complicated manipulation such as sorting and stacking.
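A minimal sketch of working with one episode is shown below, assuming a hypothetical per-episode layout of view videos, an action array and an annotation file; the actual ClevrSkills file structure and field names may differ.

```python
import json
from pathlib import Path

import numpy as np

# Hypothetical episode layout: episodes/<id>/ holds per-view videos,
# an action array and an annotation file. Names are illustrative only.
episode_dir = Path("clevrskills/episodes/000123")

actions = np.load(episode_dir / "actions.npy")                 # (T, action_dim) robot actions
meta = json.loads((episode_dir / "annotations.json").read_text())
task = meta["task"]                                            # e.g. a pick-and-place task id
bboxes, cam_poses = meta["bounding_boxes"], meta["camera_poses"]
view_videos = sorted(episode_dir.glob("view_*.mp4"))           # one video per camera view
print(task, actions.shape, len(view_videos))
```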
AirLetters Dataset
The AirLetters dataset is a benchmark for evaluating a model’s ability to classify articulated motions. It is a large collection of over 161,000 labeled video clips showing humans drawing letters and digits in the air.
Unlike existing video datasets, accurate classification on AirLetters relies on discerning motion patterns and integrating information over time (i.e., across many frames of video).
Our study revealed that, while trivial for humans, accurately representing complex articulated motions remains an open problem for end-to-end learned models.
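To make “integrating information over time” concrete, here is a minimal, illustrative PyTorch sketch in which per-frame features are aggregated by a recurrent layer before classification; the feature size, class count and architecture are assumptions, not a reference model for AirLetters.

```python
import torch
import torch.nn as nn

# Minimal sketch: labels depend on motion over time, so per-frame predictions are
# not enough; aggregate frame features with a temporal model before classifying.
class TemporalClassifier(nn.Module):
    def __init__(self, feat_dim=512, num_classes=36):   # 36 is an illustrative class count
        super().__init__()
        self.gru = nn.GRU(feat_dim, 256, batch_first=True)
        self.head = nn.Linear(256, num_classes)

    def forward(self, frame_feats):          # frame_feats: (batch, time, feat_dim)
        _, h = self.gru(frame_feats)         # h: (1, batch, 256), summarizes the whole clip
        return self.head(h[-1])              # integrate evidence across frames

logits = TemporalClassifier()(torch.randn(2, 64, 512))  # 2 clips, 64 frames each
```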
PlotTwist Dataset
The PlotTwist dataset enables you to study visual reasoning with vision-language models (VLMs) that answer human-readable questions about images of mathematical graphs.
You can test with over 2,900 high-resolution image-data pairs (benchmark data), where each pair contains:
- A graph/plot of a mathematical function.
- Data, including a question about the image and the correct answer (e.g., “which subplot has the largest number of discontinuous functions?”).
The PlotTwist dataset also includes a training set of over 226,000 image-data pairs for re-training and fine-tuning models.
Tasks are categorized into three levels of increasing difficulty:
- Single-function
- Multi-function
- Multi-plot (the most challenging)
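A minimal evaluation loop over the benchmark could look like the sketch below, assuming a hypothetical JSONL layout with image, question and answer fields; query_vlm is a placeholder for whichever vision-language model you want to test.

```python
import json
from pathlib import Path

def query_vlm(image_path: str, question: str) -> str:
    """Placeholder: replace with a call to the VLM you are evaluating."""
    return "0"  # dummy answer so the sketch runs end to end

# Hypothetical layout: one JSON record per benchmark pair; field names are assumptions.
records = [json.loads(line) for line in Path("plottwist/benchmark.jsonl").read_text().splitlines()]

correct = 0
for rec in records:
    pred = query_vlm(rec["image"], rec["question"])
    correct += pred.strip().lower() == rec["answer"].strip().lower()
print(f"accuracy: {correct / max(len(records), 1):.3f}")
```

Exact string matching is only a stand-in here; numeric or free-form answers may need more tolerant scoring.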
Jester Dataset
Your model recognizes certain simple, single-frame gestures like a thumbs-up. But for a truly responsive, accurate system, you want your model to recognize complex gestures too, even when the differences between them are subtle. Is the person pointing to something or wagging their index finger? Is the hand cleaning the display or rubber-banding an image with two fingers? Given enough examples, your model can learn the difference.
The Jester gesture recognition dataset includes 148,092 labeled video clips of humans performing basic, pre-defined hand gestures in front of a laptop camera or webcam. It is designed for training machine learning models to recognize human hand gestures like sliding two fingers down, swiping left or right and drumming fingers.
The clips cover 27 different classes of human hand gestures, split in the ratio of 8:1:1 for training, development and testing. The dataset also includes two “no gesture” classes to help the network distinguish between specific gestures and unknown hand movements.
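For reference, the 8:1:1 ratio implies roughly the following split sizes; the official splits may differ slightly from this back-of-the-envelope computation.

```python
# The 148,092 Jester clips are split roughly 8:1:1 into train / validation / test.
total = 148_092
train = round(total * 8 / 10)
val = round(total * 1 / 10)
test = total - train - val
print(train, val, test)   # 118474 14809 14809 (approximate sizes)
```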
In the age of mobile computing, gesture/action recognition and its role in human-computer interfaces have grown in importance. The Jester video dataset allows the training of robust machine learning models to recognize human hand gestures.
Something-Something v2 Dataset
The Something-Something dataset (version 2) is a collection of 220,847 labeled video clips of humans performing pre-defined, basic actions with everyday objects. It is designed to train machine learning models in fine-grained understanding of human hand gestures like putting something into something, turning something upside down and covering something with something.
CausalCircuit Dataset
CausalCircuit is a dataset designed to guide research into causal representation learning – the problem of identifying the high-level causal variables in an image together with the causal structure between them.
The dataset consists of images that show a robotic arm interacting with a system of buttons and lights. In this system, there are four causal variables describing the position of the robot arm along an arc as well as the intensities of red, green, and blue lights. The data is rendered as 512×512 images with MuJoCo, an open-source physics engine. For the robotic arm we use a model of the TriFinger platform, an open-source robot for learning dexterity.
Each sample consists of a pair of observations captured before and after an intervention took place.
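Conceptually, each sample can be thought of as the container sketched below; the field names and structure are illustrative assumptions rather than the dataset’s actual schema.

```python
from dataclasses import dataclass

import numpy as np

# Illustrative container for one CausalCircuit sample. Four causal variables:
# arm position along the arc plus red/green/blue light intensities.
@dataclass
class CausalCircuitSample:
    image_before: np.ndarray      # (512, 512, 3) rendering before the intervention
    image_after: np.ndarray       # (512, 512, 3) rendering after the intervention
    latents_before: np.ndarray    # (4,) = [arm_position, red, green, blue]
    latents_after: np.ndarray     # (4,)
    intervention_target: int      # index of the intervened causal variable

sample = CausalCircuitSample(
    np.zeros((512, 512, 3), np.uint8), np.zeros((512, 512, 3), np.uint8),
    np.zeros(4), np.zeros(4), intervention_target=0,
)
```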
Wireless Indoor Simulations Dataset
The Wireless Indoor Simulations dataset contains a large set of channels to enable a better understanding of the interplay between the propagation environment (e.g., materials, geometry) and corresponding channel effects (e.g., delay, receive power).
The dataset has two parts: Wi3Rooms, where channels are simulated in various random configurations of three-room indoor layouts within a 10×5×3 m hull using the PyLayers simulator, and WiIndoor, where the 10×10×3 m indoor configurations are based on the RPLAN dataset and typically contain 4-8 rooms. The WiIndoor simulations were performed using Wireless InSite with the X3D ray tracer.
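To make the connection between simulated paths and channel effects concrete, the toy example below computes total receive power and RMS delay spread from a small set of path delays and complex gains; it does not assume anything about the dataset’s actual file format.

```python
import numpy as np

# Toy channel impulse response: complex path gains at given delays (seconds).
delays = np.array([10e-9, 25e-9, 60e-9, 120e-9])
gains = np.array([1.0, 0.5, 0.3, 0.1]) * np.exp(1j * np.random.uniform(0, 2 * np.pi, 4))

powers = np.abs(gains) ** 2
rx_power_db = 10 * np.log10(powers.sum())                # total receive power (relative, dB)
mean_delay = np.sum(delays * powers) / powers.sum()      # power-weighted mean delay
rms_delay_spread = np.sqrt(np.sum((delays - mean_delay) ** 2 * powers) / powers.sum())
print(rx_power_db, rms_delay_spread)
```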
Massive MIMO Spatial Channel Model Dataset
This dataset contains samples of the frequency-domain channel matrix for the channel between user equipment (UEs) and their corresponding serving cells, synthetically generated using the 3GPP Spatial Channel Model defined in TR 38.901. The data can be used to better understand the statistical distribution of channel characteristics in a typical dense urban layout.
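As an illustration of the kind of statistics one might extract, the sketch below computes per-subcarrier channel gain and spatial conditioning from a toy frequency-domain channel tensor; the dimensions and layout are assumptions, not the dataset’s actual format.

```python
import numpy as np

# Toy frequency-domain channel for one UE: (num_subcarriers, num_rx_ant, num_tx_ant).
H = (np.random.randn(64, 4, 32) + 1j * np.random.randn(64, 4, 32)) / np.sqrt(2)

gain_per_sc = np.linalg.norm(H, axis=(1, 2)) ** 2   # Frobenius channel gain per subcarrier
sv = np.linalg.svd(H, compute_uv=False)              # singular values per subcarrier
cond = sv[:, 0] / sv[:, -1]                          # spatial conditioning of the channel
print(gain_per_sc.mean(), cond.mean())
```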
Hey Snapdragon Keyword Dataset
Keyword spotting (KWS) is now widely used to detect specific keywords on personal devices such as mobile phones and home appliances. A keyword may consist of multiple words; “Hey Siri”, “Ok Google” and “Hi Bixby” are well-known examples.
Many such keywords are branded by specific companies, which have shown great interest in KWS for their own products. These companies have proposed various KWS approaches, but the approaches remain exclusive because they rely on proprietary keyword datasets that are not accessible to others. As a result, they are hard to reproduce and difficult to compare against one another.
To address this issue, we are publishing a keyword dataset for our Snapdragon® mobile platform. The Hey Snapdragon keyword dataset contains 4,270 audio clips of four English keyword classes spoken by 50 people.
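A minimal feature-extraction sketch for KWS experiments is shown below, assuming the clips are WAV files; the path, sample rate and MFCC settings are illustrative choices, not part of the dataset specification.

```python
import librosa
import numpy as np

def kws_features(path: str, sr: int = 16_000, n_mfcc: int = 40) -> np.ndarray:
    """Load one keyword clip and return frame-level MFCC features for a KWS model."""
    audio, _ = librosa.load(path, sr=sr)                        # resample to a fixed rate
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                               # (frames, n_mfcc)

feats = kws_features("hey_snapdragon/clip_0001.wav")            # hypothetical file name
```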
Dataset of Tensor Programs Execution Times – QAST
Current deep learning frameworks such as PyTorch and TensorFlow allow for optimizing a computational graph representation. However, they do not tackle hardware-specific, operator-level transformations, relying instead on manually tuned, vendor-specific operator libraries.
Recently, this gap has been filled by TVM, a compiler framework that allows both graph-level and operator-level optimization in an end-to-end manner. For a given target hardware, each operator defines a schedule configuration space, and TVM can compile the resulting tensor program and measure its execution time on the target hardware. This results in a hard optimization problem: on some GPUs, the search space of a single conv2d operator contains more than 10⁶ configurations.
Current efforts overcome this issue by learning how to optimize tensor programs from data rather than heuristics. When considering tensor programs as data, the abstract syntax tree (AST) representation associated with an operator configuration provides a rich input space. Graph Neural Network (GraphNN) models are a good fit for working with ASTs, as they preserve the graph structure and allow information to propagate among nodes.
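The sketch below illustrates the general idea: messages are passed along the AST’s adjacency structure, then a graph-level readout regresses execution time. The architecture and feature sizes are illustrative and not the reference model used with QAST.

```python
import torch
import torch.nn as nn

# Minimal sketch of message passing over an AST adjacency matrix followed by a
# graph-level readout that predicts tensor program runtime.
class ASTRuntimeRegressor(nn.Module):
    def __init__(self, in_dim=16, hidden=64):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hidden)
        self.lin2 = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x, adj):                        # x: (nodes, in_dim), adj: (nodes, nodes)
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        h = torch.relu(self.lin1(adj @ x / deg))      # propagate info from AST neighbors
        h = torch.relu(self.lin2(adj @ h / deg))
        return self.out(h.mean(0))                    # graph-level readout -> runtime

nodes, adj = torch.randn(30, 16), torch.eye(30)       # toy AST with self-loops only
pred_runtime = ASTRuntimeRegressor()(nodes, adj)
```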
We hope this new dataset, comprising 12 unique conv2d workloads, will benefit the graph research community and raise interest in optimizing compiler research.
Qualcomm Rasterized Images Dataset
The Qualcomm Rasterized Images for Super-resolution Processing dataset was created to facilitate the development and research of super-resolution algorithms for gaming. The dataset consists of parallel captures of various scenes in different modalities and resolutions. It is designed to be diverse, with a variety of backgrounds and models, to better generalize to new video games.
The Qualcomm Rasterized Images for Super-resolution Processing dataset consists of sequences of computer-generated frames captured at 60 frames per second. For each frame, multiple modalities, including color, depth and motion vectors, are rendered at different resolutions ranging from 270p to 1080p. These modalities can be generated using default parameters, or mip-biased, jittered, or both mip-biased and jittered.
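Below is a sketch of how one low-resolution frame and its auxiliary modalities might be paired with the high-resolution color target for super-resolution training; the directory layout and file names are assumptions, not the dataset’s documented structure.

```python
from pathlib import Path

import numpy as np
from PIL import Image

# Hypothetical layout pairing a 270p rendered frame (plus auxiliary modalities)
# with its 1080p color target; actual file names and formats may differ.
frame_id = "scene01_frame0042"
root = Path("qrisp")

lr_color = np.asarray(Image.open(root / "270p/color" / f"{frame_id}.png"))
lr_depth = np.load(root / "270p/depth" / f"{frame_id}.npy")      # per-pixel depth
lr_motion = np.load(root / "270p/motion" / f"{frame_id}.npy")    # per-pixel motion vectors
hr_color = np.asarray(Image.open(root / "1080p/color" / f"{frame_id}.png"))

# A super-resolution model would consume (lr_color, lr_depth, lr_motion) and be
# supervised with hr_color as the upscaling target.
```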
We hope that you find these datasets useful and can’t wait to see what the AI community builds with them. Follow the GitHub account for more updates, as well as the Qualcomm Innovation Center and the Qualcomm AI Hub.
Lea Heusinger-Jonda
Staff Program Analyst, Qualcomm Technologies