EEVR (Emotion Elicitation in Virtual Reality) is a novel dataset specifically designed for language-supervision-based pre-training for emotion recognition tasks, such as valence and arousal classification. It features high-quality physiological signals, including electrodermal activity (EDA) and photoplethysmography (PPG), acquired through emotion elicitation via 360-degree virtual reality (VR) videos. It also includes subject-wise textual descriptions of the emotions experienced during each stimulus, gathered through qualitative interviews. The emotional stimuli were selected to induce diverse emotions across all four quadrants of Russell's circumplex model.
The dataset consists of recordings from 37 participants and is the first dataset to pair raw text with physiological signals, providing additional contextual information that objective labels alone cannot offer. To leverage this dataset, we introduce the Contrastive Language-Signal Pre-training (CLSP) method, which jointly learns representations from pairs of physiological signals and textual descriptions.
Our results show that integrating self-reported textual descriptions with physiological signals significantly improves performance on emotion recognition tasks such as arousal and valence classification. Moreover, our pre-trained CLSP model demonstrates strong zero-shot transferability to existing datasets, outperforming supervised baseline models, suggesting that the representations learned by our method are more contextualized and generalized. The dataset release also includes baseline models for arousal, valence, and emotion classification, as well as code for data cleaning and feature extraction.
Kindly fill out the following form to get access to the full dataset: form link.
Code for EEVR can be found here.
To underscore the importance of integrating textual descriptions in emotion recognition, we introduce the Contrastive Language-Signal Pre-training (CLSP) method for extracting more contextualized representations.
The model was trained on pairs of physiological signals and textual descriptions to learn a joint embedding space in which the two modalities are aligned using a contrastive loss function.
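As a concrete illustration, below is a minimal sketch of a CLIP-style symmetric contrastive objective over a batch of paired signal and text embeddings. It is not the released training code; the temperature value and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(signal_emb, text_emb, temperature=0.07):
    """signal_emb, text_emb: (batch, dim) embeddings of paired samples."""
    # L2-normalise so that dot products become cosine similarities.
    signal_emb = F.normalize(signal_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix; diagonal entries are the positive pairs.
    logits = signal_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: match each signal to its text and vice versa.
    loss_s2t = F.cross_entropy(logits, targets)
    loss_t2s = F.cross_entropy(logits.t(), targets)
    return (loss_s2t + loss_t2s) / 2
```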
Following pre-training, we evaluated the model's performance on test subject data
using the leave-one-subject-out cross-validation approach.
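For readers unfamiliar with this evaluation protocol, the sketch below shows leave-one-subject-out cross-validation using scikit-learn's LeaveOneGroupOut. The feature array, labels, classifier, and per-participant trial counts are illustrative placeholders, not the actual EEVR pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

# Illustrative stand-ins for per-trial features, labels, and participant IDs.
rng = np.random.default_rng(0)
X = rng.normal(size=(370, 64))            # e.g. extracted EDA/PPG features
y = rng.integers(0, 2, size=370)          # e.g. binary arousal labels
subjects = np.repeat(np.arange(37), 10)   # participant ID for each trial

logo = LeaveOneGroupOut()
scores = []
for train_idx, test_idx in logo.split(X, y, groups=subjects):
    # Train on all participants except one, test on the held-out participant.
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))
print(f"Mean LOSO accuracy: {np.mean(scores):.3f}")
```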
CLSP uses separate neural networks to process the two types of input data: physiological signals (PPG and EDA) and text. For signal data, it uses linear hidden layers of sizes 50 and 100, while for text, it applies a pre-trained DistilBERT model. The model then optimizes a contrastive objective that increases the similarity of positive pairs and decreases that of negative ones. We observed that CLSP significantly improved emotion recognition on the arousal and valence tasks compared to models trained only on signal data. This underscores the value of adding text-based supervision to enhance emotion representations learned from physiological signals.
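The sketch below, assuming PyTorch and Hugging Face transformers, illustrates such a two-branch architecture. Apart from the hidden-layer sizes of 50 and 100 described above, the input feature dimension, shared embedding size, and projection head are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
from transformers import DistilBertModel, DistilBertTokenizerFast

class CLSP(nn.Module):
    def __init__(self, signal_dim=64, embed_dim=128):
        super().__init__()
        # Signal branch: linear hidden layers of sizes 50 and 100 (as described above).
        self.signal_encoder = nn.Sequential(
            nn.Linear(signal_dim, 50), nn.ReLU(),
            nn.Linear(50, 100), nn.ReLU(),
            nn.Linear(100, embed_dim),
        )
        # Text branch: pre-trained DistilBERT with a linear projection head.
        self.text_encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)

    def forward(self, signal_feats, input_ids, attention_mask):
        sig = self.signal_encoder(signal_feats)
        out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        txt = self.text_proj(out.last_hidden_state[:, 0])  # [CLS]-position token
        return sig, txt  # feed both into the contrastive loss sketched earlier

# Example forward pass with dummy inputs.
model = CLSP()
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
batch = tokenizer(["I felt calm and at ease."], return_tensors="pt", padding=True)
sig_emb, txt_emb = model(torch.randn(1, 64), batch["input_ids"], batch["attention_mask"])
```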
EEVR: A Virtual Reality-Based Emotion Dataset Featuring Paired Physiological Signals and Textual Descriptions © 2024 by Pragya Singh, Pushpendra Singh is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/.
<!-- Pending peer review -->