EEVR: A Dataset of Paired Physiological Signals and Textual Descriptions for Joint Emotion Representation Learning

Pragya Singh1, Ritvik Budhiraja1, Ankush Gupta1, Anshul Goswami1, Mohan Kumar2, and Pushpendra Singh1
1 IIIT-D, New Delhi, India
2 RIT, Rochester, New York, USA

EEVR (Emotion Elicitation in Virtual Reality) is a novel dataset for language supervision-based pre-training and emotion recognition tasks

Introduction

EEVR (Emotion Elicitation in Virtual Reality) is a novel dataset designed for language supervision-based pre-training for emotion recognition tasks, such as valence and arousal classification. It features high-quality physiological signals, including electrodermal activity (EDA) and photoplethysmography (PPG), acquired while emotions were elicited with 360-degree virtual reality (VR) videos. It also includes subject-wise textual descriptions of the emotions experienced during each stimulus, gathered through qualitative interviews. The emotional stimuli were selected to induce diverse emotions spanning all four quadrants of Russell's circumplex model.
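For illustration, one paired record in the dataset can be thought of as the following structure. This is a minimal sketch only; the field names are hypothetical and do not reflect the released file schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class EEVRSample:
    """One stimulus recording for one participant (field names are illustrative)."""
    participant_id: str
    video_id: str       # 360-degree VR stimulus identifier
    eda: np.ndarray     # electrodermal activity, shape (n_samples,)
    ppg: np.ndarray     # photoplethysmography, shape (n_samples,)
    description: str    # self-reported textual description from the interview
    valence: float      # self-reported valence rating
    arousal: float      # self-reported arousal rating
```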

The dataset consists of recordings from 37 participants and is the first to pair raw text with physiological signals, providing contextual information that categorical labels alone cannot offer. To leverage this pairing, we introduce the Contrastive Language-Signal Pre-training (CLSP) method, which jointly learns representations from pairs of physiological signals and textual descriptions.

Our results show that integrating self-reported textual descriptions with physiological signals significantly improves performance on emotion recognition tasks, such as arousal and valence classification. Moreover, our pre-trained CLSP model demonstrates strong zero-shot transferability to existing datasets, outperforming supervised baseline models and suggesting that the representations learned by our method are more contextualized and generalizable. The release also includes baseline models for arousal, valence, and emotion classification, along with code for data cleaning and feature extraction.

Please fill out the following form to get access to the full dataset: form link.
The code for EEVR can be found here.

Contrastive Language-Signal Pre-training (CLSP) method

The architecture of the Contrastive Language-Signal Pre-training (CLSP) method

To underscore the importance of integrating textual descriptions in emotion recognition, we introduce the Contrastive Language-Signal Pre-training (CLSP) method for extracting more contextualized representations.

The model was trained on physiological signal-text pairs to learn a joint embedding space in which the two modalities are closely aligned using a contrastive loss function. After pre-training, we evaluated the model on held-out subjects using leave-one-subject-out cross-validation.
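A minimal sketch of leave-one-subject-out evaluation using scikit-learn's LeaveOneGroupOut; the features, labels, and classifier below are placeholders, not the actual CLSP pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(296, 64))            # placeholder signal embeddings
y = rng.integers(0, 2, size=296)          # placeholder binary arousal labels
subjects = rng.integers(0, 37, size=296)  # participant ID for each sample

# Each fold holds out every sample from one participant for testing.
scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
print(f"mean LOSO accuracy: {np.mean(scores):.3f}")
```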
CLSP uses separate encoders for the two input modalities: physiological signals (PPG and EDA) and text. The signal encoder is a feed-forward network with hidden layers of sizes 50 and 100, while the text encoder is a pre-trained DistilBERT model. The model optimizes a contrastive objective that increases the similarity of matched (positive) signal-text pairs and decreases the similarity of mismatched (negative) pairs. We observed that CLSP significantly improved arousal and valence recognition compared to models trained on signal data alone, underscoring the value of text-based supervision for learning emotion representations from physiological signals.
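The sketch below illustrates this two-encoder design in PyTorch, assuming a CLIP-style symmetric contrastive loss where matched pairs in a batch are positives. The hidden sizes (50 and 100) follow the description above; the embedding dimension, projection head, and temperature are illustrative assumptions, not the exact released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class SignalEncoder(nn.Module):
    """Feed-forward encoder for PPG/EDA features (hidden sizes 50 and 100)."""
    def __init__(self, in_dim, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 50), nn.ReLU(),
            nn.Linear(50, 100), nn.ReLU(),
            nn.Linear(100, embed_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class TextEncoder(nn.Module):
    """Pre-trained DistilBERT followed by a projection into the joint space."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.bert = AutoModel.from_pretrained("distilbert-base-uncased")
        self.proj = nn.Linear(self.bert.config.dim, embed_dim)

    def forward(self, input_ids, attention_mask):
        # Use the [CLS] token state as the sentence representation.
        h = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state[:, 0]
        return F.normalize(self.proj(h), dim=-1)

def contrastive_loss(sig_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss: matched pairs on the diagonal are positives."""
    logits = sig_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Example forward pass with toy inputs (dimensions are illustrative):
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
texts = ["I felt calm and relaxed.", "My heart was racing with excitement."]
batch = tok(texts, padding=True, return_tensors="pt")
sig = SignalEncoder(in_dim=32)(torch.randn(2, 32))
txt = TextEncoder()(batch["input_ids"], batch["attention_mask"])
loss = contrastive_loss(sig, txt)
```

Under this CLIP-style formulation, zero-shot transfer works by embedding label prompts (e.g., "high arousal" vs. "low arousal") with the text encoder and assigning each signal to the nearest prompt embedding, with no task-specific training.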

Results for the physiological baseline without text (hand-crafted features + NN) and with text (CLSP) on 296 text-signal pairs (seed = 43, epochs = 15)

License

EEVR: A Virtual Reality-Based Emotion Dataset Featuring Paired Physiological Signals and Textual Descriptions © 2024 by Pragya Singh, Pushpendra Singh is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/.

BibTeX

Pending peer review.