Datasets


This page provides an overview of datasets developed and curated as part of my research. The collection reflects work on spoken dialogue, voice assistants, health-related speech, and multimodal interaction.

Some datasets are publicly documented and partially accessible, while others are available only under restricted access due to ethical, legal, or privacy-related constraints.


GerParlDia-MM

A multimodal corpus of German parliamentary speeches (1949–2025), designed for longitudinal research on voice, language, and rhetorical change across decades.


Queer Waves

A dataset related to voice, identity, and queer perspectives in speech communication research. It supports research on inclusive speech technologies and on sociophonetic and sociotechnical questions.


CCC – Common Cold Corpus

A speech corpus designed for the analysis of cold-related voice changes and health-related speech phenomena. It supports work on speech under physiological variation and health-aware speech processing.


RBC – Restaurant Booking Corpus

A corpus for spoken dialogue research in the context of restaurant booking scenarios. It supports work on task-oriented dialogue, spoken language understanding, and conversational system evaluation.


VACC – Voice Assistant Conversations Corpus

A dataset capturing voice assistant interactions in more naturalistic or real-world settings. It is particularly relevant for studying spontaneous use, conversational patterns, and device-directed speech in everyday environments.


VAWC – Voice Assistant Conversations in the Wild

A dataset capturing voice assistant interactions in more naturalistic or real-world settings. It is particularly relevant for studying spontaneous use, conversational patterns, and device-directed speech in everyday environments.


iGF-Corpus – Integrated Health and Fitness Corpus

A corpus developed in the context of health, fitness, and speech-related behavioral or physiological data. It supports multimodal analyses at the intersection of speech, activity, and health-related signals.


LMC – Last Minute Corpus

A corpus related to spontaneous, time-constrained, or dynamically produced speech. It can support analyses of urgency, spontaneity, and speech behavior in less scripted interaction scenarios.


Notes on Access and Reuse

Please note that not all datasets can be publicly redistributed in full. In several cases, access to audio, transcripts, or annotations is restricted due to consent conditions, privacy considerations, or ethical constraints.

Where possible, this website provides:

If you are interested in a specific dataset, please refer to the corresponding link.