Synesthesia is associated with distinctive patterns in dream content

Authors: Emily Cook and Kyle Napierkowski

See the full code in this repository on GitHub: https://github.com/knaps/synesthesia-dreams

Abstract

Dreams offer insight into how individual differences shape conscious experience in the absence of external input or task demands. This study examines whether synesthesia is linked to distinct patterns in dream content, which would suggest underlying differences in cognitive architecture. Leveraging the statistical power of large-scale, naturalistic data, we analyzed 2,337 dream reports from Reddit, comparing 1,169 reports from self-identified synesthetes with 1,168 matched controls. Logistic regression on semantic embeddings achieved modest classification performance, indicating group-level differences in language use. Topic modeling revealed four themes (digital, interpersonal regret, diverse worlds, and violent conflict) that were significantly more prevalent in synesthete dreams. These results suggest that trait-level cognitive organization, as expressed in synesthetic perception, extends across states of consciousness and shapes the thematic content of dreams. The findings support theoretical accounts of dreaming as continuous with waking cognition and demonstrate how stable neurocognitive traits manifest in unstructured, self-generated thought.

Repository Contents

This repository includes the Python scripts used for data processing, analysis, and modeling described in the paper:

synesthesia.py: Main script for loading data (synesthete and baseline dream reports), preparing data (embeddings, PCA), training and evaluating classification models (Logistic Regression, Random Forest), performing hyperparameter tuning (GridSearchCV), running SHAP analysis, and conducting thematic analyses using keyword matching and BERTopic models.
bertopic_trainer.py: Script for training the base BERTopic model on a large corpus of dream reports, saving the model, and providing utilities for analyzing dream topics. Includes functions for loading data from a database, sentence splitting, and model persistence.
bertopic_model/: Directory containing the pre-trained BERTopic model (dreams_sentence_model) used for topic analysis in synesthesia.py.
requirements.txt: Lists the necessary Python packages to run the code.
.env (example): Configuration file template for environment variables (e.g., API keys, database URLs). Note: The actual .env file is not included for security reasons.
Data Files (Not Included): The raw dream report data (synesthesia_dreams.csv, baseline data) and intermediate processed files (synesthesia_processed.pkl, baseline_processed.pkl, gridsearch_results.joblib) are not included in this repository due to size and privacy considerations but were generated/used by the scripts.

Methodology Overview

The analysis pipeline involved:

Data Collection: Sourcing dream reports from Reddit (r/dreams) and identifying synesthete authors based on activity in r/Synesthesia. Date-matched controls were selected.
Feature Generation: Creating semantic embeddings (OpenAI text-embedding-3-large) for each dream report.
Classification: Training Logistic Regression and Random Forest models to distinguish between synesthete and control dreams based on embeddings. Hyperparameter tuning was performed using GridSearchCV.
Topic Modeling: Applying a pre-trained BERTopic model (trained on a larger dream corpus using bertopic_trainer.py) to identify latent topics in the synesthete/control dataset.
Theme Analysis: Hierarchically clustering the fine-grained BERTopic topics into broader themes and comparing theme prevalence between groups using Chi-squared tests.
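
The group comparison in the Theme Analysis step can be illustrated with a small standard-library sketch of a Pearson chi-squared test on a 2x2 table (theme present or absent, by group). This is not the code in synesthesia.py: the function name `chi2_2x2` and the theme counts are hypothetical, and only the group sizes (1,169 and 1,168) come from the paper. The sketch assumes no continuity correction, which may differ from the setting used in the actual analysis.

```python
import math


def chi2_2x2(a_with, a_total, b_with, b_total):
    """Pearson chi-squared test (1 degree of freedom, no continuity correction).

    a_with / b_with: number of reports in each group that contain the theme.
    a_total / b_total: total number of reports in each group.
    Returns (chi2 statistic, p-value).
    """
    observed = [
        [a_with, a_total - a_with],
        [b_with, b_total - b_with],
    ]
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    n = sum(row_sums)
    if 0 in row_sums or 0 in col_sums:
        raise ValueError("every row and column needs at least one report")

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts for one theme; group sizes are taken from the abstract.
chi2, p_value = chi2_2x2(a_with=120, a_total=1169, b_with=80, b_total=1168)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

Running one such test per theme yields one p-value per theme; see the paper for how the reported results were actually computed.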

Refer to the full paper for detailed methodology, results (including Tables 1-3), discussion, and limitations.

How to Use

Clone the repository.
Set up a Python environment and install dependencies:
```
pip install -r requirements.txt
```
Create a .env file that defines the required environment variables (e.g., a database connection string, and an OpenAI API key if needed for re-analysis).
Obtain the necessary data files (the dream reports, plus a pre-trained model if not using the included one).
Run the scripts (bertopic_trainer.py to train a new model, synesthesia.py for the main analysis workflow). Note: Running the full analysis requires access to the original datasets and potentially significant computational resources.
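
For the .env step above, the following is a minimal sketch of how KEY=VALUE settings could be read into the environment using only the standard library. The variable names `OPENAI_API_KEY` and `DATABASE_URL` and the helpers `parse_env` and `load_env` are assumptions for illustration; check the scripts for the names they actually read. Projects often use the python-dotenv package for this, but whether requirements.txt includes it is not stated here.

```python
import os


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env(path=".env"):
    """Load a .env file into os.environ without overwriting existing variables."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as handle:
        values = parse_env(handle.read())
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values


# Hypothetical file contents with placeholder values.
example = """
# Credentials used by the analysis scripts (names are assumptions)
OPENAI_API_KEY=sk-placeholder
DATABASE_URL="postgresql://user:password@localhost:5432/dreams"
"""
settings = parse_env(example)
print(sorted(settings))
```

Calling `load_env()` from the repository root before running the scripts would make the values available through `os.environ`.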

This code is provided for transparency and reproducibility of the methods described in the publication.