BASELINE

Neural signals only become meaningful when interpreted relative to themselves.

Why EEG fails in the wild

Most EEG research optimizes for benchmark accuracy on controlled lab datasets. Deployment reveals the gap between academic performance and real-world robustness.

Core hypothesis

"Subject-specific covariance normalization can partially stabilize feature distributions across sessions without requiring retraining of downstream models."

01

Inter-subject Variability

Resting alpha power can differ by 3× across individuals. Population-level models trained on pooled data are structurally biased against everyone they claim to represent.

~3× difference in resting α-power across subjects
02

Longitudinal Drift

EEG features shift gradually across sessions due to electrode impedance changes, circadian effects, and cognitive load history. Static calibrations become unreliable within hours.

Detectable drift in as few as 3 sessions
03

Wearable Constraints

Consumer EEG devices offer 2–4 channels, noisy signals, and dry (gel-free) electrodes. Clinical preprocessing pipelines designed for 64-channel lab setups fail in deployment.

2–4 channels vs 64–256 in research
04

Limited Calibration

Real-world users will not sit through 30-minute calibration sessions. Systems that need large labeled datasets per user are commercially non-viable.

< 5 min acceptable calibration time

Studying the stability–utility gap

The central question is not whether alignment works, since it measurably does, but whether improved stability translates to improved decoding performance. This evaluation framework tests three alignment strategies under identical wearable-constrained conditions and measures both drift reduction and classification accuracy independently.

Research tension
EEG features drift across sessions (observed)
Alignment reduces distributional drift (confirmed)
Reduced drift ≠ improved accuracy (finding)
Stability and utility are dissociable (hypothesis)
01
Baseline Estimation
μ, Σ = fit(X_calib)

From a calibration session, estimate the subject-specific mean vector μ and regularized covariance matrix Σ, capturing the individual's neural feature distribution at a reference point in time.
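The estimation step above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `fit_baseline`, the `shrinkage` parameter, and the choice of shrinking toward a scaled identity matrix are all assumptions, since the text only says the covariance is "regularized".

```python
import numpy as np

def fit_baseline(X_calib, shrinkage=0.1):
    """Estimate the mean and a shrinkage-regularized covariance.

    X_calib: array of shape (n_trials, n_features) from one calibration session.
    shrinkage: hypothetical regularization weight in [0, 1] (not from the source).
    """
    X = np.asarray(X_calib, dtype=float)
    n_features = X.shape[1]
    mu = X.mean(axis=0)
    sigma = np.atleast_2d(np.cov(X, rowvar=False))
    # Shrink toward a scaled identity so sigma stays invertible with few trials.
    target = np.eye(n_features) * np.trace(sigma) / n_features
    sigma_reg = (1.0 - shrinkage) * sigma + shrinkage * target
    return mu, sigma_reg

# Synthetic stand-in for a 4-feature calibration session.
rng = np.random.default_rng(0)
X_calib = rng.normal(size=(50, 4))
mu, sigma = fit_baseline(X_calib)
```

Shrinkage matters most when the number of calibration trials is small relative to the number of features, which is the regime a short calibration session implies.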

02
Covariance Alignment
z = Σ⁻¹/² (x − μ)

New sessions are projected into a covariance-normalized feature space. Mahalanobis deviation from the calibration baseline quantifies how far the current session has drifted from the subject's reference distribution.
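A minimal sketch of the projection z = Σ⁻¹/² (x − μ), assuming Σ is symmetric positive definite and using an eigendecomposition for the inverse square root. The function names and the synthetic data are illustrative only.

```python
import numpy as np

def whitening_matrix(sigma):
    """Symmetric inverse square root of a positive-definite covariance."""
    eigvals, eigvecs = np.linalg.eigh(sigma)
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

def whiten(X, mu, sigma):
    """Apply z = sigma^(-1/2) (x - mu) to each row of X."""
    W = whitening_matrix(sigma)
    return (np.asarray(X, dtype=float) - mu) @ W.T

def mahalanobis(X, mu, sigma):
    """Mahalanobis distance from the baseline, i.e. the Euclidean norm of z."""
    return np.linalg.norm(whiten(X, mu, sigma), axis=1)

# Synthetic correlated features standing in for a calibration session.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0, 0.0],
                   [0.0, 1.0, 0.3, 0.0],
                   [0.0, 0.0, 1.5, 0.2],
                   [0.0, 0.0, 0.0, 1.0]])
X_calib = rng.normal(size=(500, 4)) @ mixing.T + 3.0
mu = X_calib.mean(axis=0)
sigma = np.cov(X_calib, rowvar=False)
Z = whiten(X_calib, mu, sigma)  # zero mean, identity covariance on calibration data
```

On the calibration data itself the whitened features have zero mean and identity covariance by construction; on a later session, any departure from that is the drift being quantified.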

03
Temporal Adaptation
μₜ = (1−α)μₜ₋₁ + αxₜ

Slow drift is tracked via exponential moving-average recalibration. The estimated baseline gradually follows the subject's evolving distribution, at the cost of potentially over-adapting to transient states.
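The update rule μₜ = (1−α)μₜ₋₁ + αxₜ can be sketched in a few lines. The value α = 0.05 below is an arbitrary illustration, not a setting reported for this evaluation.

```python
import numpy as np

def ema_update(mu_prev, x_t, alpha=0.05):
    """One step of mu_t = (1 - alpha) * mu_{t-1} + alpha * x_t."""
    mu_prev = np.asarray(mu_prev, dtype=float)
    x_t = np.asarray(x_t, dtype=float)
    return (1.0 - alpha) * mu_prev + alpha * x_t

# The baseline starts at 0 while every new sample sits at 1:
# the estimate moves toward the new level at a rate set by alpha.
mu = np.zeros(2)
first_step = ema_update(mu, np.ones(2), alpha=0.05)
for x_t in np.ones((200, 2)):
    mu = ema_update(mu, x_t, alpha=0.05)
```

A larger α follows drift faster but also chases transient states, which is the over-adaptation risk noted above.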

04
Downstream Evaluation
f(z) → accuracy

Aligned features replace raw features as input to a downstream classifier. The research question: does reducing distributional drift improve decoding performance, or are the two quantities dissociable?

What alignment improves
Centroid drift (L₂ distance between session means)
Feature variance across sessions
Cross-session distributional stability
What alignment does not improve
Within-session class separability (Fisher ratio)
Downstream classification accuracy
The signal-to-noise ratio of the decoding task
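To make the contrast between the two lists concrete, here is a hedged sketch of a per-feature, two-class Fisher ratio. The two-class form and the synthetic data are simplifying assumptions. The example shows only that removing a session-wide offset, which is what centroid alignment does, leaves this measure of within-session separability unchanged.

```python
import numpy as np

def fisher_ratio(X, y):
    """Per-feature two-class Fisher ratio: (m0 - m1)^2 / (v0 + v1)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X0, X1 = X[y == 0], X[y == 1]
    between = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    within = X0.var(axis=0, ddof=1) + X1.var(axis=0, ddof=1)
    return between / within

# Synthetic two-class session with a small class offset on every feature.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 4)) + y[:, None] * 0.5

ratio_raw = fisher_ratio(X, y)
# Subtracting the session mean shifts both classes equally,
# so the class-mean gap and the class variances are untouched.
ratio_centered = fisher_ratio(X - X.mean(axis=0), y)
```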
Research question

"Does subject-specific covariance normalization stabilize EEG feature distributions across sessions, and if so, does that stability translate to improved downstream decoding performance under wearable deployment constraints?"

The answer found here: yes to the first, no to the second. The dissociation between these two outcomes is the principal finding and the motivation for further investigation.

What the data shows

The methods were evaluated on real EEG data from BNCI Horizon 2020 (dataset 001-2014), subject A01. Models were trained on session A01T and evaluated on the later session A01E. A 4-channel wearable-constrained subset was used to simulate realistic deployment conditions.

Central finding

"Covariance normalization substantially reduced longitudinal feature instability, though reduced drift did not necessarily improve downstream decoding accuracy."

Feature Shift, Raw
4.06
L₂ drift, A01T→A01E
Feature Shift, Whitened
0.22
↓ 94.6% reduction
Classification Accuracy
57.6%
raw = whitened (no decoding gain)
Moving Avg Accuracy
51.4%
adaptation reduced performance

Feature shift is measured as the L₂ distance between session-mean feature vectors (train A01T vs test A01E). Covariance whitening reduced this shift by 94.6%, from 4.06 to 0.22. Moving-average adaptation achieved a smaller reduction (86.5%). The drift reduction from whitening is the clearest finding in the evaluation.
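The shift metric and the reported percentage can be checked with a short sketch. The function names are illustrative; only the values 4.06 and 0.22 come from the evaluation above.

```python
import math

def centroid_shift(X_train, X_test):
    """L2 distance between the mean feature vectors of two sessions."""
    mean_train = [sum(col) / len(X_train) for col in zip(*X_train)]
    mean_test = [sum(col) / len(X_test) for col in zip(*X_test)]
    return math.dist(mean_train, mean_test)

def percent_reduction(before, after):
    """Relative reduction in shift, in percent."""
    return 100.0 * (before - after) / before

# Toy sessions: means (1, 0) and (4, 4) are a distance of 5 apart.
toy_shift = centroid_shift([[0.0, 0.0], [2.0, 0.0]], [[4.0, 4.0], [4.0, 4.0]])

# Reported raw and whitened shifts: (4.06 - 0.22) / 4.06.
reduction = percent_reduction(4.06, 0.22)
```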

01
Drift was real and measurable

Raw feature shift of 4.06 confirms longitudinal instability under wearable constraints. This is the core problem Baseline is designed to address.

02
Whitening stabilized features

Covariance normalization reduced shift from 4.06 to 0.22, a 94.6% reduction, demonstrating effective statistical alignment.

03
Accuracy was not improved

Feature stabilization did not translate to decoding gains. This dissociation suggests the downstream model absorbed the remaining variance independently.

Four views into one problem

Each module is a different lens on the same phenomenon: longitudinal distribution shift in EEG systems. All simulations are deterministic, seed-controlled for reproducibility, and run in the browser with no backend required.

Module I

Drift Geometry Lab

Visualize session-to-session feature space drift and the limits of alignment

Core insight

Alignment reduces centroid drift. It does not improve class separability.

What these modules share

Across all four views, the same structural finding emerges: EEG feature distributions shift across time in ways that are measurable and partially correctable. Correction and improvement are not the same thing.

01 Drift is real and measurable across sessions
02 Alignment methods reduce distributional shift
03 Reduced drift does not guarantee better decoding
04 Stability and utility are not identical
Öykü Nur Kesek
Student Researcher

Building infrastructure for minds at the margin of measurement.

I'm interested in the gap between laboratory neuroscience and real-world neurotechnology, specifically why EEG systems that perform well in controlled environments fail to remain stable across time, users, and deployment conditions.

My current work centers on BASELINE, a research investigation into longitudinal distribution shift in wearable EEG systems. The core question: can subject-specific statistical alignment stabilize feature distributions across sessions, and if it can, does that stability actually improve decoding performance? The answer, as the data shows, is that these two things are dissociable.

My interests sit at the intersection of EEG signal processing, statistical learning, and computational neuroscience infrastructure, with a focus on the systems-level constraints that govern real-world biosignal deployment.

Research interests

Themes that cut across the technical work. The questions that make the engineering decisions feel necessary.

01 Wearable EEG deployment
02 Longitudinal signal adaptation
03 Subject-specific statistical alignment
04 Systems thinking in biosignal ML
05 BCI under constraint

Personal Baseline Deviation vs. Population Classification for Wearable EEG Stress Detection: A Pilot Study

Comparing unsupervised subject-specific personalisation against supervised cross-subject gradient boosting under two-electrode temporal constraints · SAM40 dataset · n = 40

Abstract

This study tests whether an unsupervised personal baseline deviation model outperforms a supervised population classifier for EEG stress detection under wearable-constrained conditions, using two temporal electrodes (T7/T8) to simulate consumer behind-ear devices. On the SAM40 dataset (40 subjects, 32-channel EEG, 128 Hz), the personal baseline model significantly outperformed the population classifier in the wearable condition (accuracy: 0.611 vs 0.538, p=0.025, r=0.355) and full 32-channel condition (accuracy: 0.693 vs 0.619, p=0.044, r=0.318). SHAP analysis identified temporal alpha differential entropy as the dominant stress biomarker, 2.4× more important than any other band. Alpha suppression occurred in 27/40 subjects; the remaining 13/40 showing enhancement represent a subpopulation for whom directionally rigid models fail. Results constitute an upper bound on real wearable performance as preprocessing used all 32 channels before electrode extraction.

Keywords: wearable EEG, personalization, stress detection, longitudinal variability, differential entropy, brain-computer interface

Psychological stress is a growing public health concern, and its objective measurement through physiological signals has attracted significant research attention. Electroencephalography (EEG) is an effective tool for identifying stress because it detects the cognitive aspects of stress before peripheral reactions such as changes in heart rate or skin conductance emerge [1]. Although EEG wearables have made ambulatory brain monitoring increasingly accessible, two fundamental problems prevent reliable deployment.

The first is inter-subject variability: resting alpha power, peak frequency, and spectral distributions vary significantly among individuals [1, 2]. Population classifiers trained on patterns averaged across many subjects learn a mean that may not represent any individual accurately, causing systematic misclassification in subjects whose baseline deviates from the group mean. The second is the electrode constraint: devices such as the Emotiv MN8 use only two electrodes, placed behind the ear at temporal positions T7 and T8. Detection algorithms, however, are typically developed for 32-channel laboratory systems, and their reported results do not indicate how far performance deteriorates under this constraint.

This study addresses both problems simultaneously. We compare an unsupervised personal baseline deviation model against a supervised gradient-boosting population classifier under T7/T8-restricted and full 32-channel conditions. Previous work has benchmarked cross-subject classifiers on public datasets including DEAP [9], but the specific question of whether the personalisation advantage persists under two-channel wearable constraints has not been quantified with proper leave-one-subject-out evaluation. This is Layer 1 of a three-layer research program: pilot analysis on public data (this study), original data collection with validated stress induction (Layer 2), and real wearable hardware validation (Layer 3).
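For readers unfamiliar with leave-one-subject-out evaluation, the splitting logic can be sketched as below. This is a generic illustration with hypothetical names and toy subject labels, not the study's pipeline: each subject is held out once, and the population classifier never sees the test subject during training.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out, train_idx, test_idx), holding out each subject once."""
    for held_out in sorted(set(subject_ids)):
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx

# Toy example: six trials from three subjects.
subject_ids = [1, 1, 2, 2, 3, 3]
folds = list(leave_one_subject_out(subject_ids))
```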

Primary literature informing the evaluation design and contextualising the findings within EEG adaptation research.

[1] Saha S, Baumert M. Intra- and Inter-subject Variability in EEG-Based Sensorimotor Brain Computer Interface: A Review. Front Comput Neurosci. 2020;13:87. doi: 10.3389/fncom.2019.00087
[2] Apicella A, et al. Toward cross-subject and cross-session generalization in EEG-based emotion recognition: Systematic review, taxonomy, and methods. Neurocomputing. 2024;604. https://doi.org/10.1016/j.neucom.2024.128354
[3] Fdez J, Guttenberg N, Witkowski O, Pasquali A. Cross-Subject EEG-Based Emotion Recognition Through Neural Networks With Stratified Normalization. Front Neurosci. 2021;15:626277. doi: 10.3389/fnins.2021.626277
[4] Schapkin SA, Raggatz J, Hillmert M, Böckelmann I. EEG correlates of cognitive load in a multiple choice reaction task. Acta Neurobiol Exp (Wars). 2020;80(1):76-89. PMID: 32214277.
[5] Vos G, Ebrahimpour M, van Eijk L, Sarnyai Z, Rahimi Azghadi M. Stress monitoring using low-cost electroencephalogram devices: A systematic literature review. Int J Med Inform. 2025 Jun;198:105859. doi: 10.1016/j.ijmedinf.2025.105859. Epub 2025 Mar 6. PMID: 40056845.
[6] Berretz G, Packheiser J, Wolf OT, Ocklenburg S. Acute stress increases left hemispheric activity measured via changes in frontal alpha asymmetries. iScience. 2022 Feb 1;25(2):103841. doi: 10.1016/j.isci.2022.103841. PMID: 35198894; PMCID: PMC8850739.
[7] Athavipach C, Pan-Ngum S, Israsena P. A Wearable In-Ear EEG Device for Emotion Monitoring. Sensors (Basel). 2019 Sep 17;19(18):4014. doi: 10.3390/s19184014. PMID: 31533329; PMCID: PMC6767669.
[8] Moumane H, Pazuelo J, Nassar M, Juez JY, Valderrama M, Le Van Quyen M. Signal quality evaluation of an in-ear EEG device in comparison to a conventional cap system. Front Neurosci. 2024;18:1441897. doi: 10.3389/fnins.2024.1441897
[9] Koelstra S, Mühl C, Soleymani M, Lee J-S, Yazdani A, Ebrahimi T, Pun T, Nijholt A, Patras I. DEAP: A Database for Emotion Analysis Using Physiological Signals. IEEE Transactions on Affective Computing. 2011;3:18-31. doi: 10.1109/T-AFFC.2011.15
[10] Lebepe F, Niezen G, Hancke G, Ramotsoela D. Wearable stress monitoring system using multiple sensors. 2016:895-898. doi: 10.1109/INDIN.2016.7819288
[11] Ju X, Li M, Tian W, Hu D. EEG-based emotion recognition using a temporal-difference minimizing neural network. Cognitive Neurodynamics. 2023;18. doi: 10.1007/s11571-023-10004-w
[12] Kirschbaum C, Pirke KM, Hellhammer DH. The 'Trier Social Stress Test' – a tool for investigating psychobiological stress responses in a laboratory setting. Neuropsychobiology. 1993;28(1-2):76-81. doi: 10.1159/000119004. PMID: 8255414.
[13] Gramfort A, Luessi M, Larson E, Engemann DA, Strohmeier D, Brodbeck C, Goj R, Jas M, Brooks T, Parkkonen L, Hämäläinen M. MEG and EEG data analysis with MNE-Python. Front Neurosci. 2013;7:267. doi: 10.3389/fnins.2013.00267
[14] Lundberg S, Lee S-I. A Unified Approach to Interpreting Model Predictions. 2017. doi: 10.48550/arXiv.1705.07874

Limitations, caveats, and what comes next

Research-grade work requires honest accounting of its boundaries. These notes document where Baseline's assumptions hold, where they break down, and what the real evaluation revealed.

N01

On the scope of this evaluation

Experiments use real EEG data from BNCI Horizon 2020 (001-2014), single subject A01, evaluated across two sessions (A01T to A01E). Single-subject evaluation is a known limitation. The observed drift reduction may not generalize across subjects, devices, or tasks. These results are proof-of-concept, not deployment-ready benchmarks.

N02

Method limitations

Covariance whitening assumes the training session covariance is representative of the long-run statistic. In practice, a single session may under-sample the distribution. Moving-average adaptation can over-fit to short-term transient states if the decay rate is too aggressive relative to the actual drift timescale.
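One way to reason about whether a decay rate is "too aggressive" is to convert it into a memory length. Under the exponential moving average μₜ = (1−α)μₜ₋₁ + αxₜ, a sample seen k updates ago carries weight α(1−α)ᵏ, so its influence halves every ln(0.5)/ln(1−α) updates. The sketch below computes that half-life; the α values are illustrative and not taken from the evaluation.

```python
import math

def ema_half_life(alpha):
    """Number of updates after which an old sample's weight has halved."""
    return math.log(0.5) / math.log(1.0 - alpha)

def ema_weights(alpha, n):
    """Weights on the n most recent samples, newest first."""
    return [alpha * (1.0 - alpha) ** k for k in range(n)]

fast = ema_half_life(0.5)    # memory of a single update
slow = ema_half_life(0.05)   # memory of roughly 13 to 14 updates
weights = ema_weights(0.05, 400)
```

If the half-life is much shorter than the timescale over which the subject's distribution actually drifts, the baseline will mostly track transient states rather than slow change.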

N03

What this system does not claim

Baseline is not a classifier. It makes no clinical claims about cognitive state, mental health, or neurological function. The dissociation between feature stability and decoding accuracy observed here should be interpreted as a constraint on what statistical alignment alone can provide.

N04

Open questions

Why did covariance whitening fail to improve accuracy despite 94.6% drift reduction? Is the discriminative signal for motor imagery orthogonal to the high-variance drift directions? Can Riemannian alignment outperform covariance whitening on multi-subject longitudinal evaluation? What is the minimum viable calibration protocol for wearable BCI deployment?
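The orthogonality question could be probed with a simple diagnostic: the cosine between the session-mean drift vector and the class-mean difference. The sketch below is a two-class simplification on toy data, offered only as one possible starting point; the function name and example values are hypothetical.

```python
import numpy as np

def drift_discriminant_cosine(X_train, y_train, X_test):
    """|cos| of the angle between the drift vector and the class-mean difference."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    drift = X_test.mean(axis=0) - X_train.mean(axis=0)
    discriminant = (X_train[y_train == 1].mean(axis=0)
                    - X_train[y_train == 0].mean(axis=0))
    cos = drift @ discriminant / (np.linalg.norm(drift) * np.linalg.norm(discriminant))
    return abs(float(cos))

# Toy data: classes differ along feature 0.
X_train = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
y_train = np.array([0, 0, 1, 1])

# Drift along feature 1 is orthogonal to the class difference.
cos_orthogonal = drift_discriminant_cosine(X_train, y_train, X_train + [0.0, 3.0])
# Drift along feature 0 is aligned with it.
cos_aligned = drift_discriminant_cosine(X_train, y_train, X_train + [2.0, 0.0])
```

A value near 0 would be consistent with drift living in directions that carry little class information; a value near 1 would suggest the opposite.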

Future directions

The current evaluation establishes a methodological baseline. The deeper question it opens: how should future neurotechnology systems adapt to humans as continuously evolving biological distributions rather than static users?

01
Continuous adaptation without recalibration

Systems capable of updating personal neural representations longitudinally without requiring explicit recalibration sessions — treating alignment as a persistent background process rather than an upfront cost.

02
Calibration-light wearable neurotechnology

Wearable EEG systems designed around passive adaptation and minimal user burden. Practical deployment imposes hard constraints on setup time; the alignment layer should absorb longitudinal variation silently.

03
Persistent personal neural embeddings

Long-term subject-specific representation spaces capable of tracking gradual cognitive and physiological change across months or years. Whether such embeddings remain discriminative at that timescale is an open empirical question.

04
Uncertainty-aware neural interfaces

Adaptive systems that estimate signal reliability and drift in real time, flagging when alignment has degraded rather than silently producing uncertain predictions under noisy real-world conditions.

05
Hardware and software co-design

Wearable neurotechnology designed jointly with adaptive alignment infrastructure, rather than treating signal instability purely as a post-processing residual. Electrode placement, signal conditioning, and adaptation as a unified system.

06
Federated personalization

Privacy-preserving personalization allowing wearable neural devices to improve longitudinally without centralized storage of raw neural data — a necessary constraint for any deployment at population scale.

07
Adaptive human-centered AI

Whether future AI systems interacting with biological users require temporally adaptive representations rather than fixed assumptions about users. BASELINE examines one constrained, measurable piece of that larger question.

BASELINE
EEG Feature Alignment · Research Prototype · 2026
Not a medical device. Not a diagnostic tool. A research concept exploring lightweight personalization under wearable EEG deployment constraints.