Low-resource Swiss German TTS

Text vs. phoneme intermediates for Swiss German speech synthesis

Reza Kakooee, Vincenzo Timmel, Daniel Perruchoud, Michael Graber, and Manfred Vogel

Institute for Data Science, School of Computer Science, FHNW


DE Text → CH Voice

DE-TTS: direct High German to Swiss German speech

CH-TTS: translate to Swiss German text, then synthesize

PH-TTS: convert to fused phoneme strings, then synthesize

Swiss German has no standardized orthography, while practical user input is often written in High German. This paper asks which intermediate representation should sit between High German input and Swiss German speech: High German text, Swiss German text, or phoneme strings.

Experiment

Three routes from High German text to Swiss German voice

The same task is evaluated through three different intermediate representations. The pipeline figure shows where translation, phoneme conversion, synthesis, and closed-loop STT scoring enter the system.

Route 01

DE-TTS

Use High German text directly as the conditioning input for Swiss German speech.

Route 02

CH-TTS

Translate High German to Swiss German text first, then synthesize Swiss German speech.

Route 03

PH-TTS

Convert speech-derived phoneme supervision into fused word-level phoneme strings, then synthesize Swiss German speech from them.
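The three routes differ only in the string handed to the TTS backbone. The following is a minimal sketch of that dispatch, not the paper's implementation: `build_intermediate`, `Translator`, `Phonemizer`, and the stub functions are hypothetical names, and it assumes the PH-TTS phoneme strings are produced from the High German input at inference time.

```python
from typing import Callable

# Hypothetical component signatures; none of these names come from the paper.
Translator = Callable[[str], str]   # High German text -> Swiss German text
Phonemizer = Callable[[str], str]   # High German text -> fused phoneme string (assumed)


def build_intermediate(route: str, de_text: str,
                       translate: Translator, phonemize: Phonemizer) -> str:
    """Return the string the TTS backbone is conditioned on for a given route."""
    if route == "DE-TTS":
        return de_text                 # High German text is used directly
    if route == "CH-TTS":
        return translate(de_text)      # Swiss German text intermediate
    if route == "PH-TTS":
        return phonemize(de_text)      # fused phoneme-string intermediate
    raise ValueError(f"unknown route: {route!r}")


# Stub components so the sketch runs; real systems would be trained models.
def stub_translate(text: str) -> str:
    return f"<ch>{text}</ch>"


def stub_phonemize(text: str) -> str:
    return f"<ph>{text}</ph>"


for route in ("DE-TTS", "CH-TTS", "PH-TTS"):
    print(route, "->", build_intermediate(route, "Guten Morgen",
                                          stub_translate, stub_phonemize))
```

Keeping the synthesis step identical across routes is what lets differences in output quality be attributed to the intermediate representation.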

Backbones

A compact model and a speech-LLM baseline

The paper fine-tunes two contrasting TTS backbones to separate representation effects from synthesis capacity.

SpeechT5

A lightweight encoder-decoder baseline for data-efficient fine-tuning.

Orpheus

A Llama-based speech generation model used as the stronger synthesis baseline.

Full vs. half data

Orpheus is also trained with half the data to test robustness under lower supervision.
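The page does not say how the half-data subset was chosen. One common approach, shown here purely as an assumption, is a seeded random subsample of utterance IDs so that the reduced run is reproducible. The function name `half_subset` and the ID format are hypothetical.

```python
import random
from typing import Iterable, List


def half_subset(utterance_ids: Iterable[str], seed: int = 0) -> List[str]:
    """Return a reproducible random half of the given utterance IDs."""
    ids = sorted(utterance_ids)        # fix the order before shuffling
    rng = random.Random(seed)          # local RNG, does not touch global state
    rng.shuffle(ids)
    return sorted(ids[: len(ids) // 2])


all_ids = [f"utt_{i:04d}" for i in range(10)]   # illustrative IDs only
half = half_subset(all_ids, seed=0)
print(len(half), "of", len(all_ids), "utterances kept")
```

Sorting before the shuffle makes the result independent of the order in which the IDs were loaded.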

Results

Human listeners prefer the Swiss German text route

Closed-loop STT metrics are useful diagnostics, but they do not fully match human preference. Mean opinion scores (MOS) consistently rank CH-TTS highest among synthesized systems.

Best synthesized MOS: CH-TTS

Highest listener preference for both SpeechT5 and Orpheus.

SpeechT5 CH-TTS: 4.00 MOS

Close to the original recordings, which score 4.04 MOS.

Orpheus CH-TTS: 4.67 MOS

Near-original quality compared with 4.86 MOS for real recordings.

Half-data CH-TTS: 4.56 MOS

Most stable Orpheus branch when training data is reduced.
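The size of these gaps can be checked with simple arithmetic. The snippet below uses only the MOS values quoted above; the variable names are illustrative.

```python
# MOS values as reported on this page.
mos = {
    "SpeechT5": {"CH-TTS": 4.00, "original": 4.04},
    "Orpheus": {"CH-TTS": 4.67, "original": 4.86},
}
orpheus_half_ch = 4.56   # Orpheus CH-TTS trained on half the data


def gap(a: float, b: float) -> float:
    """Difference a - b, rounded to two decimals to hide float noise."""
    return round(a - b, 2)


gaps_to_original = {
    model: gap(scores["original"], scores["CH-TTS"])
    for model, scores in mos.items()
}
half_data_drop = gap(mos["Orpheus"]["CH-TTS"], orpheus_half_ch)

print(gaps_to_original)   # {'SpeechT5': 0.04, 'Orpheus': 0.19}
print(half_data_drop)     # 0.11
```

So CH-TTS trails the real recordings by 0.04 MOS with SpeechT5 and 0.19 MOS with Orpheus, and halving the Orpheus training data costs 0.11 MOS on this route.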

Figure: Boxplot comparison of WER and SacreBLEU metrics for DE-TTS, CH-TTS, and PH-TTS. Closed-loop STT metrics: transcript-overlap scores can favor the direct DE-TTS route because the reference is High German text, even when listeners prefer the more dialect-appropriate CH-TTS output.

What the metrics say

WER and SacreBLEU reliably penalize PH-TTS in the current setup.
Objective metrics do not capture the preference gap between DE-TTS and CH-TTS.
PH-TTS moves closer to DE-TTS under the half-data Orpheus setting, suggesting phoneme intermediates may become more competitive when resources are scarcer.
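For readers unfamiliar with the closed-loop setup, the sketch below shows how a word error rate could be computed between the High German reference and an STT transcript of the synthesized audio. It is a plain word-level edit distance with lowercasing as the only normalization. The STT step itself is not shown, the example sentences are made up, and the paper's exact normalization and tooling may differ.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing in hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "ich gehe nach hause"            # High German input text
print(wer(reference, "ich gehe nach hause")) # 0.0
print(wer(reference, "ich gang hei"))        # 0.75
```

The second call illustrates the bias noted above: a transcript that stays close to dialect wording is scored as mostly wrong against a High German reference, regardless of how natural the audio sounds.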

Interpretation

Phonemes are promising, but the supervision pipeline is the bottleneck

PH-TTS is not evaluated as an oracle phoneme condition. Its phoneme strings are automatically inferred from audio and then fused into compact strings, so upstream noise and representation mismatch directly affect synthesis quality.
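To make the fusion step concrete, here is a minimal sketch under a stated assumption: each word arrives as a list of discrete phoneme symbols, and "fusing" means concatenating them into one compact string per word. The function name `fuse_phonemes`, the separator, and the phoneme symbols are illustrative only and are not taken from the paper.

```python
from typing import List, Sequence


def fuse_phonemes(words: Sequence[Sequence[str]], word_sep: str = " ") -> str:
    """Join per-word phoneme lists into one fused string per word."""
    fused_words: List[str] = []
    for phonemes in words:
        fused = "".join(phonemes)
        if fused:                      # skip words with no recognized phonemes
            fused_words.append(fused)
    return word_sep.join(fused_words)


# Illustrative phoneme sequences, not real system output.
discrete = [["g", "u", "ə", "t", "ə"], [], ["m", "ɔ", "r", "g", "ə"]]
print(fuse_phonemes(discrete))   # guətə mɔrgə
```

Any error in the upstream phoneme recognizer, such as a dropped or substituted symbol, is baked into the fused string, which is why noise in this step carries through to synthesis.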

Current bottleneck

Noisy audio-derived phonemes plus discrete-to-fused conversion errors.

Promising fix

Predict fused phoneme strings directly and use a phoneme-native tokenizer.
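One possible reading of a phoneme-native tokenizer is a tokenizer whose vocabulary is the phoneme inventory itself, so a fused string is split into phonemes rather than into subword pieces learned from ordinary text. The sketch below uses greedy longest-match over a small, made-up inventory. `tokenize_phonemes`, the `|` word-boundary token, and the `<unk>` token are all assumptions for illustration.

```python
from typing import List, Set


def tokenize_phonemes(fused: str, inventory: Set[str],
                      boundary: str = "|", unk: str = "<unk>") -> List[str]:
    """Greedy longest-match split of a fused phoneme string into phoneme tokens."""
    max_len = max(len(p) for p in inventory)
    tokens: List[str] = []
    i = 0
    while i < len(fused):
        if fused[i] == " ":
            tokens.append(boundary)    # keep word boundaries as explicit tokens
            i += 1
            continue
        # Try the longest candidate first so "ts" wins over "t" + "s".
        for n in range(min(max_len, len(fused) - i), 0, -1):
            piece = fused[i:i + n]
            if piece in inventory:
                tokens.append(piece)
                i += n
                break
        else:
            tokens.append(unk)         # symbol outside the inventory
            i += 1
    return tokens


inventory = {"ts", "t", "s", "a", "aɪ", "ɪ", "ʃ"}   # illustrative inventory
print(tokenize_phonemes("tsaɪt ʃa", inventory))
# ['ts', 'aɪ', 't', '|', 'ʃ', 'a']
```

With a vocabulary like this, every token corresponds to a sound unit, which avoids the mismatch that arises when a text-trained tokenizer splits phoneme strings at arbitrary points.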

Deployment hint

For now, CH-TTS with Orpheus offers the best quality and robustness trade-off.

Citation

BibTeX

@inproceedings{kakooee2026swisstext,
  title     = {Text vs. Phoneme Intermediates for Low-Resource Swiss German Text-to-Speech},
  author    = {Kakooee, Reza and Timmel, Vincenzo and Perruchoud, Daniel and Graber, Michael and Vogel, Manfred},
  booktitle = {Proceedings of SwissText},
  year      = {2026}
}