Lightweight Model Attribution and Detection of Synthetic Speech via Audio Residual Fingerprints

This project is maintained by blindconf

Synthetic Audio Example Demo

This page presents supplementary material: a selection of synthetic audio samples generated by the models used in the experiments of our paper.
Real audio samples are derived from the LJ Speech Dataset.

Experiment Summary

We investigate open-world single-model attribution using Residual Fingerprints (RFPs).
RFPs achieve near-perfect AUROC (≈1.0) in distinguishing target synthesis systems from unseen generative models and real speech, demonstrating strong generalization.

Under realistic audio perturbations — such as noise, echo, and compression — RFPs maintain high attribution accuracy. When perturbations are severe, performance can be effectively restored through simple data augmentation during RFP construction.
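The near-perfect AUROC claim above can be illustrated with a minimal sketch of the open-world attribution decision. All scores and names below are hypothetical placeholders, not the paper's actual data or code: we simply assume each test utterance receives a similarity score against the target model's RFP (higher meaning more likely to come from the target system), and compute AUROC with the rank-based (Mann-Whitney U) definition.

```python
def auroc(pos_scores, neg_scores):
    """AUROC via the rank-sum (Mann-Whitney U) statistic:
    the probability that a random positive outranks a random negative."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical similarity scores between each utterance's residual and
# the target model's RFP (illustrative values only).
target_model_scores = [0.91, 0.87, 0.95, 0.89]   # utterances from the target system
other_scores        = [0.12, 0.33, 0.08, 0.21]   # unseen models and real speech

print(auroc(target_model_scores, other_scores))  # prints 1.0: perfect separation
```

An AUROC of 1.0 means every target-model utterance scores above every non-target utterance, which is the separation regime the experiments report.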

Real Audio Sample: LJ001-0001.wav

Transcription: Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition.

Synthetic Audio Samples:

MelGAN (Large):
Avocodo:
BigVGAN:
HiFi-GAN:
Multi-band MelGAN:
Parallel WaveGAN:
WaveGlow:
Harmonic Noise Source Filter:
FastDiff:
ProDiff:

Audio Corruption Effects:

Echo Effect (strength 0.5, 500 ms delay):
Background Noise (Avg. SNR of 17.96 dB):
Reverberation:
MP3 Compression:

Real Audio Sample: LJ001-0002.wav

Transcription: In being comparatively modern.

Synthetic Audio Samples:

MelGAN (Large):
Avocodo:
BigVGAN:
HiFi-GAN:
Multi-band MelGAN:
Parallel WaveGAN:
WaveGlow:
Harmonic Noise Source Filter:
FastDiff:
ProDiff:

Audio Corruption Effects:

Echo Effect (strength 0.5, 500 ms delay):
Background Noise (Avg. SNR of 17.96 dB):
Reverberation:
MP3 Compression: