LinearVC: Linear transformations of self-supervised features through the lens of voice conversion

Overview

Voice conversion is performed using just linear regression. The work is described in:
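As a rough illustration of what "just linear regression" means here, the sketch below fits a projection matrix W by least squares so that source-speaker feature frames map onto target-speaker frames. It assumes the frames are already paired row by row. The function name fit_projection, the random toy features, and the use of NumPy are illustrative assumptions and are not taken from linearvc.py.

```python
import numpy as np


def fit_projection(source_feats, target_feats):
    """Least-squares W such that source_feats @ W approximates target_feats.

    Both inputs have shape (n_frames, dim) and are assumed to be paired
    row by row (a hypothetical setup, not the repository's code).
    """
    W, *_ = np.linalg.lstsq(source_feats, target_feats, rcond=None)
    return W


# Toy data standing in for self-supervised features (assumption)
rng = np.random.default_rng(0)
source = rng.standard_normal((200, 8))
true_W = rng.standard_normal((8, 8))
target = source @ true_W

W = fit_projection(source, target)
converted = source @ W  # in a real pipeline these frames would be vocoded
```

In the actual model the features come from WavLM and the projected frames are passed to the HiFi-GAN vocoder, as in the quick start.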

H. Kamper, B. van Niekerk, J. Zaïdi, and M-A. Carbonneau, "LinearVC: Linear transformations of self-supervised features through the lens of voice conversion," in Interspeech, 2025.

Samples: https://www.kamperh.com/linearvc/

Quick start

Programmatic usage

Install the dependencies listed in environment.yml, for example by running conda env create -f environment.yml, and check that everything installed correctly. The steps below are also illustrated in the demo notebook (demo.ipynb).

import torch
import torchaudio
import linearvc # linearvc.py from this repository
device = "cuda" # or "cpu"
# Load all the required models
wavlm = torch.hub.load(
 "bshall/knn-vc", 
 "wavlm_large", 
 trust_repo=True, 
 progress=True, 
 device=device, 
)
hifigan, _ = torch.hub.load(
 "bshall/knn-vc",
 "hifigan_wavlm",
 trust_repo=True,
 prematched=True,
 progress=True,
 device=device,
)
linearvc_model = linearvc.LinearVC(wavlm, hifigan, device)
# Lists of source and target audio files
source_wavs = [
 "<filename of audio from source speaker 1>.wav",
 "<filename of audio from source speaker 2>.wav",
 ...,
]
target_wavs = [
 "<filename of audio from target speaker 1>.wav",
 "<filename of audio from target speaker 2>.wav",
 ...,
]
# Source input utterance
input_features = linearvc_model.get_features("<filename>.wav")
# Voice conversion projection matrix
W = linearvc_model.get_projmat(
 source_wavs,
 target_wavs,
 parallel=True, # enable if parallel
 vad=False,
)
# Project the input and vocode
output_wav = linearvc_model.project_and_vocode(input_features, W)
torchaudio.save("output.wav", output_wav[None], 16000)

If parallel=True, utterances with the same filename are paired up. If parallel=False, the utterances do not have to match, but more data is needed: about 3 minutes per speaker works well, and more than that doesn't help much.
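The filename-based pairing can be pictured with the sketch below, which keeps only the files present in both directories and returns them in matching order. The helper pair_by_filename is hypothetical and not part of the repository. It only illustrates the pairing rule and does not show how get_projmat matches files internally.

```python
from pathlib import Path


def pair_by_filename(source_dir, target_dir, extension=".wav"):
    """Return (source_wavs, target_wavs) for filenames found in both directories."""
    source_files = {p.name: p for p in Path(source_dir).glob(f"*{extension}")}
    target_files = {p.name: p for p in Path(target_dir).glob(f"*{extension}")}
    # Keep only filenames that occur on both sides, in sorted order
    shared = sorted(source_files.keys() & target_files.keys())
    source_wavs = [str(source_files[name]) for name in shared]
    target_wavs = [str(target_files[name]) for name in shared]
    return source_wavs, target_wavs
```

For example, pair_by_filename("data/vctk_demo/p225/", "data/vctk_demo/p226/") would produce two lists that could be passed as source_wavs and target_wavs.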

Script usage

Perform LinearVC by finding all the source and target audio files in given directories:

./linearvc.py \
 --extension .flac \
 ~/LibriSpeech/dev-clean/1272/ \
 ~/LibriSpeech/dev-clean/1462/ \
 ~/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac \
 output.wav

When parallel utterances are available, much less data is needed. Running the script with --parallel scans two directories and pairs up all utterances with the same filename. For example, the command below finds 002.wav, 003.wav, etc. in the p225/ source directory and pairs these up with the same filenames in the p226/ directory.

./linearvc.py \
 --parallel \
 data/vctk_demo/p225/ \
 data/vctk_demo/p226/ \
 data/vctk_demo/p225/067.wav \
 output2.wav

Full script details:

usage: linearvc.py [-h] [--parallel] [--lasso LASSO] [--vad]
                   [--extension {.flac,.wav}]
                   source_wav_dir target_wav_dir input_wav output_wav

Perform voice conversion with linear regression.

positional arguments:
  source_wav_dir        directory with source speaker speech
  target_wav_dir        directory with target speaker speech
  input_wav             input speech filename
  output_wav            output speech filename

options:
  -h, --help            show this help message and exit
  --parallel            whether source and target utterances are parallel, in
                        which case the filenames in the two directories should
                        match
  --lasso LASSO         lasso is applied with this alpha value
  --vad                 voice activity detection is applied to start of
                        utterance
  --extension {.flac,.wav}
                        source and target audio file extension (default:
                        '.wav')
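For readers who want to call the same interface from their own tooling, the sketch below builds an argparse parser consistent with the help text above. It is a reconstruction from that help text only, not the parser in linearvc.py. In particular, treating --lasso as a float with a default of None is an assumption.

```python
import argparse


def build_parser():
    """Parser mirroring the documented options (reconstructed, not the original)."""
    parser = argparse.ArgumentParser(
        prog="linearvc.py",
        description="Perform voice conversion with linear regression.",
    )
    parser.add_argument("source_wav_dir", help="directory with source speaker speech")
    parser.add_argument("target_wav_dir", help="directory with target speaker speech")
    parser.add_argument("input_wav", help="input speech filename")
    parser.add_argument("output_wav", help="output speech filename")
    parser.add_argument(
        "--parallel",
        action="store_true",
        help="source and target utterances are parallel (matching filenames)",
    )
    # Assumption: the alpha value is a float and lasso is off by default
    parser.add_argument("--lasso", type=float, default=None, help="lasso alpha value")
    parser.add_argument(
        "--vad",
        action="store_true",
        help="apply voice activity detection to start of utterance",
    )
    parser.add_argument(
        "--extension",
        choices=[".flac", ".wav"],
        default=".wav",
        help="source and target audio file extension",
    )
    return parser


# Parse the arguments of the parallel example shown earlier
args = build_parser().parse_args(
    [
        "--parallel",
        "data/vctk_demo/p225/",
        "data/vctk_demo/p226/",
        "data/vctk_demo/p225/067.wav",
        "output2.wav",
    ]
)
```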

Experiments on all utterances (LibriSpeech)

These experiments are described in (Kamper et al. 2025).

Extract WavLM features:

./extract_wavlm_libri.py \
 --exclude data/eval_inputs_dev-clean.txt \
 ~/endgame/datasets/librispeech/LibriSpeech/dev-clean/ \
 ~/scratch/dev-clean/wavlm_exclude/
./extract_wavlm_libri.py \
 --exclude data/eval_inputs_test-clean.txt \
 ~/endgame/datasets/librispeech/LibriSpeech/test-clean/ \
 ~/scratch/test-clean/wavlm_exclude/

Experiments with all utterances:

jupyter lab experiments_libri.ipynb

Experiments on parallel utterances (VCTK)

These experiments are not described in the paper but are still interesting.

Downsample speech to 16 kHz:

# Development set
./resample_vad.py \
 data/vctk_scottish.txt \
 ~/endgame/datasets/VCTK-Corpus/wav48/ \
 ~/scratch/vctk/wav/scottish/
# Test set
./resample_vad.py \
 data/vctk_english.txt \
 ~/endgame/datasets/VCTK-Corpus/wav48/ \
 ~/scratch/vctk/wav/english/

Create the evaluation dataset (which is already in the data/ directory released with the repo):

./evalcsv_vctk.py \
 data/vctk_scottish.txt \
 /home/kamperh/scratch/vctk/wav/scottish/ \
 data/speakersim_vctk_scottish_2024-09-16.csv
./evalcsv_vctk.py \
 data/vctk_english.txt \
 /home/kamperh/scratch/vctk/wav/english/ \
 data/speakersim_vctk_english_2024-09-16.csv

Extract features for particular parallel utterances (for baselines):

./extract_wavlm_vctk.py --utterance 008 \
 ~/scratch/vctk/wav/english/ ~/scratch/vctk/english/wavlm_008/

Experiments with parallel utterances:

jupyter lab experiments_vctk.ipynb
