Abstract: Voice conversion (VC) aims to modify the speaker’s timbre while retaining speech content. Previous approaches have tokenized the outputs of self-supervised models into semantic tokens, ...