Voice Cloning (Recording Guide)

Create high-quality custom voices with pristine training data


Quality is Everything

To create a high-quality custom voice, the input audio (training data) must be pristine. The AI replicates everything it hears—including imperfections.

1. Recording Location

Minimize room echo and reverb. "Deaden" the room using one of these options:

Vocal booth: Professional isolation booth (best option)
Closet: A closet full of clothes naturally absorbs sound
Blanket fort: Surround yourself with heavy blankets to dampen reflections

Quick Test

Clap your hands in the room. If you hear a noticeable echo or ring, the room needs more acoustic treatment.

2. Equipment

Equipment	Recommendation	Notes
Microphone	Professional XLR mic ($150-$300)	Audio-Technica AT2020 or Rode NT1 recommended
Interface	USB Audio Interface	Focusrite Scarlett series recommended
Pop Filter	Essential	Reduces plosives (bursts of air hitting the mic on P, B, and T sounds)

3. Software (DAW)

Use professional recording software like REAPER or Audacity.

Recommended Settings

Setting	Value
File Format	WAV (uncompressed)
Sample Rate	44.1 kHz or 48 kHz
Bit Depth	24-bit
Peak Levels	-6 dBFS to -3 dBFS
Average Loudness	-18 dBFS
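
The format settings in the table can be checked after export with a short script. The following is a minimal sketch, not part of any voice-cloning tool: it uses Python's standard `wave` module, and the function name `check_wav_format` is illustrative. The `wave` module reads uncompressed PCM files only and may raise `wave.Error` on other encodings.

```python
import wave

# Targets taken from the Recommended Settings table.
ALLOWED_SAMPLE_RATES = {44100, 48000}
REQUIRED_BIT_DEPTH = 24


def check_wav_format(path):
    """Return a list of problems found in the WAV header (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        bit_depth = wav.getsampwidth() * 8  # sample width is reported in bytes
    if sample_rate not in ALLOWED_SAMPLE_RATES:
        problems.append(f"sample rate is {sample_rate} Hz, expected 44100 or 48000")
    if bit_depth != REQUIRED_BIT_DEPTH:
        problems.append(f"bit depth is {bit_depth}-bit, expected 24-bit")
    return problems
```

Calling `check_wav_format("take01.wav")` (a hypothetical file name) returns an empty list when the header matches the table, and one message per mismatch otherwise.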

Avoid Clipping

Never let your levels hit 0 dBFS, the digital maximum. Clipped audio cannot be fixed and will ruin your voice clone.
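
Levels can also be measured on a finished take instead of read off the meters. The sketch below rests on several assumptions: dB values are read as dBFS (decibels relative to digital full scale), "average loudness" is approximated by the RMS level of the whole file, and samples at or above 0.999 of full scale are treated as possible clipping. The function names and the 0.999 threshold are illustrative, and only 16-bit and 24-bit PCM files are handled.

```python
import math
import wave


def read_samples(path):
    """Read a 16- or 24-bit PCM WAV file; return samples scaled to [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav:
        width = wav.getsampwidth()
        raw = wav.readframes(wav.getnframes())
    if width not in (2, 3):
        raise ValueError(f"unsupported sample width: {width} bytes")
    full_scale = float(1 << (8 * width - 1))
    # WAV stores PCM samples as little-endian signed integers.
    return [
        int.from_bytes(raw[i:i + width], "little", signed=True) / full_scale
        for i in range(0, len(raw) - width + 1, width)
    ]


def to_dbfs(amplitude):
    """Convert a linear amplitude (1.0 = full scale) to dBFS."""
    return 20 * math.log10(amplitude) if amplitude > 0 else float("-inf")


def level_report(samples, clip_threshold=0.999):
    """Summarize peak level, RMS level, and possible clipping for one take."""
    if not samples:
        raise ValueError("no samples to measure")
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    peak_db = to_dbfs(peak)
    return {
        "peak_dbfs": peak_db,
        "rms_dbfs": to_dbfs(rms),
        "peak_in_target": -6.0 <= peak_db <= -3.0,  # range from the settings table
        "possibly_clipped": peak >= clip_threshold,
    }
```

A typical call would be `level_report(read_samples("take01.wav"))`, where the file name is hypothetical. Stereo files are measured across both channels together, which is enough for a quick check but not a substitute for a loudness meter.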

4. Positioning

Distance: Maintain approximately 20 cm (about two fists) from the microphone
Angle: Speak slightly off-axis (at an angle) to avoid breathing directly into the mic
Consistency: Keep the same position throughout the entire recording session

5. Performance

Consistency is Key

Keep the same energy and tone throughout your recording. Do not mix different styles in the training file:

✅ Do	❌ Don't
Maintain consistent volume	Mix whispering and shouting
Keep the same energy level	Vary between excited and tired delivery
Use natural pacing	Rush through some parts
Speak clearly and cleanly	Include "uhms", stutters, or mistakes
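
Volume consistency can be estimated roughly by comparing the level of each second of the recording. The sketch below is one possible approach under stated assumptions: samples are floats in the range -1.0 to 1.0 (already decoded from the WAV file), the window length is one second, and windows quieter than -50 dBFS are treated as pauses and skipped. The function names and both defaults are illustrative, and the guide itself does not define a numeric limit for "consistent".

```python
import math


def window_levels(samples, sample_rate, window_seconds=1.0, silence_floor_db=-50.0):
    """Return the RMS level in dBFS of each non-silent window."""
    size = max(1, int(sample_rate * window_seconds))
    levels = []
    for start in range(0, len(samples) - size + 1, size):
        chunk = samples[start:start + size]
        rms = math.sqrt(sum(s * s for s in chunk) / size)
        if rms > 0:
            level = 20 * math.log10(rms)
            if level > silence_floor_db:  # skip pauses between phrases
                levels.append(level)
    return levels


def loudness_spread(samples, sample_rate, **kwargs):
    """Difference in dB between the loudest and quietest non-silent window."""
    levels = window_levels(samples, sample_rate, **kwargs)
    return max(levels) - min(levels) if levels else 0.0
```

A take in which one second is half as loud as another yields a spread of about 6 dB. A large spread suggests mixed delivery, such as whispering and shouting in the same file, and a small one suggests steady volume.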

Critical Warning

The AI replicates everything, including breaths, stutters, lip smacks, and "uhms". Ensure the performance is exactly what you want cloned. Any imperfections in your training data will appear in the generated voice.

Summary Checklist

The room is acoustically treated (no echo)
Using a quality XLR microphone with pop filter
Recording in WAV format at 44.1/48 kHz, 24-bit
Levels peaking at -6 to -3 dBFS
Positioned 20 cm from mic at a slight angle
Consistent tone and energy throughout
No stutters, "uhms", or unwanted sounds