
A Basic Tutorial on UTAU, OREMO, and SetParam for Making and Using Voicebanks


SetParam           

SetParam is a program that you will be using to create an “oto.ini” file.

An “oto.ini” file is basically just a file describing the voice configurations for each syllable you recorded.

 

(People sometimes use the word “oto” as a verb; so if someone says they will “oto” a voicebank, that means that they will create an “oto.ini” file.)

 

These voice configuration details are heavy, time-consuming stuff. That’s why many people ask or pay someone more knowledgeable to oto a voicebank for them. If you plan to do that, there really is no reason for you to download SetParam or read this section. Just ask someone else to oto for you.

 

However, if you want to create your own oto.ini file and really understand how UTAU works, then you’ll want to read this section carefully. Before we get into the specifics on voice configurations and the oto.ini, though, I’ll need to explain a few things first.

A Little (But Important) Lesson on Audio and Linguistics (?)           

Before we get to SetParam, there are a few things about recordings that you should probably know.

 

Each syllable has two parts: a consonant and a vowel. The consonant part is basically the little part in every syllable that you say before saying the actual vowel.

 

For instance, when you say the syllable “sha,” you can hear the “sh” part before you actually hear the “a” vowel. The “sh” part is the consonant, and the “a” part is the vowel.

Similarly, whenever you say “ku,” you can hear the “k” part before you actually hear the “u” vowel. The “k” part is the consonant and the “u” part is the vowel.

 

Obviously, then, syllables that are just vowels themselves (i.e. “a,” “i,” “u,” “e,” “o,” and generally “n”) do not have consonants and only have vowels.

*NOTE: Because of the consonant-vowel system, the voicebank you are making right now is called a “CV voicebank.”

 

CV voicebanks are the most popular and common because they’re the easiest to make. Other types of voicebanks exist as well, the other popular ones currently being VCV (vowel-consonant-vowel) and CVVC (consonant-vowel-vowel-consonant).

 

CV requires you to simply record each syllable separately, one time. VCV and CVVC require you to record multiple syllables together in one sound file, and they take more oto-ing to create the finished voicebank.

 

Since CV is the easiest to make, if you’re making an UTAU voicebank for the first time, many people recommend making a CV voicebank first.

 

I won’t go too much into the details and differences between all of them, since this tutorial was meant mostly to give you a basic understanding of UTAU voicebanks. However, the images below should give you a rough idea of the differences between CV, VCV, and CVVC voicebanks.

 

CV:

VCV:

CVVC:

 

 

As you can probably tell, VCV and CVVC need a lot more recordings. Because of this, VCV and CVVC voicebanks tend to sound more fluid and realistic, and less choppy than CV voicebanks sometimes do. However, this doesn’t mean CV voicebanks are bad. There are many CV voicebanks that still sound amazing, so don’t worry too much about this for your first voicebank!

In the UTAU oto.ini world, there are six important words/phrases you should probably know:

  1. Preutterance.

    • You know how when you say a syllable, there’s always a little bit of consonant before you say the actual vowel? Preutterance, then, is basically that division in between the consonant and the vowel (ie. between the “sh” and “a” in “sha”). This applies for all the other syllables except just vowels themselves. In that case, there is no preutterance.

  2. Overlap.

    • This is how much you want the vowel of the previous note to overlap with the consonant of the current note. Generally, this is about half of the preutterance value.

  3. Consonant.

    • Not to be confused with the linguistic consonant we just talked about. In this case, when talking about oto.ini files, the consonant value marks the point after which UTAU is allowed to stretch or loop the sound for longer notes; everything before that point is played as recorded. The consonant is the reason why you don’t need to record voicebank sound files for a long time, since UTAU is able to stretch out the vowel.

    • For instance, if you want your UTAU to sing a “ra” note for three hours (not sure why, but if you wanted you could do it), you don’t need to actually record your ra.wav for three hours because UTAU can stretch out the “a” sound. The part of your recording that UTAU stretches out starts at the consonant value you set.

    • Thus, it makes sense that your consonant value is generally in the middle of your voice file, where your vowel is the most consistent and stable.

  4. Offset.

    • The offset is the part you want UTAU to cut off in the beginning of your voice file (a.k.a. the part of your recording where you don’t say anything at all). You don’t want your UTAU to have a bunch of silence in the beginning of the files, do you?

  5. Cut-off.

    • The cut-off is basically the same thing as the offset except it is the part where you want UTAU to cut off at the end of your voice file (a.k.a. any silence you recorded after saying the syllable).

  6. Alias.

    • This one is easy. This is basically a “nickname” for the syllable you recorded. Usually, the alias is the Hiragana form of the syllable name if you recorded your voicebank in Romaji. If you recorded your voicebank in Hiragana, the alias would be in Romaji.

    • For instance, if you recorded a syllable in a file called “a.wav,” the alias would be “あ.” If you recorded a syllable in a file called “つ.wav,” the alias would be “tsu.”

    • This is basically just a way for UTAU to recognize both Hiragana and Romaji, and makes it so that you don’t have to record multiple times to have both.

    • This isn’t just limited to Japanese, though. You can do it for any language. If you want to alias, say, an Arabic voicebank, you can set the alias to the romanization of Arabic syllables, and vice versa.

And that’s about all you need to know for oto.ini lingo!
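If it helps to see how these six terms end up in the file, here is a minimal Python sketch that splits one oto.ini line into named fields. It assumes the commonly documented field order (alias, offset, consonant, cutoff, preutterance, overlap, all in milliseconds). The `OtoEntry` class, the `parse_oto_line` function, and the numbers in the sample line are made up for illustration; they are not part of UTAU or SetParam.

```python
from dataclasses import dataclass


@dataclass
class OtoEntry:
    filename: str
    alias: str
    offset: float        # ms skipped at the start of the file
    consonant: float     # where stretching may begin, measured from the offset
    cutoff: float        # ms skipped at the end of the file
    preutterance: float  # measured from the offset
    overlap: float       # measured from the offset


def _to_ms(text: str) -> float:
    # Treat an empty field as 0 (an assumption, to keep the sketch simple).
    return float(text) if text.strip() else 0.0


def parse_oto_line(line: str) -> OtoEntry:
    """Split one 'file.wav=alias,offset,consonant,cutoff,preutterance,overlap' line."""
    filename, _, params = line.strip().partition("=")
    # Split from the right so the five numeric fields are always found.
    alias, offset, consonant, cutoff, preutterance, overlap = params.rsplit(",", 5)
    return OtoEntry(filename, alias, _to_ms(offset), _to_ms(consonant),
                    _to_ms(cutoff), _to_ms(preutterance), _to_ms(overlap))


# Made-up values: note the overlap is about half of the preutterance.
entry = parse_oto_line("sha.wav=しゃ,50,120,80,90,45")
print(entry)
```

You never need to do this by hand, since SetParam writes the file for you, but it can make the columns easier to read later.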

Now we can start using SetParam.

Using SetParam           

Now, open up SetParam. The first thing you need to do is find the folder where you saved all your voice files and click “OK.”

 

It’ll prompt you to “load a voice configuration file.” Select “Don’t load,” since obviously your UTAU voicebank doesn’t have an oto.ini file just yet. (But if you’re editing a previous oto.ini file, click “Load” and navigate to your oto.ini file.)

 

Now, you should get two screens: a screen with one of the syllables you recorded, and a screen with a list of all the syllables you have in the folder. Like in OREMO, make sure the “Show Waveform,” “Show Spectrum,” “Show Power,” and “Show F0” options are selected in the first screen.

 

First screen:


Second screen:

Here are some hotkeys you will need to know for SetParam:

  • F1: sets the Offset (“Left Blank”) value. (black line shaded green)

  • F2: sets the Overlap value. (green line)

  • F3: sets the Preutterance value. (red line)

  • F4: sets the Consonant value. (blue line)

  • F5: sets the Cutoff (“Right Blank”) value. (black line shaded yellow)

(For some computers, you may need to press “shift” along with the function key to get it to work.)

 

On the first screen, just click anywhere to play the sound recording.

To set the different values, simply hover over the spot where you want your Offset/Overlap/Preutterance/Consonant/Cutoff and then press the appropriate key on your keyboard.

 

So, for a consonant-vowel sound, I might have this configuration:

And for a vowel sound, I might have this:

(Don’t worry if you have noise in the beginning of your sound files; the Offset value will get rid of it in UTAU anyway.)

Notice where each of my values was placed in my “cha” recording. Make use of the Spectrum, Power, and F0 graphs! Generally, you can set your Preutterance value to the middle of the dip on the Power graph or the start of the dots on the F0 graph, your Overlap value to the first peak of the Power graph or halfway between the Offset and Preutterance, and your Consonant value to somewhere in between the beginning and the end.

 

Of course, not all recordings are the same. Your graphs might be a bit different for your sound recordings. However, they should look generally alike. Experiment a bit to find good values!
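If you are curious how marker positions on the waveform relate to the numbers that end up in the columns, here is a rough Python sketch. It assumes (as far as I understand the format) that Overlap, Preutterance, and Consonant are stored in milliseconds relative to the Offset, and that a positive Cutoff is counted back from the end of the file. The function names and all the numbers are hypothetical, just for illustration.

```python
def suggest_overlap_marker(offset_ms, preutterance_ms):
    """Rule of thumb from above: halfway between Offset and Preutterance."""
    return (offset_ms + preutterance_ms) / 2


def markers_to_oto(offset_ms, overlap_ms, preutterance_ms,
                   consonant_ms, cutoff_ms, file_length_ms):
    """Turn marker positions (ms from the start of the file) into oto values."""
    return {
        "offset": offset_ms,
        # These three are assumed to be measured from the Offset marker.
        "overlap": overlap_ms - offset_ms,
        "preutterance": preutterance_ms - offset_ms,
        "consonant": consonant_ms - offset_ms,
        # Assumed convention: positive cutoff is counted from the file's end.
        "cutoff": file_length_ms - cutoff_ms,
    }


# Made-up marker positions for a 600 ms recording.
overlap_marker = suggest_overlap_marker(offset_ms=50, preutterance_ms=140)
values = markers_to_oto(offset_ms=50, overlap_ms=overlap_marker,
                        preutterance_ms=140, consonant_ms=170,
                        cutoff_ms=520, file_length_ms=600)
print(values)
```

Treat the result as a starting guess only; your ears and the graphs are still the real judge.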

 

Notice how my vowel sound has the same values for Offset, Overlap, and Preutterance. This is because there is no consonant sound in the beginning of the vowel, so no Preutterance is necessary.

*NOTE: Remember, the word “vowel” doesn’t necessarily have to only apply to the standard vowels of the English language (ie. “a,” “i,” “u,” “e,” and “o”). For oto configurations, “vowel” also applies to any one-character syllables, like “n” and “m.” So make sure your “n” and “m” configurations are like your standard vowel ones!

Breath sounds are a little bit trickier. You’d want your Preutterance to be maybe about a third of the way through, with a small Overlap value. You can try to use this as a reference:

The Consonant value is also a bit strange for breath sounds. Since you want to ensure that UTAU plays the whole breath sound without stretching it out too early, you’d want your Consonant value to be big.

 

For breath-out sounds, again, make sure the Consonant is pretty big:

As you oto each syllable, you should notice that the second screen starts to fill in with numbers for each of the columns.

 

Once you’re done going through every single recording and oto-ing all of them (this is why some people pay others to oto for them; it takes a long time!), it’s time to alias all of them!

Use the Hiragana chart to type them all in on the second screen:

https://en.wikipedia.org/wiki/Hiragana

*NOTE: Some of the more obscure sounds are not on Wikipedia:

Da: だ, De: で, Di: でぃ, Do: ど, Du: どぅ

Fa: ふぁ, Fe: ふぇ, Fi: ふぃ, Fo: ふぉ, Fu: ふ

Va: ヴぁ, Ve: ヴぇ, Vi: ヴぃ, Vo: ヴぉ, Vu: ヴ

Si: すぃ

Breath: 息
 

Some of these might not be exactly the same from voicebank to voicebank, but these should be generally just about right.
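If you end up looking these up a lot, the list above can be kept as a small lookup table. Here is a minimal Python sketch; the `EXTRA_ALIASES` name and the `alias_for` helper are my own invention, and falling back to the file name when a sound is not in the table is just an assumption for the example. You would still type the result into SetParam yourself.

```python
from pathlib import Path

# The less common sounds listed above, keyed by lowercase Romaji file name.
EXTRA_ALIASES = {
    "da": "だ", "de": "で", "di": "でぃ", "do": "ど", "du": "どぅ",
    "fa": "ふぁ", "fe": "ふぇ", "fi": "ふぃ", "fo": "ふぉ", "fu": "ふ",
    "va": "ヴぁ", "ve": "ヴぇ", "vi": "ヴぃ", "vo": "ヴぉ", "vu": "ヴ",
    "si": "すぃ",
    "breath": "息",
}


def alias_for(wav_name, table=EXTRA_ALIASES):
    """Return the Hiragana alias for a Romaji file name, if the table has one."""
    stem = Path(wav_name).stem.lower()
    # Unknown sounds fall back to the Romaji name itself (an assumption).
    return table.get(stem, stem)


for name in ["fa.wav", "Vu.wav", "breath.wav"]:
    print(name, "->", alias_for(name))
```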

When you finish oto-ing and aliasing everything, your second screen should look something like this:

(If you want, you can put comments for some things, but I’ve generally not found much use for that.)

If you just open up the oto.ini file in, say, Notepad, it should look something like this:

Notice how I have three entries for my breath.wav. This is because there are multiple ways to alias the breath sound, so to do that, I just copy-pasted the entry in the oto.ini file on Notepad and changed the alias. The aliases are all associated with the breath.wav file, so I didn’t need to make multiple copies of my breath file; the oto.ini can associate more than one entry with the same file.
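The copy-paste trick can also be sketched in a few lines of Python, in case you have many files that need extra aliases. This assumes each entry has the shape "file.wav=alias,numbers...". The `alias_copies` function, the extra aliases “breath” and “br”, and the numbers are all hypothetical.

```python
def alias_copies(oto_line, aliases):
    """Repeat one oto.ini entry once per alias, keeping the same file and values."""
    filename, _, params = oto_line.strip().partition("=")
    # Drop the old alias and keep everything after the first comma.
    _, _, values = params.partition(",")
    return [f"{filename}={alias},{values}" for alias in aliases]


# Made-up entry: one breath.wav recording, three ways to call it.
lines = alias_copies("breath.wav=息,30,400,20,120,10", ["息", "breath", "br"])
for line in lines:
    print(line)
```

Every line still points at the same breath.wav, so nothing needs to be re-recorded.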

 

And now… you’re done! You’ve completely created an UTAU voicebank!
