Vocaloid 4 is the current version of Yamaha’s well-known voice synthesis software, which is available as a standalone Windows application, or as a Cubase plugin for PC and Mac systems. Cyber Songman and Cyber Diva are two recent English voice banks for Vocaloid 4, also developed by Yamaha. Both were created from recordings of native speakers of American English.
Vocaloid has been around for more than 15 years, having started out as a research project at a university in Spain. There have been dozens of Vocaloids (voices which work in Vocaloid software) released by various developers over the years. Hatsune Miku is the most successful one to date, which is why many people associate Vocaloid in general with her attributes – cuteness, a marketing empire built around the character, a high-pitched voice, anime-style design and Japanese language. Cyber Diva and Cyber Songman are quite far from that, being pretty much singing tools. They don’t have any character design besides the box images and only sing in English. With both of these products, I get the impression that the primary intended users are not Vocaloid fans, but more mainstream producers.
Meet The Androids
Both Cyber Songman and Cyber Diva have pop/rock singer voices – not opera or metal. Cyber Songman is a smooth and warm voice, not as high as most male pop singers on the charts today, but somewhere in Chris Brown or Jason Aldean territory. Cyber Diva has a voice that’s on the high side by female singer standards, delicate and slightly cute. Not anime girl levels of cuteness, but a little bit in that direction. With a slight adjustment to the Gender parameter, she can sound smoother and more mature, closer to Julie London.
When using their user dictionaries, both sing English with a standard American accent – like someone from Iowa or thereabouts. Both virtual singers tend to pronounce things more clearly and correctly than humans do when singing, so some editing of the phonemes is useful for making them sound smoother. Changing the phonemes is also needed to make them sing with a regional accent. Cyber Songman is a more recent release and includes a few extra American phonemes in his data, as well as a different dictionary. His English is generally more natural, with the main flaw being that he sustains the consonant “r” longer than most people would – for example in the word “fighter” of this demo song.
Using phonetics to sing non-English languages with an American accent works, although it easily gets glitchy and obviously robotic when trying consonant combinations which don’t normally exist in English. Latin with a heavy American accent (like a singer in a Hollywood movie score) works quite well, though.
Both also support the Growl parameter, added to Vocaloid in version 4. It works very nicely with Cyber Songman. It won’t quite turn him into a Louis Armstrong or Axl Rose soundalike, but it works for a moderately rough rock or blues voice, and is also great for adding expression. Tweaking both Growl and Gender together can push things into Barry White territory. With Cyber Diva, I didn’t like the results as much. Raising Growl for short periods of time works very well for adding color and expression, but keeping it at high values throughout a whole song sounds kind of funny – like a singer who’s trying to sing smoothly but needs to clear her throat, and not a singer intentionally singing hoarsely.
Vocaloid sticks pretty closely to what a real human singer could do – no polyphony within a single part, and a four-octave range, although the recommended range is a narrower A1-A3 for Cyber Songman and G2-C4 for Cyber Diva. Near the top of their recommended range, both voices start to sound thin and nasal, like many human singers would, and not unrealistically strong and clear. Singing a little below their recommended range sounds fine to me, as well. In general, pushing the vocals into abstract, inhuman or extreme territory – chopped vocals, swirly pads or dubstep basses – isn’t possible. There is a cross-synthesis feature which creates a hybrid of two voice banks, but I was not able to test it because it is not supported with English-language Vocaloids. If sounding robotic is the goal, that’s quite easy – keeping the dynamics flat and note velocities high will do the trick, and making a phrase staccato and adding a steady level of growl will make things even more robotic.
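For anyone who wants to check a melody against those recommended ranges before entering it, here is a minimal Python sketch. It is not part of Vocaloid; the function names are hypothetical, and it assumes the Yamaha-style octave convention in which C3 is middle C (MIDI note 60). If your notation uses C4 as middle C, pass middle_c_octave=4 instead.

```python
# Semitone offsets of the natural notes within an octave.
NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

# Recommended ranges as quoted in the review.
RECOMMENDED_RANGES = {
    "Cyber Songman": ("A1", "A3"),
    "Cyber Diva": ("G2", "C4"),
}


def note_to_midi(name, middle_c_octave=3):
    """Convert a note name such as 'A1' or 'F#2' to a MIDI note number.

    middle_c_octave=3 is an assumption (Yamaha-style naming, C3 = MIDI 60).
    """
    letter = name[0].upper()
    rest = name[1:]
    accidental = 0
    # Consume any sharps (#) or flats (b) that follow the letter.
    while rest and rest[0] in "#b":
        accidental += 1 if rest[0] == "#" else -1
        rest = rest[1:]
    octave = int(rest)
    return 60 + (octave - middle_c_octave) * 12 + NOTE_OFFSETS[letter] + accidental


def notes_outside_range(voice, melody, middle_c_octave=3):
    """Return the notes in `melody` that fall outside the voice's recommended range."""
    low_name, high_name = RECOMMENDED_RANGES[voice]
    low = note_to_midi(low_name, middle_c_octave)
    high = note_to_midi(high_name, middle_c_octave)
    return [n for n in melody if not low <= note_to_midi(n, middle_c_octave) <= high]


if __name__ == "__main__":
    melody = ["A1", "C3", "B3"]
    print(notes_outside_range("Cyber Songman", melody))
```

Notes flagged this way are not necessarily unusable, since singing a little below the recommended range sounded fine in my tests, but they are the ones worth auditioning first.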
Control The Androids
Fortunately for me, you don’t need to know any Japanese in order to use Vocaloid. I was able to download, install, register, update and use Vocaloid without running into any language barriers. The only Japanese text I ran across was on the upload page, which fortunately had a clearly labeled button with the word “Download” in English. After a few hours of poking around and trying various features, I was able to make the following video which shows the effort and time required to get eight bars of intelligible and expressive vocals.
Vocaloid Editor works more or less like a DAW, with support for multiple vocal parts, a piano roll, control parameters which more or less work like MIDI CC, and note properties. It does not support VST instruments, and backing tracks can instead be imported as WAV files. There’s no MIDI keyboard or controller input support, so if you want to play your vocal lines using a keyboard, that will have to be done elsewhere and then imported into Vocaloid. VST effects are supported with some limitations: their control parameters cannot be automated and there are no mixer sends. This basically means that exciters, compressors and similar effects which usually need one setting per track can be used just fine inside Vocaloid, but if you want to automate wet/dry levels, high-pass reverbs, turn delays on and off etc., all that will have to be done in your main DAW.
In general, just punching in the notes and lyrics gives good results for simple scores. Words are pronounced intelligibly and the timing is good. In reality, singers will normally start syllables which begin with consonants before the written note, so that the syllable’s vowel starts when the note actually starts as written. Vocaloid does an excellent job of automatically compensating for this, with consonant length controlled by MIDI velocity. With more complex scores which include larger numbers of connected notes, there’s a little more work required in order to get the right part of the syllable sustained. For example, trying to stretch the second syllable of “blooming” across a few ornamented notes, I had to tell Vocaloid to sustain the “oo” and not the “n”.
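The timing compensation described above can be illustrated with a small Python sketch. This is not Vocaloid’s actual algorithm: the function name, the 80 ms base length and the exponential velocity curve are all assumptions made for illustration, including the assumption that a higher velocity gives a shorter consonant. The only idea taken from the paragraph is that the consonant begins before the written note so that the vowel lands on the beat.

```python
def consonant_onset(note_start_ms, velocity, base_consonant_ms=80.0):
    """Return (consonant_start_ms, consonant_length_ms) for one syllable.

    Hypothetical mapping: velocity 64 gives the base length, velocity 0
    doubles it, and higher velocities shorten it. The vowel always starts
    exactly at note_start_ms.
    """
    if not 0 <= velocity <= 127:
        raise ValueError("velocity must be in the MIDI range 0-127")
    # Assumed curve: each 64 steps of velocity halves the consonant length.
    scale = 2.0 ** ((64 - velocity) / 64.0)
    consonant_ms = base_consonant_ms * scale
    return note_start_ms - consonant_ms, consonant_ms


if __name__ == "__main__":
    for vel in (0, 64, 127):
        start, length = consonant_onset(1000.0, vel)
        print(f"velocity {vel}: consonant starts at {start:.1f} ms, lasts {length:.1f} ms")
```

The point of the sketch is only the direction of the adjustment: whatever the real curve is, the consonant is pulled earlier in time while the written note position stays tied to the vowel.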
Several control parameters are available, with Dynamics being the most important one. Between that, the Breath and Growl parameters, as well as vibrato, it’s possible to get a sound which is quite expressive. Vibrato and legato style are not controlled by global parameters, though, and instead are note properties. When it comes to legato, this is not too bad, as long as you remember to switch to the correct style when entering the notes, but setting a few notes in the middle of a phrase to legato requires some fiddling. Adjusting vibrato also gets time-consuming, especially when trying to make an expressive line with several long notes. There are some third-party job plugins (Lua scripts, in essence) that can be used to reroute the pitch bend and bend speed parameters to control vibrato, which is a sign that this is something that could be improved. Coaxing a natural and expressive performance using these parameters and properties takes some time, skill and effort – definitely more than working with a human singer who can nail a song in two or three takes, but less than working with singers who need lots of takes, or whose backing vocals need a lot of time-aligning.
Stacking multiple parts for backing vocals is handled very efficiently. There are job plugins to randomize timing and detune parts, and with some variation in dynamics that can sound very convincingly like one singer overdubbing multiple parts with a natural degree of variation. Varying the Gender parameter between parts turns this sound into a choir of different-sounding people, although these particular Vocaloids will never sound like a gospel or epic Hollywood choir.
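To show the idea behind the randomize and detune jobs, here is a rough Python sketch. It does not use the Vocaloid job plugin API (those plugins are Lua scripts); the note layout, function names and default amounts are hypothetical. Each copy of a part gets small random timing shifts and a few cents of detune per note, which is what keeps stacked copies from sounding like one doubled track.

```python
import random


def humanize_part(notes, max_shift_ticks=15, max_detune_cents=8.0, seed=None):
    """Return a varied copy of a part.

    `notes` is a list of (start_tick, length_ticks, midi_pitch) tuples.
    The tick resolution and the default amounts are illustrative assumptions.
    """
    rng = random.Random(seed)
    varied = []
    for start, length, pitch in notes:
        shift = rng.randint(-max_shift_ticks, max_shift_ticks)
        detune = rng.uniform(-max_detune_cents, max_detune_cents)
        varied.append({
            "start": max(0, start + shift),  # never move a note before tick 0
            "length": length,
            "pitch": pitch,
            "detune_cents": detune,
        })
    return varied


def stack_parts(notes, count, seed=None, **kwargs):
    """Build `count` differently varied copies of the same part."""
    parts = []
    for i in range(count):
        # Give each copy its own seed so the copies differ from one another.
        part_seed = None if seed is None else seed + i
        parts.append(humanize_part(notes, seed=part_seed, **kwargs))
    return parts


if __name__ == "__main__":
    lead = [(0, 480, 60), (480, 480, 62), (960, 960, 64)]
    for part in stack_parts(lead, 3, seed=1):
        print(part)
```

Passing a fixed seed makes the variation repeatable, which helps when you want to re-render a stack without the backing vocals shifting around between takes.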
For the most part, Vocaloid is quick to work with once you’re used to it, and has some nice features which really help with that, such as customizable keyboard shortcuts. Even if you’re not used to it, it’s not as time-consuming as other voice synthesizers I’ve tried, mainly because it really does a great job of timing note start points. However, its integration into a typical workflow is not so smooth, because it’s standalone software. The core problem is that the MIDI protocol and DAWs are not really all that well suited to the way human languages work, so anything that synthesizes or edits vocals is never going to be quite as easy to use as an instrument in VST (or other) plugin format. Still, a VST plugin version with some more control parameters would be a definite step forward in usability.
Even though Vocaloid Editor is a 32-bit application, it’s very efficient at what it does, and also very stable. Perhaps being simpler than a full-featured DAW helps. CPU usage with one voice singing one part is around 15-20% on my machine, and stacking six parts together increased that only to about 25%.
Breaking Through The Flesh Ceiling
If you want a synthesized voice that sounds intelligible and expressive, both Cyber Songman and Cyber Diva deliver the goods. They’re still not quite as expressive as a human singer who can go from a whisper to a scream, but they are good enough to replace human singers in many cases. For short, simple jobs requiring 15-30 seconds of vocals, such as commercial jingles or last-minute additions of some “aah” backing vocals, they can be much faster and more efficient than organizing a recording session with a human singer. For demo tracks which are to be replaced by a human singer, their intelligibility and natural timing make them very useful tools, and even after being replaced they might still be usable as backing vocals.
Aiming even higher, I think it’s possible (though unlikely) for a song using one of these Vocaloids as a featured lead vocal to show up in the Billboard Hot 100. Cyber Songman has better odds of getting there than Cyber Diva, simply because he sounds more natural. I would not be surprised if they both showed up on a big hit as backing vocals, though. Lead vocals on the Dance/Electronic chart are a strong possibility, as Porter Robinson has already done that using the Zero-G Vocaloid Avanna on “Sad Machine”. So, Vocaloid has reached a level of usability at which being the lead vocal on an electronic hit or backing vocals on a pop hit is a reasonable expectation, and lead vocals on a pop hit would be a major success.
Vocaloid Editor, Cyber Songman and Cyber Diva cost ¥10,000 (around $90) each plus tax, with discounted starter packs available.
More info: Vocaloid