Would a text-to-speech synthesizer sometimes mispronounce pitch accent?

According to this video, this speech synthesizer isn’t perfect either, at least pitch accent-wise.

[Human VS Computer](https://youtu.be/gMT_tOZa__g) (original video: [Sugar Buster Anko](https://www.youtube.com/watch?v=mKz6cU8pOqM))

I thought this would definitely be much better than *yukkuri* (?), but now I can only imagine how hard it must be for computers to get the pitch right.

3 comments
  1. You would need a system that does a read-through of the entire script, comprehends the context, and retroactively applies emotion and intonation accordingly, based on a machine-learning engine (a rough sketch of this two-pass idea appears after the comments).

    By the way, I think the appeal (if you can call it that) of Yukkuri is its flatness and obvious artificiality. It’s easy on the ears compared to something that’s trying too hard to be human.

  2. What could happen before that is that people will learn to speak more from the AI, the way chess players are now starting to mimic chess engines and play moves that look weird to older players.

    If people start hanging out more with AI than with other people, they’ll start mimicking the AI’s speech rather than picking things up from each other. Like how regional accents start to go away once you have mass media.

    If we ignore that, then the answer is still probably yes eventually, but not yet.

    If you look at how far image synthesis has come in the last few years, especially with deep learning, the same thing could happen in the audio domain. But it will require a lot of training data.

  3. You just need to listen to natives and imitate them. Only speakers of vowel-dominant languages find it hard.
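
Here is a minimal sketch of the two-pass idea from the first comment, in Python. Everything in it (`Prosody`, `analyze_script`, `assign_prosody`, the toy heuristics) is made up for illustration; it is not a real TTS API, and a real system would replace the hand-written rules with a machine-learning model.

```python
from dataclasses import dataclass

@dataclass
class Prosody:
    pitch_shift: float  # semitones relative to the voice's baseline
    rate: float         # speaking-rate multiplier
    emotion: str        # coarse label a synthesizer could map to a style

def analyze_script(sentences):
    """Pass 1: read the whole script and build per-sentence context.

    This stand-in just notices exclamation marks and tracks whether the
    previous sentence asked a question, so later sentences can react to
    earlier ones; an ML model would do this job in a real system."""
    context = []
    question_pending = False
    for s in sentences:
        context.append({
            "is_answer": question_pending,
            "excited": "!" in s,
        })
        question_pending = s.rstrip().endswith("?")
    return context

def assign_prosody(sentences, context):
    """Pass 2: retroactively decide prosody using the full-script context."""
    plan = []
    for s, ctx in zip(sentences, context):
        if ctx["excited"]:
            plan.append(Prosody(pitch_shift=2.0, rate=1.1, emotion="excited"))
        elif ctx["is_answer"]:
            plan.append(Prosody(pitch_shift=-1.0, rate=0.95, emotion="calm"))
        else:
            plan.append(Prosody(pitch_shift=0.0, rate=1.0, emotion="neutral"))
    return plan

if __name__ == "__main__":
    script = [
        "Did the synthesizer get the accent right?",
        "Not quite.",
        "But it is getting closer every year!",
    ]
    for sentence, prosody in zip(script, assign_prosody(script, analyze_script(script))):
        print(f"{prosody.emotion:>8} | {sentence}")
```

The point of the two passes is that intonation for a sentence can depend on sentences that come after it as well as before, which is why a system that streams the text one sentence at a time tends to get pitch wrong.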
