Emerging technology: Conversational user interfaces

August 23rd, 2019, Published in Articles: EngineerIT

While most of us dislike talking to machines, the era of conversational user interfaces is here. How often do we mutter, “I really would like to speak to a real person”? Well, don’t be surprised if the answer comes back, “I may be a bot, but I can answer your questions and solve your problems. Try me!”

The technology is emerging, but it may be a while before it is widely adopted; the commercial world is not quite ready to make it real. Yes, we can ask Alexa about tomorrow’s weather and, thanks to all the connected services behind it, the system can identify the location you are asking from and call up the latest forecast, but still in a somewhat cold computer voice.

A conversational user interface is a touchpoint that lets people interact with a system using natural language. The approach is gaining ground because conversational UIs have become far more refined than they were in their early days.

There are two main types of conversational UIs in use at present – chatbots and virtual assistants. Of these, most business websites prefer chatbots. Designing a conversational UI requires a departure from the usual design approach in that the visual design elements are minimal and the primary focus is on the wording of the bot’s conversation with the user. Hence, the conversation content and flow are what need the most work.
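
To make “conversation content and flow” concrete, here is a minimal sketch of a rule-based chatbot flow in Python. It is a hypothetical illustration, not any particular vendor’s framework: the intents, keywords, replies and fallback text are invented, and in practice the designer’s effort goes into exactly those words.

```python
# Minimal sketch of a chatbot conversation flow (hypothetical example).
# The design work lives in the intents, the wording of the replies and
# the fallback text, not in visual elements.

INTENTS = {
    "opening_hours": {
        "keywords": ["open", "hours", "closing"],
        "reply": "We are open on weekdays from 08:00 to 17:00.",
    },
    "speak_to_agent": {
        "keywords": ["agent", "person", "human"],
        "reply": "I can transfer you to a consultant. Please hold on.",
    },
}

FALLBACK = "I'm not sure I understand. Could you rephrase that?"


def respond(user_text: str) -> str:
    """Pick a reply by matching the user's words against each intent."""
    text = user_text.lower()
    for intent in INTENTS.values():
        if any(keyword in text for keyword in intent["keywords"]):
            return intent["reply"]
    return FALLBACK


if __name__ == "__main__":
    print(respond("What time are you open on Monday?"))
    print(respond("I really would like to speak to a real person."))
```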

Today’s most advanced conversational UI comes from the Google stable; it is called Google Duplex. The technology is both impressive and a bit on the creepy side. Google demonstrated a human-sounding robot having a phone conversation with a person who couldn’t even tell that they were talking to a machine. The demo, during Google I/O 2018, freaked some people out because it sounded so real.

A long-standing goal of human-computer interaction has been to enable people to have a natural conversation with computers, as they would with each other. In recent years, we have witnessed a revolution in the ability of computers to understand and to generate natural speech, especially with the application of deep neural networks, as seen in Google voice search and WaveNet. But even with today’s state-of-the-art systems, it is often frustrating having to talk to stilted computerised voices that don’t understand natural language. In particular, automated phone systems still struggle to recognise simple words and commands. They don’t engage in a conversational flow but force the caller to adjust to the system instead of the system adjusting to the caller.

With Google Duplex, Google has taken conversational UI to a new level: a new technology for conducting natural conversations to carry out “real world” tasks over the phone. The technology is directed towards completing specific tasks, such as scheduling certain types of appointments. For such tasks, the system makes the conversational experience as natural as possible, allowing people to speak normally, like they would to another person, without having to adapt to a machine.

One of the key research insights was to constrain Duplex to closed domains, which are narrow enough to explore extensively. Duplex can only carry out natural conversations after being deeply trained in such domains.
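
The article does not describe Duplex’s internals, but the closed-domain idea can be pictured with a simple sketch: before attempting a conversation, the system checks whether the request falls inside one of the narrow domains it has been trained on, and hands over gracefully otherwise. The domain names and the toy classifier below are assumptions made purely for illustration.

```python
# Illustrative sketch of a closed-domain gate (not Google's implementation).
# The system only engages in conversation for tasks it has been deeply
# trained on; everything else is declined or handed to a person.

SUPPORTED_DOMAINS = {"restaurant_reservation", "hair_appointment"}


def classify_domain(request_text: str) -> str:
    """Stand-in for a trained domain classifier (assumed, simplified)."""
    text = request_text.lower()
    if "table" in text or "restaurant" in text:
        return "restaurant_reservation"
    if "haircut" in text or "salon" in text:
        return "hair_appointment"
    return "out_of_domain"


def handle(request_text: str) -> str:
    domain = classify_domain(request_text)
    if domain not in SUPPORTED_DOMAINS:
        # Outside the closed domains the system should not pretend to cope.
        return "Sorry, I can't help with that. Let me hand you over to a person."
    return f"Starting a {domain.replace('_', ' ')} call..."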

The Google Duplex technology is built to sound natural, to make the conversation experience comfortable. There are several challenges in conducting natural conversations: natural language is hard to understand, natural behaviour is tricky to model, latency expectations require fast processing, and generating natural-sounding speech, with the appropriate intonation, is difficult.

When people talk to each other, they use more complex sentences than when talking to computers. They often correct themselves mid-sentence, are more verbose than necessary, or omit words and rely on context instead.

The system also sounds more natural thanks to the incorporation of speech disfluencies (e.g. “hmm”s and “uh”s). These are added when the concatenative text-to-speech (TTS) engine combines widely differing sound units, or as synthetic waits, which allows the system to signal in a natural way that it is still processing. (This is what people often do when they are gathering their thoughts.) In user studies, Google found that conversations using these disfluencies sound more familiar and natural.
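
As an illustration only (the details of Google’s TTS pipeline are not covered in this article), the sketch below sprinkles fillers and a synthetic pause into a response before it is handed to a speech engine. The pause marker, filler list and probability are all invented for the example.

```python
import random

# Illustrative sketch: adding speech disfluencies ("hmm", "uh") and a
# synthetic pause to a response before text-to-speech. The markers and
# probabilities are assumptions, not Google's implementation.

DISFLUENCIES = ["hmm,", "uh,"]


def add_disfluencies(phrases, still_processing=False, p=0.3):
    """Join response phrases, occasionally prefixing one with a filler.

    `phrases` is a list of short text chunks; `still_processing` adds a
    pause marker so the caller hears that the system has not gone silent.
    """
    out = []
    if still_processing:
        out.append("<pause 600ms>")          # synthetic wait
    for phrase in phrases:
        if random.random() < p:
            out.append(random.choice(DISFLUENCIES))
        out.append(phrase)
    return " ".join(out)


print(add_disfluencies(["the first available slot", "is at four o'clock"],
                       still_processing=True))
```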

Also, it’s important for latency to match people’s expectations. For example, after people say something simple, e.g., “hello?”, they expect an instant response, and are more sensitive to latency.
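
One way to picture this, again purely as an illustration, is to route short, simple utterances to an instant canned reply and send everything else through the slower, full understanding pipeline. The quick-reply table and the `full_pipeline` callable below are hypothetical.

```python
# Illustrative sketch: matching response latency to the utterance.
# Short, simple utterances get an immediate canned reply; anything else
# goes through the slower, full pipeline (names are assumptions).

QUICK_REPLIES = {
    "hello?": "Hi, I'm still here.",
    "are you there?": "Yes, I'm here.",
}


def respond(utterance: str, full_pipeline) -> str:
    text = utterance.strip().lower()
    if text in QUICK_REPLIES:
        return QUICK_REPLIES[text]    # answer instantly, no model call
    return full_pipeline(text)        # slower path for real requests
```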

Gartner’s 2019 Top 10 Strategic Technology Trends report said that conversational platforms have reached a tipping point: the usefulness of these systems now exceeds the friction of using them. But they still fall short. Users need to know which domains the UI understands and what its capabilities are within those domains. The challenge conversational platforms face is that users must communicate in a very structured way, which is often frustrating. Rather than enabling a robust two-way conversation between the person and the computer, most conversational platforms are mainly one-directional query or control systems that produce a very simple response.

Over time, more conversational platforms will integrate with growing ecosystems of third-party services, which will drive the usefulness of these systems exponentially. The best conversational platforms will be differentiated by the robustness of their conversational models and by the API and event models used to access, invoke and orchestrate third-party services to deliver complex outcomes.
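
Gartner’s point about API and event models can be sketched as follows: the platform maps a recognised intent onto a call to a third-party service and turns the result into a reply. The service URL, JSON fields and intent name below are hypothetical; `requests` is a common third-party HTTP library assumed to be installed.

```python
import requests  # third-party HTTP library, assumed available

# Illustrative sketch of a conversational platform orchestrating a
# third-party service. The URL and JSON fields are hypothetical.


def fulfil_intent(intent: str, slots: dict) -> str:
    if intent == "book_table":
        resp = requests.post(
            "https://api.example-bookings.com/reservations",  # hypothetical
            json={"people": slots["people"], "time": slots["time"]},
            timeout=5,
        )
        if resp.ok:
            return f"Done, a table for {slots['people']} is booked at {slots['time']}."
        return "Sorry, the booking service is not responding right now."
    return "I can't do that yet."
```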

Through 2020, application vendors will increasingly include conversational platforms in packaged applications, the Gartner report said. They will do so to maintain a direct channel to their users, rather than being intermediated by a conversational platform they don’t control. Gartner analysts expect ongoing battles between application vendors and providers of general-purpose conversational platforms through 2023, followed by consolidation in the market.

For now, we may have to live with a repeated response, “I don’t understand your question”.
