Voice-as-UI: Where we stand and what we know

 "Over the long term, webmasters should ponder
how they can be successful with their website if
the user is no longer glued to the screen."

— John Mueller, in an interview with onlinemarketing.de



We have arrived at the exact same conclusion, which is why we have been working on the topic of Voice-as-UI since mid-2017. If there is no screen anymore, how will user behavior change, and how should we design access to information and to online shopping in the future? The question of how today's consumers interact with Google Assistant, Alexa or Siri must be examined not only from the SEO perspective, but primarily from the user perspective, and we wonder how this behavior will change. How far does a website or store operator have to go to meet users' high expectations?

For this reason, we looked into a customer journey that does not start in the browser but with the Google Assistant on a smartphone. In essence, the journey is comparable on Google Home or Amazon Alexa.

The user starts with a voice command and asks: "OK Google, ask the Zalando gift finder for gift suggestions for my mother." Google recognizes that the user is addressing the "Zalando" skill and activates it. From then on, the Zalando skill responds to all voice commands, asks a few follow-up questions and then presents results as tiles, using a guided-selling approach.



[Image: voice-as-UI flow | diconium]

The user must pick up the smartphone and actively tap the tile to order the item. He is then directed to the Zalando shop. From here, the journey continues with touching or clicking – that is, with physical interaction on the smartphone's screen. The user can view more pictures of the product or add it to the shopping cart by tapping – nothing new, and from the user's perspective neither cool nor special. What started as a voice-controlled customer journey turns into a "not really voice-controlled" experience: you cannot speak to the Zalando shop, it does not answer questions, and it does not respond to voice commands.


Why not?

We considered this question because what we actually wanted was a "seamless" customer journey. Amazon Alexa does a good job, but unfortunately the experience is product-focused, and the first recommendation is always limited to a single product. While the ability to place orders by voice is a nice touch, even Alexa's visual presentation capabilities are quite limited. So we wondered what would have to happen if a customer journey starts in the voice context and is then redirected to a web store by the Google or Amazon logic. The answer was obvious: the store has to be able to accept voice commands, to support intuitive navigation and to give advice by speech. We went to work and evaluated different options. In the process, we noticed that the Google Chrome browser supports voice recognition. This means that with reasonable effort one can write a JavaScript that recognizes the user's voice commands, converts them into text through built-in interfaces to Google services, and responds.
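The recognition path described above can be sketched with the Web Speech API that Chrome exposes (`webkitSpeechRecognition`). The function names here are illustrative, not part of any shipped library; the transcript helper is kept pure so it is easy to test:

```javascript
// Sketch of browser voice recognition via the Web Speech API (Chrome).

// Pure helper: pull the most recent *final* transcript out of a
// SpeechRecognitionEvent-style results list. Each result holds
// alternatives carrying a `transcript`, plus an `isFinal` flag.
function latestFinalTranscript(results) {
  for (let i = results.length - 1; i >= 0; i--) {
    if (results[i].isFinal) return results[i][0].transcript.trim();
  }
  return null; // only interim results so far
}

// Browser-only setup: wire the recognizer to a command callback.
function startListening(onCommand) {
  const SR = window.SpeechRecognition || window.webkitSpeechRecognition;
  if (!SR) throw new Error('Speech recognition is not supported here');
  const rec = new SR();
  rec.lang = 'en-US';
  rec.onresult = (event) => {
    const text = latestFinalTranscript(event.results);
    if (text) onCommand(text);
  };
  rec.start();
  return rec;
}
```

In the browser, `startListening(text => console.log(text))` is all it takes to start receiving recognized phrases.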



We therefore built an add-on for Google Chrome that provides the user with a voice interface. The plan: anyone who visits the website with Google Chrome and has the plug-in installed can continue to browse by voice. Store operators benefit because they can publish a branded plug-in. However, the disadvantages quickly became evident. In a first step, we used keyword rules to determine which voice command the user had issued. After several test runs it became clear that every user speaks to his device differently. Keyword-based recognition became incredibly complex very quickly and could no longer be maintained meaningfully. Consequently, we decided to draw on our existing knowledge from the chatbot context and to connect Dialogflow as the NLP (natural language processing) engine. The NLP model was set up and trained with the key intents. In the plug-in, everything the native Google API had converted from voice to text was forwarded to the Dialogflow API to determine the intent. As a result, we no longer had to run a keyword analysis, but outsourced intent recognition (primarily for the intents that represent voice commands to the browser) to Google. With that, our plug-in was "ready to launch" as a beta version. On the desktop PC, everything worked largely without issues. The next goal was to pass the test on a mobile device as well.
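A toy version of the keyword rules we started with illustrates why the approach stopped scaling: every phrasing variant needs its own entry, and unmatched utterances fall through. The rule set below is invented for illustration, not our production rules:

```javascript
// Toy keyword-rule intent matcher (illustrative rules, not production data).
// Every way a user might phrase a command needs its own keyword entry,
// which is why this approach became unmanageable and was replaced by an
// NLP engine (Dialogflow) that resolves intents server-side.
const RULES = [
  { intent: 'scroll_down', keywords: ['scroll down', 'go down', 'further down'] },
  { intent: 'add_to_cart', keywords: ['add to cart', 'put in the cart', 'buy it'] },
];

function matchIntent(utterance) {
  const text = utterance.toLowerCase();
  for (const rule of RULES) {
    if (rule.keywords.some((k) => text.includes(k))) return rule.intent;
  }
  return null; // unmatched -> would be forwarded to the NLP engine instead
}
```

A phrasing like "toss it into my basket" already falls through the cart rules, while Dialogflow can be trained to resolve it to the same intent.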

Based on our first use case, in which the user talks to his smartphone and the Google Assistant, we did, however, have to learn that the mobile Chrome browser does not permit the installation of add-ons or plugins. Consequently, this implementation solution works only for desktop users, which may be exciting, but doesn't really lead us to our goal.

Given that Chrome plug-ins are essentially written in JavaScript, it seemed sensible to implement the existing voice and intent recognition directly as a JavaScript library in the source code of a web store. This allowed us to bypass the plug-in problem of the mobile Chrome browser without having to start from scratch.



We therefore took the plug-in's code and extracted the JavaScript from it. It can then be integrated into the source code of a website or store without any problems and provides both voice and intent recognition.

Finding 1: Without SSL, nothing works. We had ignored this issue entirely during plug-in development, since a plug-in doesn't care whether the user is on an encrypted site or not. However, the JavaScript and the connected Google APIs do care. We adapted our library, integrated it into a store test system (which runs on an SSL-encrypted domain) and started testing. In doing so, we discovered further interesting insights, but unfortunately also obstacles.
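A minimal sketch of the guard this finding forced into the library: browsers only expose microphone and speech APIs in a secure context, so the library should refuse to start on plain HTTP. (Browsers treat `localhost` as secure, which is handy during development.) The function name is ours:

```javascript
// Guard added after Finding 1: voice features only start in a secure
// context. `localhost` is treated as secure by browsers, so local
// development keeps working without a certificate.
function voiceAllowed(protocol, hostname) {
  if (protocol === 'https:') return true;
  return hostname === 'localhost' || hostname === '127.0.0.1';
}
```

In the browser this would be called as `voiceAllowed(location.protocol, location.hostname)` before any recognizer is created.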

Finding 2: The information density in current stores is completely unusable for voice. This is due partly to technical reasons and partly to user behavior. When asked "What are the product's capabilities?" or "Read the description to me", the respective product copy is read out loud. This may sound cute; in practical tests, however, it turned out that texts and descriptions longer than three sentences simply stop registering and tend to annoy users. The technical hurdles with existing content come down to the fact that the Chrome browser can either "talk" or "listen". In other words, while something is being read to you, the microphone is deactivated and must be re-activated afterwards; as long as the reading continues, the user has no way to interrupt it via voice command. Copy for reading out product features and descriptions will therefore likely have to be created and provided separately. The standards for this type of content are high, since the essential information has to be tailored to the particular characteristics of the acoustic channel. Simply abridging existing copy will probably not be a solution.
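The "talk or listen, never both" constraint can be modelled as a tiny explicit state machine, which is roughly how our library coordinated recognition and text-to-speech. The class and method names are illustrative:

```javascript
// Minimal sketch of the talk/listen mutual exclusion: recognition must be
// stopped before speech synthesis starts, and re-armed when the utterance
// has finished. State is explicit so the coordination can be tested.
class VoiceChannel {
  constructor() { this.state = 'listening'; }
  startSpeaking() {
    if (this.state === 'speaking') return false; // already talking
    this.state = 'speaking'; // microphone must be off while TTS runs
    return true;
  }
  finishSpeaking() { this.state = 'listening'; } // re-arm the microphone
  canListen() { return this.state === 'listening'; }
}
```

In the browser, `finishSpeaking()` would be wired to the `onend` event of a `SpeechSynthesisUtterance`, and `canListen()` would gate `recognition.start()` – which is exactly why a long read-out cannot be interrupted by voice.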

Finding 3: On the PC, a kind of "hybrid work mode" turned out to be real fun and made sense. In many cases we used voice and mouse interaction in combination. For instance, in the first step of the checkout process we could say "My name is Alexander Kaepple, I live in..." or "My phone number is..." – the NLP engine recognized this information in the background and entered it into the store's form fields. At the same time, we could use the mouse to scroll, accept the GTC or enter a password. The user experience was surprisingly positive because we did not have to type our full name and address. We had a comparably pleasant experience on the product detail pages, where the content had already been optimized: in response to "What are the unique features of the product?", the system read out the unique features while we scrolled with the mouse or browsed additional images.
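The checkout dictation step boils down to mapping recognized entities onto form fields. The real entity extraction happened in Dialogflow; the regex version below is only an illustrative stand-in showing the shape of that mapping (names and patterns are ours, and a toy example like this handles far fewer phrasings than the NLP engine):

```javascript
// Illustrative stand-in for NLP entity extraction during checkout:
// pull a name and a city out of dictation like
// "My name is Jane Doe, I live in Stuttgart".
// In our setup Dialogflow did this work; the regexes here only
// demonstrate the mapping from recognized entities to form fields.
function extractCheckoutFields(utterance) {
  const fields = {};
  const name = utterance.match(/my name is ([^,.]+)/i);
  if (name) fields.name = name[1].trim();
  const city = utterance.match(/i live in ([^,.]+)/i);
  if (city) fields.city = city[1].trim();
  return fields; // caller writes these values into the store's inputs
}
```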

However, the fun factor could not hide the obstacles, which turned out to be serious – unfortunately, above all in mobile use.

Obstacle 1: On a mobile device, the browser is far more sensitive in terms of voice recognition. We could dictate entire sentences to the PC, while the smartphone shut down the microphone after the first word. An adaptation of the JavaScript resolved this hurdle, but use remained difficult in noisy environments because of the highly sensitive microphone. Not because the device recognized too much, but because background noise made it hard for the phone to determine when the user had stopped talking. The device simply kept listening, recognized no further commands, and the background noise meant voice recognition never completed.
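The JavaScript adaptation mentioned above can be sketched as two parts: keeping the recognizer in continuous mode, and a restart guard for when mobile Chrome ends recognition early anyway. The restart limit is an assumption of this sketch, not a value from our library:

```javascript
// Sketch of the mobile "stops after the first word" fix.

// Pure restart guard: when the recognizer fires `onend` early, decide
// whether to start it again. Don't fight a deliberate stop by the user,
// and don't loop forever in a noisy environment (threshold is assumed).
function shouldRestart(userStopped, restartCount, maxRestarts = 5) {
  return !userStopped && restartCount < maxRestarts;
}

// Recognizer configuration: don't end after the first phrase, and surface
// partial results while the user is still talking.
function configureRecognizer(rec) {
  rec.continuous = true;
  rec.interimResults = true;
  return rec;
}
```

In the browser, the `onend` handler would call `shouldRestart(...)` and, if it returns true, invoke `recognition.start()` again.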

Obstacle 2: The fact is that we do not really want to look at a tiny picture of a product on a smartphone screen (no matter how responsively it is optimized). From our perspective, it would be better if we could simply tell the browser to "show it on my TV". So we tried to project the Chrome browser onto a screen via Chromecast. Unfortunately, this does not work as it does on YouTube, where you just tap the cast button and promptly see the stream on the Chromecast-connected TV. It took several workarounds (which is anything but user-friendly) to stream the entire smartphone screen and thus display the web store on the TV. Still, we had moved one step further. A quick digression at this point: landscape view is far better suited to large 16:9 screens than portrait view.

[Image: voice-as-UI on the smartphone | diconium]



After we determined that the smartphone's screen lock turns off the screen and with it the microphone, we began searching for a solution to this problem. We looked at existing solutions and workarounds that prevent the screen lock. Libraries such as NoSleep.js or react-wakelock exist; annoyingly, however, the core mechanism of these workarounds aborts or prevents HTTP requests. That would have rendered our intent and speech recognition via the Google API useless. So the "prevent screen lock" workaround unfortunately does not work for us. It would probably be possible to configure the smartphone so that the screen (and with it the microphone) stays on permanently, but there is no practical way to explain to a user why this should be necessary. After all, who would reconfigure their phone just to visit a specific website? Nobody.

The announced HTML5 Wake Lock API appears to be a possible light at the end of the tunnel. Right now, however, this wake lock is not really supported by mobile browsers. Its integration would once again have to happen via a Chrome app, i.e. as a plug-in – and we have already learned that plug-ins cannot be installed in mobile Chrome. For now, this appears to be a dead end.
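For reference, the Screen Wake Lock path would look roughly like the sketch below, guarded by feature detection since support was exactly the open question for us. The helper names are ours; `navigator.wakeLock.request('screen')` is the API surface the specification describes:

```javascript
// Sketch of the Screen Wake Lock API path, behind feature detection.
// The navigator object is passed in so the check is testable.
function wakeLockSupported(nav) {
  return Boolean(nav && 'wakeLock' in nav);
}

async function keepScreenOn(nav) {
  if (!wakeLockSupported(nav)) return null; // unsupported browser: give up
  // request('screen') resolves to a sentinel; releasing it (or hiding
  // the tab) lets the screen lock again.
  return nav.wakeLock.request('screen');
}
```

In a supporting browser this would be `keepScreenOn(navigator)`; on the mobile browsers we tested, `wakeLockSupported` would simply have returned false.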



To get closer to our goal – a seamless voice customer journey – we will not be able to avoid building an app. This app could behave in various ways: either no screen lock, or – if that cannot be avoided – at least no microphone deactivation. Beyond that, the app could use the store's data to display images and other content up to the start of the checkout process. An alternative, from our perspective, would be to create only the application logic and offer it as an SDK that other companies and software developers use to provide voice control. Unfortunately, an app-like solution would again require the user to have installed the app. And if one considers what such an app would ultimately do, one also has to ask why the entire shopping process has not simply been realized as a skill on an existing assistant platform such as Google Assistant or Alexa.

While it is currently not possible to stream the Google Assistant's screen onto a TV, who knows – it might become available sooner than we think. And that would be the latest point at which to provide one's product data and content in such a way that even users without a screen can navigate, browse and shop via the company's skill instead of the web store.

Is there still a need for a voice-controlled mobile shop? Probably not, given that browsing and ultimately shopping will be realized by voice via the assistant skill. Going forward, the aim will have to be to convey the emotion of the product through voice and other suitable media, so that a sale can be completed even when no screen is available.


Also read our whitepaper "Voice-as-UI":

The author


Your contact at diconium

Alexander Käppler
senior digital consultant