For this reason, we looked into a customer journey that does not start in the browser but with the Google Assistant on a smartphone. In principle, this journey is comparable to one on Google Home or Amazon Alexa.
The user starts by voice and asks: "OK Google, ask the Zalando gift finder for gift suggestions for my mother." Google recognizes that the user is invoking the "Zalando" skill and activates it. From then on, the Zalando skill responds to all voice commands, asks a few follow-up questions and then presents results, shown as tiles, using a guided-selling approach.
To order an item, the user has to pick up the smartphone and actively tap the tile, which takes him to the Zalando shop. From here, the journey continues by touching or clicking, that is, with physical interaction on the smartphone screen. The user can view more pictures of the product or add it to the shopping cart by tapping: nothing new, and from the user's perspective neither cool nor special. What started as a voice-controlled customer journey turns into a "not really voice-controlled" experience. You cannot speak to the Zalando shop, it does not answer questions and it does not respond to voice commands.
Hence, we built an add-on for Google Chrome that provides the user with a voice interface. The plan: anyone who visits the website in Google Chrome and has installed the plug-in can continue to browse by voice. Store operators benefit because they can publish a branded plug-in. However, the disadvantages quickly became evident. In a first step, we used keyword rules to try to determine which voice command the user had issued. After several test runs it quickly became clear that every user speaks to their device differently. Keyword-based recognition became incredibly complex very quickly and could no longer be maintained meaningfully. Consequently, we decided to reuse existing knowledge from the chatbot context and to connect Dialogflow as the NLP (natural language processing) engine. The NLP engine was set up and trained on the key intents. In the plug-in, everything the native Google API had converted from speech to text was forwarded to the Dialogflow API to determine the intent. As a result, we no longer had to run a keyword analysis, but outsourced intent recognition (primarily for the intents that represent voice commands to the browser) to Google. With that, our plug-in was "ready to launch" as a beta version. On the desktop PC, everything worked largely without issues. The next goal was to pass the test on a mobile device as well.
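To illustrate why the keyword approach stopped scaling, here is a minimal sketch of rule-based intent matching. The rules, intent names and phrasings are purely illustrative, not taken from the actual plug-in; in practice this whole function was replaced by a call to the Dialogflow API, which returns the intent directly.

```javascript
// Illustrative keyword rules: every phrasing variant needs its own
// pattern, which is why this approach becomes unmaintainable fast.
const keywordRules = [
  { intent: 'scroll_down',      patterns: [/scroll (down|lower)/i, /go down/i] },
  { intent: 'open_cart',        patterns: [/shopping (cart|basket)/i, /open .*cart/i] },
  { intent: 'read_description', patterns: [/read .*(description|details)/i] },
];

function matchIntent(transcript) {
  for (const rule of keywordRules) {
    if (rule.patterns.some((p) => p.test(transcript))) return rule.intent;
  }
  return null; // unrecognized; with an NLP engine this becomes a fallback intent
}

console.log(matchIntent('please scroll down a bit'));   // 'scroll_down'
console.log(matchIntent('put it in my basket please')); // null – no rule covers this phrasing
```

The second call shows the core problem: a perfectly natural phrasing falls through because nobody wrote a rule for it, and adding rules for every variant of every intent is exactly the complexity that made us switch to Dialogflow.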
Based on our first use case, in which the user talks to his smartphone and the Google Assistant, we had to learn, however, that the mobile Chrome browser does not permit the installation of add-ons or plug-ins. Consequently, this solution works only for desktop users, which may be exciting but does not really bring us closer to our goal.
Finding 2: The information density in current stores is completely unsuitable for voice. This is partly due to technical reasons and partly user-related. When asked "What can the product do?" or "Read the description to me", the respective product copy is read out loud. This may sound cute; in practical tests, however, it turned out that texts and descriptions longer than three sentences are simply no longer absorbed and tend to annoy. The technical hurdles with existing content come down to the fact that the Chrome browser can either "talk" or "listen". In other words, while something is being read to you, the microphone is deactivated and must be reactivated after the read-out. As long as the reading continues, the user has no way to interrupt it via voice command. Hence, copy for reading product features and descriptions aloud will likely have to be created and provided separately. The standards for this type of content are high, since the relevant expert information has to be tailored to the unique characteristics of the acoustic channel. Simply abridging existing copy will likely not be a solution.
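The "talk or listen, never both" limitation forces a strict sequencing of output and input. The sketch below models that sequencing; in the browser, `recognition` and `synthesis` would be the Web Speech API's `SpeechRecognition` and `speechSynthesis` objects, but here they are injected as plain parameters (an assumption for illustration) so the logic can be shown outside a browser.

```javascript
// Stop recognition before speaking, restart it only once the utterance
// has finished. While `speak` is running, any voice command ("stop!")
// is simply lost – exactly the interruption problem described above.
function readAloud(text, recognition, synthesis) {
  recognition.stop();            // microphone off – Chrome cannot do both at once
  synthesis.speak(text, () => {
    recognition.start();         // microphone back on only after the read-out
  });
}
```

The practical consequence is visible in the call order: the microphone is dead for the entire duration of the read-out, which is why long product descriptions cannot be skipped by voice.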
Finding 3: On the PC, a kind of "hybrid working mode" turned out to be genuinely fun and useful. In many cases we used voice and mouse interaction in combination. For instance, during the first step of the checkout process, we could say "My name is Alexander Kaepple, I live in..." or "My phone number is...": the NLP engine recognized this information in the background and entered it into the store's form fields. At the same time, we could use the mouse to scroll, accept the GTC or enter a password. The user experience was surprisingly positive because we did not have to type in our full name and address. We had a comparably pleasant experience on the product detail pages, because the content had already been optimized: in response to "What are the unique features of the product?", the system read out the unique features while we scrolled with the mouse or looked at additional images.
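The form-filling part of this hybrid mode can be sketched as a simple mapping step: the NLP engine returns recognized parameters, and a lookup table translates them to the shop's form field ids. The parameter names follow Dialogflow's system-entity style, but the concrete result object and the field ids are assumptions for illustration, not the actual shop's markup.

```javascript
// Illustrative NLP result for "My name is Alexander Kaepple, I live in Berlin".
const nlpResult = {
  intent: 'provide_address',
  parameters: { 'given-name': 'Alexander', 'last-name': 'Kaepple', city: 'Berlin' },
};

// Assumed mapping from NLP parameter names to the store's form field ids.
const fieldMap = { 'given-name': 'firstName', 'last-name': 'lastName', city: 'city' };

function toFormValues(result, map) {
  const values = {};
  for (const [param, value] of Object.entries(result.parameters)) {
    if (map[param]) values[map[param]] = value; // parameters without a mapped field are ignored
  }
  return values;
}

console.log(toFormValues(nlpResult, fieldMap));
// { firstName: 'Alexander', lastName: 'Kaepple', city: 'Berlin' }
```

In the browser, the final step would assign each value to its input, e.g. `document.getElementById(id).value = value`, while the mouse stays free for scrolling and checkboxes.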
However, the fun factor could not hide the obstacles, which, unfortunately, turned out to be grave in mobile use.
Obstacle 2: The fact is that we do not really want to look at a tiny product picture on the smartphone screen (no matter how responsively optimized it is). From our perspective it would be better if we could simply tell the browser: "Show it to me on my TV." Hence, we tried to project the Chrome browser onto a TV via Chromecast. Unfortunately, this does not work as it does on YouTube, where all you have to do is tap a button to see the stream on the Chromecast-connected TV. It took several workarounds (which is anything but user-friendly) to stream the entire smartphone screen and thus display the web store on the TV. With that, we had come one step further. A quick digression at this point: the landscape view is far better suited to large 16:9 screens than the portrait view.
After we had determined that the smartphone's screen lock turns off the screen and with it the microphone, we began searching for a solution to this problem. We looked at existing solutions and workarounds that prevent a screen lock. Libraries such as NoSleep.js or react-wakelock exist; annoyingly, however, the core mechanism of these workarounds is to abort or prevent any HTTP request. This would have rendered our intent and speech recognition via the Google API useless. Hence, the "prevent screen lock" workaround unfortunately does not work for us. It would probably be possible to configure the smartphone so that the screen (and with it the microphone) stays active all the time, but there is no practical way to explain to a user why this should be necessary. After all, who would reconfigure their phone just to visit a specific website? Nobody.
The announcement of a Wake Lock API as part of HTML5 appears to be a possible light at the end of the tunnel. Right now, however, this wake lock is not really supported by mobile browsers. Its integration would once again have to happen via a Chrome app, i.e. as a plug-in, and we have already learned that plug-ins cannot be installed on mobile Chrome. Hence, this currently appears to be a dead end.
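For completeness, this is roughly what the announced API would let a page do. The proposed call is `navigator.wakeLock.request('screen')`, which resolves to a sentinel that can later be released; the wrapper below injects `nav` as a parameter (an assumption for illustration) so the feature-detection and fallback path can be shown outside a browser.

```javascript
// Sketch of using the proposed Screen Wake Lock API, with a graceful
// fallback for browsers that do not expose navigator.wakeLock yet –
// which, as described above, currently includes mobile Chrome.
async function keepScreenAwake(nav) {
  if (!nav.wakeLock) {
    return { acquired: false, reason: 'wake lock not supported' };
  }
  try {
    const sentinel = await nav.wakeLock.request('screen');
    return { acquired: true, release: () => sentinel.release() };
  } catch (err) {
    return { acquired: false, reason: err.message };
  }
}
```

In a real page one would call `keepScreenAwake(navigator)` and, on success, keep the sentinel alive for the duration of the voice session; on mobile Chrome today the fallback branch is the one that runs.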
To get closer to our goal of a seamless voice customer journey, we will not be able to avoid building an app. This app could behave in various ways: either it prevents the screen lock or, if that cannot be avoided, it at least keeps the microphone active. Beyond this, the app could use the store's data to display images and other content up to the start of the checkout process. An alternative, from our perspective, would be to build only the application logic and offer it as an SDK that other companies and software developers can use to provide voice control. Unfortunately, an app-based solution would again require the user to have installed the app. And if you consider what such an app would ultimately do, you also have to ask yourself why the entire shopping process has not simply been realized as a skill on an existing assistant platform such as Google Assistant or Alexa.
While it is currently not possible to stream the Google Assistant's screen onto a TV, it might become possible sooner than we think. And that would be the latest point at which to provide one's product data and content in such a way that even users without a screen can navigate, browse and shop via the company's skill instead of the web store.
Is there still a need for a voice-controlled mobile shop? Probably not, given that surfing and ultimately shopping will be realized by voice via an assistant skill. Going forward, the aim will have to be to convey the product's emotional appeal through voice and other suitable media so that the sale can be completed even without a screen.