We have previously written a detailed guide about chatbots. We then started to develop objective metrics for measuring chatbot performance. We also shared some success stories, because they are rare and ambitious designers of conversational interfaces need to study them: for every hundred failures, there are only a few successes. We also provided a basic guide to A/B testing. In this article, we share some of the key methods for chatbot testing.
Chatbots are indeed revolutionizing the interaction between organizations and individuals, but one thing the industry still hasn’t achieved is standardized chatbot testing. Instead, we observe performance indirectly. Claims such as a 10 times better ROI compared to email marketing make sense only once the chatbot has been implemented. Although various performance metrics have been developed, the nature of chatbots makes it hard to develop a uniform score or test for all types of chatbots.
There are various reasons for that. One is that chatbots depend heavily on the aim they are developed for, while users do not always stay within that aim. In e-commerce, for example, developers expect the user to be aware of what they are chatting with, so the bot needs to keep the user engaged on that specific topic. This makes a common test hard to design even within e-commerce: a chatbot specialized in clothes wouldn’t provide the same agile responses as a chatbot developed for shoes.
Standardized tests lack this domain sensitivity. Instead, the testing funnel should follow an inverse pyramid shape, and question-and-answer testing fits that structure. Q&A should start with broad questions that even the simplest chatbot is expected to answer. This is the part where the chatbot greets and welcomes the user and answers simple questions. It is the type of dialogue we expect when a salesperson approaches us: an attempt to get to know the other person in the most basic way.
If the chatbot fails the general test, the other steps of testing won’t make any sense. Chatbots are expected to keep the conversation flowing; if they fail at the first stage, the user will likely leave the conversation, effectively damaging metrics such as conversation rate and bounce rate.
Domain Specific Test
The second stage tests the chatbot on the specific product or service group. The language and expressions related to the product are the main driving force for the test. For example, the word "retention" means something completely different for a digital marketing vendor’s chatbot than for a government agency’s chatbot. Therefore, testing questions should be categorized. It is almost impossible to capture every type of question related to a specific type of product, since there is a continuum of products and services, but it is possible to provide broad classes, just like the classes we provide on AIMultiple.com.
These context-related questions are the ones that drive the consumer to buy the product or service. Once the greeting and welcome part is over, the rest of the conversation is about the product or service. Therefore, chatbots need to ace this part, or at least attain the highest possible ratio of correct answers.
The third stage, the limit test, probes the limits of the chatbot. For the first two steps, we assumed ordinary phrasing and meaningful sentences. This last step shows what happens when the user sends irrelevant input and how the chatbot handles it. That way, it is easier to see how the chatbot behaves when it fails.
These three steps are the most basic ones to take before releasing the chatbot. They capture the key aspects of a chatbot and enable a company to pinpoint problems before putting it into use. This beta-test-like procedure helps capture the most crucial sides of chatbot testing.
The pyramid approach comprises three broad steps:
- General Case Test
- Domain Specific Test
- Limit Test
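The three steps above can be sketched as a small test harness. This is a minimal illustration, not a standard tool: the `toy_bot` function, the keyword-matching check, the sample questions, and the 0.8 pass threshold are all assumptions made for the example. The harness runs the stages in order and stops as soon as one stage falls below the threshold, mirroring the idea that later stages make little sense if the general test fails.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A case pairs a user message with a keyword the reply is expected to contain.
Case = Tuple[str, str]


@dataclass
class StageResult:
    name: str
    passed: int
    total: int

    @property
    def ratio(self) -> float:
        """Share of cases in this stage that the bot answered acceptably."""
        return self.passed / self.total if self.total else 0.0


def run_stage(name: str, bot: Callable[[str], str], cases: List[Case]) -> StageResult:
    """Send every message to the bot and count replies containing the keyword."""
    passed = sum(1 for msg, keyword in cases if keyword.lower() in bot(msg).lower())
    return StageResult(name, passed, len(cases))


def run_pyramid(bot: Callable[[str], str],
                stages: List[Tuple[str, List[Case]]],
                threshold: float = 0.8) -> List[StageResult]:
    """Run stages from broad to narrow; stop once a stage falls below threshold."""
    results = []
    for name, cases in stages:
        result = run_stage(name, bot, cases)
        results.append(result)
        if result.ratio < threshold:
            break  # later stages are not meaningful if this one fails
    return results


def toy_bot(message: str) -> str:
    """Hypothetical stand-in for a real shoe-store chatbot."""
    text = message.lower()
    if "hello" in text or "hi" in text.split():
        return "Hello! How can I help you today?"
    if "shoe" in text:
        return "We stock running shoes in sizes 36 to 47."
    return "Sorry, I did not understand that. Could you rephrase?"


STAGES = [
    ("general", [("Hello", "hello"), ("hi there", "help")]),
    ("domain", [("Do you sell running shoes?", "shoes")]),
    ("limit", [("asdf qwerty 123", "rephrase"), ("", "rephrase")]),
]

results = run_pyramid(toy_bot, STAGES)
for r in results:
    print(f"{r.name}: {r.passed}/{r.total}")
```

In practice, the keyword check would be replaced by whatever correctness criterion fits the chatbot, and the question lists would come from real conversation logs or a categorized question bank.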
Chatbottest’s Standardized Test
Currently, standardizing chatbot tests is still problematic. One major standardization project is Chatbottest. The project provides a free database of 120 questions to test a chatbot and its user experience.
The concept they developed has a Gaussian nature. The test mechanism broadly follows three categories: expected scenarios, possible scenarios, and almost impossible scenarios. This scenario testing structure can be mapped to sigma distances.
Empirically, testing up to the almost impossible scenarios, which can be treated as the 3-sigma distance, covers about 99.7% of cases under a normal distribution. Testing further would be costly, since the human capability of producing sentences has no upper bound.
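To make the sigma analogy concrete, the share of a normal distribution that lies within k standard deviations of the mean is erf(k / sqrt(2)). The short sketch below computes it for k = 1, 2, 3. Mapping expected and possible scenarios to 1 and 2 sigma is an assumption made here for illustration; the text above only ties the almost impossible scenarios to the 3-sigma distance.

```python
import math


def coverage_within(k_sigma: float) -> float:
    """Fraction of a normal distribution within k_sigma standard deviations."""
    return math.erf(k_sigma / math.sqrt(2))


# Assumed mapping of scenario classes to sigma distances (illustrative only).
SCENARIO_SIGMA = {"expected": 1, "possible": 2, "almost impossible": 3}

for name, k in SCENARIO_SIGMA.items():
    print(f"{name}: within {k} sigma covers {coverage_within(k):.2%}")
# expected: within 1 sigma covers 68.27%
# possible: within 2 sigma covers 95.45%
# almost impossible: within 3 sigma covers 99.73%
```

The diminishing gain from 2 to 3 sigma, and the even smaller gain beyond it, is one way to read the remark that testing further becomes costly.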
7 Key Test Metrics
Through its scenario testing, Chatbottest provides tests for 7 broad categories:
- Personality: Does the chatbot have a clear voice and tone that fits with the users and with the ongoing conversation?
- Onboarding: Do users understand what the chatbot is about and how to interact with it from the very beginning?
- Understanding: Requests, Smalltalk, idioms, emojis… What is the chatbot able to understand?
- Answering: What elements does the chatbot send and how well does it do so? Are they relevant to the moment and context?
- Navigation: How easy is it to go through the chatbot conversation? Do you feel lost sometimes while speaking with the chatbot?
- Error management: How good is the chatbot at dealing with all the errors that are going to happen? Is it able to recover from them?
- Intelligence: Does the chatbot have any intelligence? Is it able to remember things? Does it use and manage context as a person would?
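One way to work with these seven categories is to record tester ratings per category and look at the averages. The sketch below is only an illustration: the 1 to 5 rating scale, the sample ratings, and the `summarize` helper are assumptions for the example and are not part of Chatbottest itself.

```python
from statistics import mean
from typing import Dict, List

CATEGORIES = [
    "personality", "onboarding", "understanding", "answering",
    "navigation", "error management", "intelligence",
]


def summarize(ratings: Dict[str, List[int]]) -> Dict[str, float]:
    """Average the ratings per category; every category must be rated."""
    missing = [c for c in CATEGORIES if c not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {', '.join(missing)}")
    return {c: mean(ratings[c]) for c in CATEGORIES}


# Hypothetical ratings from two testers on an assumed 1-5 scale.
ratings = {
    "personality": [4, 5],
    "onboarding": [3, 4],
    "understanding": [2, 3],
    "answering": [4, 4],
    "navigation": [5, 4],
    "error management": [2, 2],
    "intelligence": [3, 3],
}

summary = summarize(ratings)
weakest = min(summary, key=summary.get)
print(f"weakest category: {weakest} ({summary[weakest]})")
```

Tracking the weakest category across releases gives a simple signal of where the next round of improvements should go.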
Dimon is another platform that helps you with chatbot testing. Dimon enables bot and chatbot owners to identify and fix issues in their bot conversations. Dimon has integration with major platforms such as Slack, Facebook Messenger, Telegram, and WeChat.
Botanalytics also provides a custom service for chatbot testing. Bottesting.co is designed to help you with your bot through conversation, usability, user experience, and user metrics.
Human Testing of Chatbots
If the goal is to improve performance further, there are other methods. One way is to use Amazon’s Mechanical Turk, which operates a marketplace for work that requires human intelligence. The Mechanical Turk web service enables companies to programmatically access this marketplace and a diverse, on-demand workforce. Developers can leverage this service to build human intelligence directly into their applications. This service can be used for further testing and to reach a higher confidence level.
Many of the key metrics and key tests mentioned in this article are broad test categories. It is possible to test further and generate ad hoc categories and methods, but it is important to note that chatbots are bots. No matter how hard people try, at the current stage, chatbots have limits. So, expecting human-like performance from a chatbot is like expecting god-like performance from a human. It happens every once in a while, but not overnight and not all the time.
Agile development is still the key to success. Even after the chatbot is released, the process continues. Feedback is the most essential element in shaping the chatbot. A beta test is only a beta test: it provides performance details for in-sample properties. Real-life performance should be monitored closely to keep the chatbot versatile and robust.
Greater chatbot integration is expected. The scalability and cost-effectiveness of chatbots should further spur growth. Marketing activities and many selling processes can be handled through chatbots. For that reason, objective testing of chatbots will be critical for further growth.
While standardized tests give us a chance for comparison, it is important to be aware of their limitations. Goodhart’s law states that once a social or economic measure is turned into a target for policy, it will lose any information content that had qualified it to play such a role in the first place. Therefore, keeping the testing process as dynamic as possible will make the whole process more meaningful and provide antifragility for the chatbot.