AI can beat humans in the games of Chess, Jeopardy, and Go. Recently, I’ve discovered that AI systems can also beat 70 human testers at their own game — testing. I spend my days coercing machines to execute test cases with techniques such as neural networks, clustering, and reinforcement learning. It is a lot of work, but much of it is mechanical: with AI driving cars around town and playing Atari video games these days, it seems obvious that AI should also be able to execute a test case or two for us.

Testers continually ask me if their jobs are safe, and I’ve usually said yes for three reasons:

  1. Human minds are still writing the actual test cases today; the AI just executes them for us
  2. When I’m asked this question, I’m typically surrounded by 50 to 1000 testers and I do value my life
  3. There have always been areas of testing that I thought could not be relegated to the machines anytime soon, specifically qualitative assessments of quality

Qualitative assessments of quality are things such as: How easy is this app to use? Does this app look good? Is this screen trustworthy? If we humans, myself included, are confused by some of these qualitative judgments of quality, then surely our jobs would be safe from the machines if we were to focus on these areas.

The Challenge

I co-presented a full-day tutorial on the topic of AI and machine learning for testers to a room of 70 testers. These were no ordinary testers; they were professional, technical testers who work in roles where their companies can afford to send them off to Disneyland for a week of training. They were also confident enough in themselves to brave a full day of learning about AI and machine learning algorithms. The room was full of great testers.

Let’s pretend you are in that room with us and see if the AI can beat you too.

I asked the audience, and now I’m asking you, a qualitative testing question:

Question: If you were looking at screenshots of login pages, how would you tell if the login page is trustworthy or not? How would you rate the page’s trustworthiness?

Think about it for a moment…

Samples of Untrustworthy and Trustworthy Login Pages

Take your time and look at some of the example login screens above.

What do you think?

The Results: 70 Experts

The other 70 people in the room took their time to think about this one too. There were no quick answers. A general ‘hmmmph’ sounded in the room. To make folks feel better about drawing a blank, I said that I originally had no idea either. I encouraged folks to keep thinking. A woman in the front row (whose son works on AI at Google) ventured a guess: “foreign languages.” If the login page was in the US app store but had non-English words in it, it could be viewed as less trustworthy because users wouldn’t know what the app was telling them. A great start, but it came only after some three minutes of 70 human minds thinking in parallel.

After a few more moments, a second hand went up, and this gentleman suggested that if the login had a well trusted and recognized brand name or logo, the login page would be more trustworthy. If the login page showed the Google or Microsoft logo, then it is probably more trustworthy than the average app. If the login page had a logo the user had never seen or heard of before, it would be seen as less trustworthy.

At this point, we had burned perhaps 70 × 5 minutes of human compute time, about six hours of top human testing brainpower, trying to answer this question. And yet we had only two ideas on how to measure trustworthiness. We also had no method of scoring exactly how trustworthy or untrustworthy a login screen is — just some ideas on what might influence that qualitative judgment of quality.

The Results: AI

OK, so now we call time on the human minds, which can be fickle, impatient, and bored. How did the ‘AI’ do? Previously, at world headquarters, we had a few non-technical humans simply look at a large number of login screens and rate the trustworthiness of each page from 1 to 10. Then we simply trained a neural network with this data. What was the AI’s answer to the question of how to determine the trustworthiness of a login page? We peeked inside the neural network weights and found the AI’s answer was:

  1. Foreign characters — If the login page has foreign characters or words in it, it is less trustworthy.
  2. Brand recognition — If the login page has a popular/recognizable brand image, it is more trustworthy.
  3. Number of elements — If the login page has more elements, it is less trustworthy; the fewer elements on the page, the more trustworthy.
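The article doesn’t publish the actual model, data, or feature extractors, so here is a rough, hypothetical sketch of the idea in Python: train a one-neuron “network” on human 1–10 ratings of a few screenshot features, then “peek inside” the learned weights to see which features push trustworthiness up or down. The three features, the ratings, and the resulting weights are all invented for illustration (the ratings follow a simple hidden linear rule, so the recovered weights are predictable):

```python
import numpy as np

# Hypothetical sketch only. Each row of X describes one login screenshot:
#   [foreign_char_ratio, has_known_brand_logo, element_count_scaled]
X = np.array([
    [0.0, 1, 0.2],  # English text, recognizable brand, sparse page
    [0.0, 0, 0.2],
    [0.6, 0, 0.9],  # mostly foreign text, unknown brand, busy page
    [0.5, 1, 0.8],
    [0.2, 0, 0.7],
    [0.3, 1, 0.3],
])
# Invented human trustworthiness ratings on a 1-10 scale
# (generated from a hidden linear rule so the demo is deterministic).
y = np.array([9.4, 7.4, 2.3, 5.1, 4.9, 7.6])

# Train a one-neuron "network" (a linear model) by gradient descent.
w = np.zeros(3)
b = 0.0
lr = 0.2
for _ in range(50_000):
    err = X @ w + b - y               # prediction error per screenshot
    w -= lr * (X.T @ err) / len(y)    # gradient step on the weights
    b -= lr * err.mean()              # gradient step on the bias

# "Peek inside" the learned weights: each weight's sign shows whether a
# feature pushes the predicted trustworthiness score up or down.
for name, weight in zip(["foreign_chars", "brand_logo", "element_count"], w):
    print(f"{name}: {weight:+.2f}")
# prints approximately:
#   foreign_chars: -5.00
#   brand_logo: +2.00
#   element_count: -3.00
```

In a real system each feature would come from a vision/OCR pipeline over the screenshot and the network would be deeper, but the point is the same: the learned weights, not a human, encode which cues raise or lower trustworthiness.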

The machines beat the humans! Not only did the AI discover an additional aspect of the application that correlates to trustworthiness (#3), but the AI also gave a precise score of how trustworthy or untrustworthy a login page is on a scale of 1 to 10.

Interestingly, the trained AI was smarter than any single human that labeled the data. The neural network effectively learned the collective intelligence of all the humans that provided the training data. For example, someone fluent in Chinese and familiar with Chinese software might not have viewed those login pages as untrustworthy, yet the aggregate model still captured that signal from the other raters. In this sense, the AI is smarter than any single human.

The AI was also more reliable than the humans. Remember, we asked 70 great testers this question, and only two ventured an answer. By that sample, any given tester has just a 2 in 70 (2.8%) chance of answering it at all. Even then, no single human figured out more than one aspect of trustworthiness, whereas the AI discovered three.

What’s more, the AI reflects how ‘real’ users view trustworthiness. The human testers try to emulate that assessment indirectly, through empathy, but they are not the actual target users. How much better is it to have the oracle be real-world users rather than a tester trying to reverse-engineer and guess at end-user qualitative assessments of quality?

The AI is also a whole lot less expensive, reusable for all app teams, and faster than humans. The trained AI takes about 200 milliseconds to measure trustworthiness.
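To illustrate why inference is so cheap, here is a hypothetical scoring function: once trained, rating a new page is just a forward pass over a handful of extracted features. The weights and bias below are illustrative stand-ins, not the article’s trained model, and in practice most of the ~200 ms would go to feature extraction (OCR, logo detection, element counting) rather than the network itself:

```python
# Hypothetical scoring function; the weights and bias are invented
# stand-ins, not the article's actual trained model.
def trustworthiness_score(foreign_char_ratio, has_brand_logo, element_count_scaled,
                          weights=(-5.0, 2.0, -3.0), bias=8.0):
    """Score a login screenshot's trustworthiness on a 1-10 scale."""
    raw = (weights[0] * foreign_char_ratio
           + weights[1] * has_brand_logo
           + weights[2] * element_count_scaled
           + bias)
    return max(1.0, min(10.0, raw))  # clamp to the rating scale

# A sparse, branded, English login page outscores a busy, unbranded,
# foreign-language one.
clean = trustworthiness_score(0.0, 1, 0.2)  # ~9.4
busy = trustworthiness_score(0.7, 0, 0.9)   # ~1.8
```

This forward pass is a few multiplications and additions, which is why the trained model can score pages in milliseconds for every app team that wants it.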

AI Can Teach Us Too

This was a day that AI beat 70 humans (and likely you) at what was commonly viewed as one of the tougher problems in testing — qualitative assessment. It is common for people to dismiss the advance of AI because they assume their particular profession would be difficult for AI to reproduce. But the reality is that AI can often do a better job than humans in many professional fields, including software engineering and testing. Testers will soon join the ranks of radiologists, lawyers, and truck drivers who have to deal with the uncomfortable truth that AI might just be better than they are at their jobs.

Watching AI beat humans at some of the most difficult software testing tasks was fascinating. But after some time considering what had happened, I realized this was also the first time an AI had taught me (and you!) something about testing and quality. We both now know that the more elements, check boxes, text, icons, etc. a login page has, the more likely users are to view it as less trustworthy. You didn’t know that a few minutes ago. If we hadn’t looked inside the AI’s brain, I’m not sure any human would know it today. Humans are now learning from AI, and that might be more significant in the long run than AI beating us at our own software testing games.

— Jason Arbon, CEO at