Quantifying the Risk of AI Bias

A Testing Perspective of Unwanted AI Bias

Bias refers to prejudice in favor of or against one thing, person, or group compared with another, usually in a way considered to be unfair.

The World is Filled with Bias

A quick search on bias reveals a list of nearly 200 cognitive biases that psychologists have classified based on human beliefs, decisions, behaviors, social interactions, and memory patterns. Certainly, recent events stemming from racial inequality and social injustice are raising greater awareness of the biases that exist in the world today. Many would argue that our social and economic system is not designed to be fair and is even engineered in a way that marginalizes specific groups and benefits others. However, before we can improve such a system, we first have to be able to identify and measure where and to what degree it is unfairly biased.

Since the world is filled with bias, it follows that any data we collect from it contains biases. If we then take that data and use it to train AI, the machines will reflect those biases. So how then do we start to engineer AI-based systems that are fair and inclusive? Is it even practical to remove bias from AI-based systems, or is it too daunting of a task? In this article, we explore the world of AI bias and take a look at it through the eyes of someone tasked with testing the system. More specifically, we describe a set of techniques and tools for preventing and detecting unwanted bias in AI-based systems and quantifying the risk associated with it.

Not All Bias is Created Equally

While there is definitely some irony in this heading, one of the first things to recognize when designing AI-based systems is that there will be bias, but not all bias necessarily results in unfairness. In fact, if you examine the definition of bias carefully, the phrase “usually in a way considered to be unfair” implies that although bias generally carries a negative connotation, it isn’t always a bad thing. Consider any popular search engine or recommendation system. Such systems typically use AI to predict user preferences. Such predictions can be viewed as a bias in favor of or against some items over others. However, if the problem domain or target audience calls for such a distinction, it represents desired system behavior as opposed to unwanted bias. For example, it is acceptable for a movie recommendation system for toddlers to only display movies rated for children ages 1–3. However, it would not be acceptable for that system to only recommend movies preferred by male toddlers when the viewers could also be female. To avoid confusion, we typically refer to the latter as unwanted or undesired bias.

The AI Bias Cycle

A recent survey on bias and fairness in machine learning by researchers at the University of Southern California's Information Sciences Institute defines several categories of bias in data, algorithms, and user interactions. These categories are summed up in the cycle depicted in Figure 1 and can be described as follows:

Figure 1. The Bias Cycle in AI and Machine Learning Systems

  1. Data Bias: The cycle starts with the collection of real-world data that is inherently biased due to cultural, historical, temporal, and other reasons. Sourced data is then sampled for a given application, which can introduce further bias depending on the sampling method and size (a quick representation check is sketched after this list).

  2. Algorithmic Bias: The design of the training algorithm itself or the way it is used can also result in bias. These are systematic and repeatable errors that cause unfair outcomes such as privileging one set of users over others. Examples include popularity, ranking, evaluation, and emergent bias.

  3. User Interaction Bias: Both the user interface and user can be the source of bias in the system. As such, care should be taken in how user input, output, and feedback loops are designed, presented, and managed. User interactions typically produce new or updated data that contains further bias, and the cycle repeats.
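To make the data and sampling step concrete, here is a minimal sketch, assuming hypothetical file names (population.csv, training_sample.csv) and sensitive attributes (gender, age_group, region), that compares group representation in a training sample against the source population. Large gaps are a warning sign of sampling or representation bias.

```python
# A minimal sketch (hypothetical file and column names) of a representation check:
# compare group proportions in the training sample against the source population.
import pandas as pd

population = pd.read_csv("population.csv")       # assumed full source data
sample = pd.read_csv("training_sample.csv")      # assumed sampled training data

for col in ["gender", "age_group", "region"]:    # assumed sensitive attributes
    pop_share = population[col].value_counts(normalize=True)
    sample_share = sample[col].value_counts(normalize=True)
    # Align categories so a group missing from the sample counts as a gap
    # equal to its share of the population.
    gap = (sample_share.reindex(pop_share.index, fill_value=0) - pop_share).abs()
    print(f"{col}: largest representation gap = {gap.max():.1%}")
```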

Resources on Bias in AI

Interested in learning more about unwanted AI bias and the bias cycle? Check out these video resources by Ricardo Baeza-Yates, Director of Graduate Data Science Programs at Northeastern University and former CTO of NTENT. In the first video, Baeza-Yates does a great job of introducing bias and explaining the bias cycle in less than four minutes. In the second video, he takes a deeper dive into data and algorithmic bias, providing several real-world examples of the different types of bias. Baeza-Yates is clearly an expert in the field, and I highly recommend that you check out his Google Scholar profile for additional resources and publications on this topic.

Bias on the Web: A Quick Introduction with Ricardo Baeza-Yates.

Data and Algorithmic Bias on the Web: A Deeper Dive with Ricardo Baeza-Yates.

During last year's Quest for Quality conference, I had the pleasure of meeting Davar Ardalan, founder and storyteller in chief of IVOW. Her recent post, "AI Fail: To Popularize and Scale Chatbots We Need Better Data," has a list of resources on different topics related to AI bias.

Unwanted AI Bias: A Testing Perspective

Having spent my career studying and practicing the discipline of software testing, I believe the testing community has a clear role to play in the engineering of AI-based systems. More specifically, on the issue of AI bias, testers have the skills needed to directly contribute to tackling the problem of unwanted bias in AI systems.

Shortly after my friend and colleague Jason Arbon gave a keynote at PNSQC 2019 on testing AI and bias and released a free e-book on the topic, we started brainstorming about what testers can bring to the table today to help with the challenge of unwanted AI bias. Here are the answers we arrived at: testing heuristics for preventing and detecting AI bias, and a quantitative tool for assessing the risk of unwanted bias in AI systems. As life would have it, nearly a year later we're only just getting around to putting these ideas out into the community.

AI Bias Testing Heuristics

It's a myth that testers don't like shortcuts. Testers actually love shortcuts — just not the kind that compromise quality. Shortcuts that take complex testing problems and reduce them to simpler judgments are welcomed with open arms. That is exactly what testing heuristics are: cognitive shortcuts that help us solve problems while testing software. We provide two heuristic-based artifacts to support testing AI for unwanted bias: a set of mnemonics and a questionnaire checklist.

Mnemonics for Testing AI Bias

If you've forgotten them, mnemonics are memory tools! Just kidding :) But seriously, mnemonics help our brains package information, store it safely, and retrieve it at the right moment. To this day, I still recall many of the mnemonics I learned in school, such as Never Eat Shredded Wheat for remembering the cardinal points and BODMAS or PEMDAS for the order of mathematical operations.

As a starting point for developing a set of practical techniques for testing AI systems, we’ve created seven mnemonics to help engineers remember the key factors associated with unwanted AI bias. These mnemonics are displayed graphically over the AI bias cycle in Figure 2 and are described as follows:

Figure 2: 7 Mnemonics for Testing AI Bias

Mnemonic #1: DAUNTS
The challenge of testing AI for bias can seem daunting, so it is only fitting that DAUNTS is our first mnemonic. It is a reminder of the top-level categories in the AI bias cycle — Data, Algorithm, User iNTeraction, and Selection/Sampling.

Mnemonic #2: CHAT
Since the 1970s, the real-time transmission of text has been a signature of the Internet. CHAT is meant to remind us of the biases in data sourced from the web — Cultural, Historical, Aggregation, and Temporal.

Mnemonic #3: Culture > Language + Geography
To break the monotony of all the acronyms, this mnemonic takes the form of a math equation. Actually, it's more of an acronym hidden in an equation: in natural language, Culture is GREATER than Language and Geography. This mnemonic represents all the sub-types of cultural bias, which in addition to language and geography include seven other aspects of humanity — Gender, Race, Economics, Age, Tribe, Education, and Religion.

Mnemonic #4: SMS
We’ve repurposed the well-known mobile acronym SMS to help refine the sampling bias category by indicating the need to check for diversity in data sources and appropriate sampling — Sources, Sampling Method, and Size.

Mnemonic #5: MOV
Reminiscent of both the QuickTime movie file type and the machine instruction that moves data from one location to another, MOV gives us an easy way to remember the types of selection bias — Measurement and Omitted Variable.

Mnemonic #6: A PEER
Inspired by the peer-to-peer (P2P) architecture made popular by the music sharing application Napster, A PEER encompasses five key biases — Algorithmic, Popularity, Evaluation, Emergent, and Ranking.

Mnemonic #7: Some People Only Like Buying Cool Products
A set of testing mnemonics would not be complete without a good rhyme that feels a bit random and unscripted. This final mnemonic for the types of user interaction bias is just that — Social, Presentation, Observer, Linking, Behavioral, Cause-Effect, and Production.

It should be noted that the aforementioned sub-categories of bias are heavily intertwined, and do not necessarily fit cleanly into the separate boxes as depicted in Figure 2. The goal is to place them where they have the most impact and relevance in your problem domain or application space.

Questionnaire Checklist for Testing AI Bias

Good testers ask questions, but great testers seem to know the right questions to ask and where to look for the answers. This is what makes the questionnaire checklist a useful tool for understanding and investigating software quality. Such artifacts pose questions whose answers reveal whether desirable attributes of the product or process have been met.

Based on our experiences testing AI systems, we have created a questionnaire checklist. The goal of the questionnaire is to ensure that people building AI-based systems are aware of unwanted bias, stability, or quality problems. If an engineer cannot answer these questions, it is likely that the resulting system contains unwanted, and possibly even legally liable, forms of bias.

Questionnaire Checklist for Testing AI and Bias
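To show how such a questionnaire can be operationalized, here is a minimal sketch with hypothetical questions, grouping yes/no prompts under the mnemonic categories from the previous section and flagging any item the team cannot answer with a clear yes:

```python
# A minimal sketch (hypothetical questions) of a bias-testing checklist grouped
# by the mnemonic categories; unanswered or negative items are flagged for follow-up.
checklist = {
    "CHAT (data bias)": [
        "Have you identified the cultural and historical context of the source data?",
        "Is the data recent enough for the problem domain (temporal bias)?",
    ],
    "SMS (sampling bias)": [
        "Are the data sources diverse enough for the target population?",
        "Are the sampling method and sample size appropriate?",
    ],
    "A PEER (algorithmic bias)": [
        "Have you evaluated the model for popularity and ranking bias?",
    ],
}

answers = {  # responses collected from the engineering team (assumed)
    "Have you identified the cultural and historical context of the source data?": "yes",
    "Are the data sources diverse enough for the target population?": "no",
}

for category, questions in checklist.items():
    for question in questions:
        status = answers.get(question, "unanswered")
        flag = "  <-- follow up" if status != "yes" else ""
        print(f"[{category}] {question} -> {status}{flag}")
```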

AI Bias Risk Assessment Tool

Although the mnemonics and questionnaire are a good start, let’s take it a step further and bring another contribution to the table — one that undeniably spells T-E-S-T-E-R. Surely nothing spells tester better than R-I-S-K. After all, testing is all about risk. One of the main reasons we test software is to identify risks with the release. Furthermore, if a decision is made to not test a system or component, then we’re probably going to want to talk to stakeholders about the risks of not testing.

Risk is anything that threatens the success of a project. As testers, we are constantly trying to measure and communicate quality and testing-related risks. It is clear that unwanted bias poses several risks to the success of AI, and therefore we are happy to contribute a tool for assessing the risk of AI bias.

The idea behind the tool is that, like the questionnaire checklist, it captures responses to questions about the characteristics of the data (including its sampling and selection process), the machine learning algorithm, and the user interaction model. However, as responses are entered into the system, it also quantifies the risk of unwanted bias.

Introducing the AI BRAT

Our first version of the AI bias risk calculation tool was a quick and easy Google Sheets template. However, Dionny Santiago and the team at test.ai have transformed it from template to tool and launched a mobile-friendly web application codenamed AI BRAT — the AI Bias Risk Assessment Tool.

To promote learning and application of the heuristics described in this article, questions in AI BRAT are grouped and ordered according to the mnemonics. Definitions for each type of bias appear below each sub-heading, and tooltips with examples can be viewed by hovering over or tapping on the question mark icon to the right.

Expanding a sub-heading reveals questions associated with the considerations made for detecting and/or mitigating each type of bias and the likelihood of it occurring in the dataset. AI BRAT tracks answered questions from each section to ensure each type of bias is being covered.

Risk Calculation

As questions are answered, each response is assigned severity (impact) and likelihood (probability) values as follows:

Severity/Impact
I do not know or have not considered this type of bias (3 Points)
I have discovered unwanted bias and am unable to mitigate it (3 Points)
I have discovered unwanted bias but implemented bias mitigation (2 Points)
I have not discovered unwanted bias or determined it is acceptable (1 Point)

Likelihood/Probability
Likely (3 Points), Somewhat Likely (2 Points), Not Likely (1 Point)

AI BRAT then calculates a risk score using a 3x3 risk matrix, multiplying the likelihood and severity values and classifying the result as high, medium, or low.
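The scoring itself is simple enough to sketch. The snippet below is not AI BRAT's actual implementation, and the high/medium/low cut-offs are assumptions, but it illustrates how a 3x3 matrix turns a response pair into a score between 1 and 9 and a percentage of the maximum:

```python
# A minimal sketch (not AI BRAT's actual code) of the 3x3 risk-matrix scoring
# described above: risk = severity x likelihood, then normalized and bucketed.
SEVERITY = {
    "unknown_or_not_considered": 3,
    "found_and_unmitigated": 3,
    "found_but_mitigated": 2,
    "not_found_or_acceptable": 1,
}
LIKELIHOOD = {"likely": 3, "somewhat_likely": 2, "not_likely": 1}

def risk_score(severity_key: str, likelihood_key: str) -> dict:
    """Return the raw score, percentage of maximum, and risk level for one bias type."""
    score = SEVERITY[severity_key] * LIKELIHOOD[likelihood_key]  # 1..9
    # Cut-offs are assumed; a typical 3x3 matrix maps 1-2 low, 3-4 medium, 6-9 high.
    if score >= 6:
        level = "high"
    elif score >= 3:
        level = "medium"
    else:
        level = "low"
    return {"score": score, "percent": round(100 * score / 9), "level": level}

print(risk_score("unknown_or_not_considered", "likely"))    # {'score': 9, 'percent': 100, 'level': 'high'}
print(risk_score("not_found_or_acceptable", "not_likely"))  # {'score': 1, 'percent': 11, 'level': 'low'}
```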

Interpreting, Using, and Saving the Results

AI BRAT is initialized with a risk score of 100%; in other words, it starts by assuming the maximum risk of unwanted bias, and each section is highlighted red. As users respond to each question, the goal is to drive the bias risk score down until each section turns green, or until the value is as low as possible (in this case 11%, which is 1 out of 9).

Results can also be saved to a report using the Export to PDF button at the top-right. Check out AI BRAT today at https://bias.test.ai and let us know what you think.

Other Resources and Tools on Testing AI Bias

Technology giants Google, Microsoft, and IBM have all developed tools and guides for testing AI for bias and/or fairness.

Google’s What-If Tool

In this Google AI Blog post, James Wexler describes the What-If Tool, a feature of the open-source TensorBoard web application that facilitates visually probing the behavior of trained machine learning models. Watch the video below for an introduction to the tool, including an overview of its major features.

Introducing the What-If Tool for Visually Probing Trained Machine Learning Models
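For notebook users, launching the What-If Tool typically takes only a few lines. The sketch below assumes an already-trained TensorFlow estimator named classifier, a matching feature_spec, and a list of tf.Example records named test_examples; treat it as an outline of the documented notebook pattern rather than a drop-in recipe.

```python
# A minimal sketch (assumed classifier, feature_spec, and test_examples) of
# embedding the What-If Tool in a Jupyter notebook.
from witwidget.notebook.visualization import WitConfigBuilder, WitWidget

config_builder = (
    WitConfigBuilder(test_examples)  # a list of tf.Example protos to explore
    .set_estimator_and_feature_spec(classifier, feature_spec)
)
WitWidget(config_builder, height=800)  # renders the interactive widget in the notebook
```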

Microsoft’s Fairlearn Toolkit

Microsoft is tackling bias in machine learning through its new open-source Fairlearn toolkit. Fairlearn is a Python package that enables ML engineers to assess their system's fairness and mitigate observed unfairness issues. It contains mitigation algorithms as well as a Jupyter widget for model assessment. Besides the source code, the Fairlearn repository also contains Jupyter notebooks with usage examples. In the video below, Mehrnoosh Sameki, Senior Product Manager at Azure AI, takes a deep dive into the latest developments in Fairlearn.
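As a taste of the API, here is a minimal sketch, assuming you already have arrays y_true and y_pred from your own model and a sensitive-feature column such as gender, that uses Fairlearn's MetricFrame to break a metric down by group and compute a demographic parity gap:

```python
# A minimal sketch (assumed y_true, y_pred, and sensitive arrays) of a
# Fairlearn fairness assessment: per-group accuracy plus demographic parity.
from sklearn.metrics import accuracy_score
from fairlearn.metrics import MetricFrame, demographic_parity_difference

metrics = MetricFrame(
    metrics=accuracy_score,
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sensitive,  # e.g. df["gender"]
)
print(metrics.by_group)      # accuracy broken down per group
print(metrics.difference())  # largest accuracy gap between groups

dpd = demographic_parity_difference(y_true, y_pred, sensitive_features=sensitive)
print(f"Demographic parity difference: {dpd:.3f}")  # 0 means equal selection rates
```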

IBM’s AI Fairness 360 Toolkit

The Trusted AI group at IBM Research has released AI Fairness 360 — an open-source toolkit that helps you examine, report on, and mitigate discrimination and bias in machine learning models throughout the AI application lifecycle. The toolkit contains over 70 fairness metrics and 10 state-of-the-art bias mitigation algorithms developed by the research community. Check out a quick demo of AI Fairness 360 below.
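Here is a minimal sketch of that workflow, assuming a pandas DataFrame df with a binary hired label and a binary gender protected attribute, that measures disparate impact and then applies the Reweighing pre-processing algorithm to mitigate it:

```python
# A minimal sketch (assumed DataFrame df with "hired" label and "gender" attribute)
# of measuring and mitigating bias with AI Fairness 360.
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing import Reweighing

dataset = BinaryLabelDataset(
    df=df,
    label_names=["hired"],
    protected_attribute_names=["gender"],
)

privileged = [{"gender": 1}]
unprivileged = [{"gender": 0}]

metric = BinaryLabelDatasetMetric(
    dataset, privileged_groups=privileged, unprivileged_groups=unprivileged
)
print("Disparate impact before mitigation:", metric.disparate_impact())

# Reweighing adjusts instance weights so favorable outcomes are independent of the group.
rw = Reweighing(unprivileged_groups=unprivileged, privileged_groups=privileged)
transformed = rw.fit_transform(dataset)

metric_after = BinaryLabelDatasetMetric(
    transformed, privileged_groups=privileged, unprivileged_groups=unprivileged
)
print("Disparate impact after mitigation:", metric_after.disparate_impact())
```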

What’s Next?

Wondering how to get involved? Here are some ways we believe folks can have an impact on testing AI and bias:

  1. Devising practical testing methods and processes for preventing and detecting unwanted bias in datasets.

  2. Developing new coverage models and static/dynamic analysis tools for validating AI and ML systems.

  3. Mastering and contributing to the existing open-source toolkits for measuring fairness and detecting/mitigating unwanted AI bias.

References

  1. Bias on the Web. R. Baeza-Yates. Communications of the ACM, Vol. 61, No. 6.

  2. A Survey on Bias and Fairness in Machine Learning. N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan.

  3. Testing AI and Bias. J. Arbon.

  4. Software Testing Heuristics: Mind the Gap!. R. Bradshaw, and S. Deery.

  5. The Quest for Quality in AI. D. Ardalan, T. M. King, N. Chelvachandran, K. Obring, Y. Sulaiman, J. Farrier, L. Zubyte, J. Jerina, and R. Mugri.

  6. 7 Types of Data Bias in Machine Learning. H. Lim.
