What The Kids’ Game “Telephone” Taught Microsoft About
Entertainment By Elena Boaghi | October 13, 2017
Can artificial intelligence be racist? Let’s say you’re an African-American student at a school that uses facial recognition software. The school uses it to control access to the building and to online homework assignments. But the software has a problem. Its makers used only light-skinned test subjects to train its algorithms. Your skin is darker, and the software has trouble recognizing you. Sometimes you’re late to class, or can’t get your assignments on time. Your grades suffer. The result is discrimination based solely on skin color.
This isn’t a real example, but similar AI missteps have already become infamous in the tech industry and in social media. The industry is excited about AI for good reason: big data and machine learning are lighting up powerful experiences that were unimaginable just a few years ago. But for AI to fulfill its promise, the systems must be trustworthy. The more trust people have, the more they interact with the systems, and the more data the systems can use to give better results. But trust takes a long time to build, and bias can tear it down instantaneously, doing real harm to large communities.
Recognizing exclusion in AI
Bias in AI will happen unless the system is built from the start with inclusion in mind. The most critical step in creating inclusive AI is to recognize where and how bias infects the system.
Our first inclusive design principle is to recognize exclusion. The guide we’re unveiling here breaks down AI bias into distinct categories so product creators can identify issues early on, anticipate future problems, and make better decisions along the way. It allows teams to see clearly where their systems can go wrong, so they can identify bias and build experiences that deliver on the promise of AI for everyone.
Five ways to identify bias
We worked with academic and industry thought leaders to determine five ways to identify bias. Then we used childhood situations, like playing “Telephone” or dress-up, as metaphors to illustrate the behavior in each category. Why? We can all relate to childhood episodes of bias, and the metaphor fits: AI is in its infancy, and, like a child, how it grows reflects how we raise and nurture it.
Each bias category includes a childhood metaphor that illustrates it, its definition, a product example, and a stress test for your teams and AI work. Here’s how the biases break down:
Dataset bias
A young child defines the world purely by the small amount she can see. Eventually, the child learns that most of the world lies beyond the small set of information that’s within her field of vision. This is the root of dataset bias: intelligence based on information that’s too small or homogeneous.
Definition: When the data used to train machine learning models doesn’t represent the diversity of the customer base. Large-scale data sets are the foundation of AI. At the same time, data sets have often been reduced to generalizations that don’t consider a variety of users and therefore underrepresent them.
Product example: Machine vision technologies — such as web cameras to track user movements — that only work well for small subsets of users based on race (predominantly white), because the initial training data excluded other races and skin tones.
Stress test: If you’re using a training data set, does that sample include everyone in your customer base? And if not, have you tested your results with people who weren’t part of your sample? What about the people on your AI teams — are they inclusive, diverse, and sensitive to recognizing bias?
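The first question in this stress test can be turned into a rough automated check. The sketch below is only an illustration: the function name `underrepresented_groups`, the `tolerance` parameter, and the skin-tone labels and customer shares are all hypothetical, and it assumes each training example already carries a group label.

```python
from collections import Counter


def underrepresented_groups(train_labels, customer_share, tolerance=0.5):
    """Flag groups whose share of the training data is below
    tolerance * their share of the customer base.

    train_labels: one group label per training example (assumed available).
    customer_share: mapping of group -> expected fraction of customers.
    """
    counts = Counter(train_labels)
    total = len(train_labels)
    flagged = {}
    for group, expected in customer_share.items():
        actual = counts.get(group, 0) / total if total else 0.0
        if actual < tolerance * expected:
            flagged[group] = (actual, expected)
    return flagged


# Hypothetical data: labels for 100 training images and an assumed
# breakdown of the customer base.
train = ["light"] * 90 + ["medium"] * 8 + ["dark"] * 2
customers = {"light": 0.5, "medium": 0.3, "dark": 0.2}

for group, (actual, expected) in underrepresented_groups(train, customers).items():
    print(f"{group}: {actual:.0%} of training data vs {expected:.0%} of customers")
```

A check like this only covers the data. It says nothing about whether results were tested with people outside the sample, or about the makeup of the team, which the stress test also asks about.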
Associations bias
Imagine some kids who like to play “doctor.” The boys want the doctor roles and assume the girls will play the nurses. The girls have to make their case to overturn assumptions. “Hey, girls can be doctors too!”
Definition: When the data used to train a model reinforces and multiplies a cultural bias. When training AI algorithms, human biases can make their way to machine learning. Perpetuating those biases in future interactions may lead to unfair customer experiences.
Product example: Language translation tools that make gender assumptions (e.g., pilots are male and flight attendants are female).
Stress test: Are your results making associations that perpetuate stereotypes in gender or ethnicity? What can you do to break undesirable and unfair associations? Is your data set already classified and labeled?
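One rough way to look for the kind of association this stress test describes is to count how often each occupation is paired with each pronoun in a system's output. The sketch below assumes those (occupation, pronoun) pairs have already been extracted; the function name `pronoun_skew`, the thresholds, and the sample data are hypothetical.

```python
from collections import defaultdict


def pronoun_skew(pairs, threshold=0.9, min_count=5):
    """Return occupations that are paired with one pronoun in at least
    `threshold` of their occurrences.

    pairs: iterable of (occupation, pronoun) tuples, assumed to be
    extracted from translated or generated text.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for occupation, pronoun in pairs:
        counts[occupation][pronoun] += 1

    skewed = {}
    for occupation, by_pronoun in counts.items():
        total = sum(by_pronoun.values())
        if total < min_count:
            continue  # too few observations to say anything
        pronoun, top = max(by_pronoun.items(), key=lambda kv: kv[1])
        if top / total >= threshold:
            skewed[occupation] = pronoun
    return skewed


# Hypothetical output collected from a translation tool.
observed = (
    [("pilot", "he")] * 10
    + [("flight attendant", "she")] * 9
    + [("flight attendant", "he")]
    + [("teacher", "he")] * 5
    + [("teacher", "she")] * 5
)
print(pronoun_skew(observed))
```

A skewed count is a prompt for a human to look closer, not proof of unfairness on its own; deciding which associations are undesirable remains a judgment call for the team.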
Automation bias
Imagine a girl getting a makeover. The girl likes sports, loves a natural look, and hates anything artificial. The beautician has different ideas about beauty, and applies tons of makeup and a fussy hairdo. The results make the beautician happy, but horrify the girl.
Definition: When automated decisions override social and cultural considerations. Predictive programs may automate goals that go against human diversity. The algorithms aren’t accountable to humans, but make decisions with human impact. AI designers and practitioners need to consider the goals of the people affected by the systems they build.
Product example: Beautification photo filters that impose a European notion of beauty on facial images, for example by lightening skin tone.
Stress test: Would real, diverse customers agree with your algorithm’s conclusions? Is your AI system overruling human decisions and favoring automated decision making? How do you ensure there’s a human POV in the loop?
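Keeping a human point of view in the loop can be as simple as refusing to act automatically on some decisions. The sketch below is one possible shape for that rule, not a prescribed design: the `Decision` type, the `decide` function, the `min_confidence` threshold, and the `high_impact` flag are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    label: str
    confidence: float
    needs_human_review: bool


def decide(label, confidence, min_confidence=0.9, high_impact=False):
    """Wrap a model output and mark it for human review when the model
    is unsure or when the decision is flagged as high impact."""
    needs_review = high_impact or confidence < min_confidence
    return Decision(label, confidence, needs_review)


# Hypothetical model outputs.
print(decide("apply_filter", 0.97))
print(decide("apply_filter", 0.62))
print(decide("reject_application", 0.99, high_impact=True))
```

The threshold and the definition of "high impact" are exactly where the goals of the people affected need to be considered; the code only enforces whatever the team decides.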
Interaction bias
A popular kids’ game is “Telephone.” The first person in a group whispers a sentence to the next person, who then whispers it to the next person, and so on, until the last person says what they heard. The point is to see how the information changes naturally through so many hand-offs. But say one kid changes it intentionally to create a more ridiculous result. It may be funnier, but the spirit of seeing what happens naturally is broken.
Definition: When humans tamper with AI and create biased results. Today’s chatbots can make jokes and fool people into thinking they’re human much of the time. But many attempts to humanize artificial intelligence have unintentionally tainted computer programs with toxic human bias. Interaction bias will appear when bots learn dynamically without safeguards against toxicity.
Product example: Humans deliberately input racist or sexist language into a chatbot to train it to say offensive things.
Stress test: Do you have checks in place to identify malicious intent toward your system? What does your AI system learn from people? Did you design for real-time interaction and learning? What does that mean for what it reflects back to customers?
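A basic safeguard for a bot that learns from conversations is to screen what people say before any of it is used for learning. The sketch below uses a simple blocklist purely for illustration; the function name `screen_for_learning` and the placeholder terms are hypothetical, and a production system would likely need a far more capable toxicity check than word matching.

```python
import string

# Placeholder terms standing in for a real, maintained list (assumption).
BLOCKED_TERMS = frozenset({"badword1", "badword2"})


def screen_for_learning(messages, blocked_terms=BLOCKED_TERMS):
    """Split user messages into those safe to learn from and those
    quarantined for human review."""
    accepted, quarantined = [], []
    for text in messages:
        # Normalize each word: strip surrounding punctuation, lowercase.
        words = {w.strip(string.punctuation).lower() for w in text.split()}
        if words & blocked_terms:
            quarantined.append(text)
        else:
            accepted.append(text)
    return accepted, quarantined


incoming = ["Hello there!", "You are a BADWORD1.", "What is the weather?"]
accepted, quarantined = screen_for_learning(incoming)
print("learn from:", accepted)
print("review:", quarantined)
```

Quarantining rather than silently discarding keeps a record of attempted abuse, which helps answer the stress-test question about identifying malicious intent.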
Confirmation bias
Think of the kid who gets a toy dinosaur for a present one year. Other family members see the dinosaur and give him more dinosaurs. In several years, friends and family assume the kid is a dinosaur fanatic, and keep giving more dinosaurs until he has a huge collection.
Definition: When oversimplified personalization makes biased assumptions for a group or an individual. Confirmation bias interprets information in a way that confirms preconceptions. AI algorithms serve up content that matches what other people have already chosen. This excludes results from people who made less popular choices. A knowledge worker who is only getting information from the people who think like her will never see contrasting points of view and will be blocked from seeing alternatives and diverse ideas.
Product example: Shopping sites that show recommendations for things the customer has already bought.
Stress test: Does your algorithm build on and reinforce only popular preferences? Is your AI system able to evolve dynamically as your customers change over time? Is your AI system helping your customers to have a more diverse and inclusive view of the world?
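One way to keep a recommender from serving only more of the same is to drop items the customer already owns and reserve a slot or two for less popular choices. The sketch below assumes a ranked list and popularity scores already exist; the function name `diversify`, its parameters, and the toy catalog are hypothetical.

```python
def diversify(ranked_items, purchased, popularity, k=3,
              explore_slots=1, popular_cutoff=0.5):
    """Pick k recommendations, skipping purchased items and reserving
    `explore_slots` positions for items below `popular_cutoff`."""
    candidates = [item for item in ranked_items if item not in purchased]
    niche = [item for item in candidates
             if popularity.get(item, 0.0) < popular_cutoff]

    # Start with the top of the ranking, leaving room to explore.
    picks = candidates[: max(k - explore_slots, 0)]

    # Fill the reserved slots with less popular items, then fall back
    # to the main ranking if there are not enough of them.
    for source in (niche, candidates):
        for item in source:
            if len(picks) >= k:
                break
            if item not in picks:
                picks.append(item)
    return picks


# Hypothetical catalog for the dinosaur-loving kid.
ranked = ["dino_a", "dino_b", "dino_c", "dino_d", "telescope", "paint_set"]
owned = {"dino_a"}
scores = {"dino_b": 0.9, "dino_c": 0.8, "dino_d": 0.7,
          "telescope": 0.2, "paint_set": 0.1}
print(diversify(ranked, owned, scores))
```

How many slots to reserve is a product decision; the point of the sketch is only that diversity has to be designed in, since a pure popularity ranking will never surface the telescope.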
Using this primer
As designers and creators of artificial intelligence experiences, it’s on us to be thoughtful about how AI evolves and how it impacts real people. This primer is the start of a long road to create experiences that serve everyone equally.
If we apply these ideas to our initial example of the African-American student misread by the facial recognition software, we can label that as dataset bias: The software was trained with data that was too narrow. By recognizing and understanding those biases from the start, we can test the system against other human considerations, and build more inclusive experiences. Could our facial recognition software be subject to deliberately erroneous data? What other biases could infect the experience?
Most people working in AI have anecdotal evidence of situations like these: embarrassing, offensive outcomes from unintentional bias that we all want to identify and avoid. Our goal here is to help you recognize the underlying bias that leads to these situations. Start with these categories and test your experience with these types of bias in mind, so you can focus on delivering the potential of AI to all your customers.