What Science Really Says About Facial Recognition Accuracy and Bias Concerns
Throughout the United States, a number of activists are demanding that lawmakers ban the use of facial recognition technology. Such an extreme position demands a well-grounded justification. In many cases, the sole reason given for banning the use of facial recognition is the claim that photo matching technology is inherently inaccurate with respect to photos of women and minorities.
Surely, a technology used in identification processes with significant outcomes, such as law enforcement investigations, should perform consistently across demographic groups. But does the science really support the claims of inherent “bias” by ban proponents? What is the evidence being cited for this claim, and does it add up?
First, we must understand that no biometric identification technology is accurate 100% of the time and there will always be an error rate, however small. There are different factors that can affect accuracy for each modality, like fingerprint, iris and facial recognition. None are perfect, but they are constantly being improved by technology developers.
While there is evidence that some, especially older, versions of facial recognition technology have struggled to perform consistently across various demographic factors, the oft-repeated claim that it is inherently less accurate in matching photos of Black and female subjects simply does not reflect the current state of the science. In fact, the evidence most cited by proponents of banning facial recognition technology is either irrelevant, obsolete, nonscientific or misrepresented. Let’s take a look.
The Problem With MIT’s “Gender Shades” Study
By far, the source most cited by media and policymakers as evidence of bias in facial recognition is Gender Shades, a paper published by a grad student researcher at MIT Media Lab in 2018. Heralded in many media reports as “groundbreaking” research on the bias in facial recognition, the paper is frequently cited as showing that facial recognition software “misidentifies” dark-skinned women nearly 35% of the time. But there’s a problem: Gender Shades evaluated demographic-labeling algorithms, not facial recognition. Specifically, the study evaluated technology that analyzes demographic characteristics (is this a man or a woman?), which is distinctly different from facial recognition algorithms that match photos for identification (is this the same person?).
A second problem is that the algorithms evaluated were those publicly available in 2017 (now quite old given the pace of innovation in computer vision). Basically, what we learn from the Gender Shades study is that IBM software and several others were not very good at face/gender classification when tested in 2017, and IBM immediately challenged even that limited result, saying that its replication of the study suggested an error rate of 3.5%, not 35%. The report says nothing about the accuracy of facial recognition technology, yet it is often the sole data point cited in claims that facial recognition is rampantly inaccurate when it comes to persons of color. Misconstruing demographic labeling technology as facial recognition has continued with false claims that the technology is “even less reliable identifying transgender individuals and entirely inaccurate when used on nonbinary people,” based on tests of classification software that again do not involve identification.
ACLU’s Intentionally Skewed 2018 Test
Also frequently cited by critics of facial recognition technology is a 2018 blog post by the American Civil Liberties Union (ACLU) regarding a test it claimed to have performed using Amazon Rekognition, a commercially available cloud-based tool that includes facial recognition. According to the ACLU, it created a database of 25,000 publicly available images and ran a search against official photos of the 535 members of Congress, returning “false matches” for 28 of them. The ACLU claimed that since 11 of these matches, or 40%, were people of color, and only 20% of Congress overall are people of color, this is evidence of racial bias in facial recognition systems; however, the search results were returned using a “confidence level” of only 80%, which returns more possible matches with lower similarity scores. Even if it replicated typical law enforcement investigative uses by having the software return to the operator a set number of top-scoring potential matches, ACLU does not report the ranking of the nonmatching photos, or whether the matching photos were also returned in the search.
This nonscientific test is misleading because it was clearly performed and interpreted to provide a desired result. Amazon later conducted a similar test of the software using a 99% confidence threshold (similar to actual law enforcement investigative usage) against a vastly larger and more diverse data set of 850,000 images, reporting that the software returned zero “false matches” for members of Congress. However, the ACLU has since published additional results based on state legislator and celebrity photos, using the same flawed test methods.
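The effect of the confidence threshold described above can be illustrated with a minimal sketch. The candidate names and similarity scores below are invented for illustration, and `matches_at_threshold` is a hypothetical helper, not part of any vendor’s API. The sketch only shows the general principle that a lower threshold returns more candidates with lower similarity scores.

```python
# Hypothetical similarity scores (0-100) returned for a single probe photo.
# These values are invented for illustration; they are not real test data.
candidate_scores = {
    "candidate_a": 99.4,  # assumed to be the true match
    "candidate_b": 91.0,
    "candidate_c": 86.5,
    "candidate_d": 80.2,
    "candidate_e": 74.9,
}


def matches_at_threshold(scores, threshold):
    """Return candidate names scoring at or above the threshold, best first."""
    kept = [(name, score) for name, score in scores.items() if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in kept]


loose = matches_at_threshold(candidate_scores, 80.0)   # 80% threshold
strict = matches_at_threshold(candidate_scores, 99.0)  # 99% threshold

# Share of the 28 reported "false matches" that were people of color.
reported_share = 11 / 28  # a little over 39%, stated as 40% in the text

print(len(loose), loose)
print(len(strict), strict)
print(f"{reported_share:.1%}")
```

With these made-up scores, the 80% threshold returns four candidates while the 99% threshold returns only the top-scoring one, which is why the choice of threshold matters so much when interpreting a count of “false matches.”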
A 2012 FBI Study Analyzing Now-Obsolete Algorithms
A nearly decade-old study involving researchers from both the Federal Bureau of Investigation and the biometrics industry has also been frequently cited to support bias claims. In this evaluation, several algorithms available at the time were 5-10% less likely to retrieve a matching photo of a Black individual from a database compared to other demographic groups; however, the now-obsolete algorithms tested predate the deep learning technologies enabling the thousand-fold increase in facial recognition accuracy since that time. For example, 10 years ago, accuracy was measured in errors per thousand candidates versus per million today – Stone Age versus Space Age tools.
2019 NIST Demographic Effects Report
For the past 20 years, the National Institute of Standards and Technology (NIST) Facial Recognition Vendor Test (FRVT) program has been the world’s most respected evaluator of facial recognition algorithms, examining technologies voluntarily provided by developers for independent testing and publication of results. But even NIST’s most significant work has been continually misrepresented in policy debates.
In 2019, NIST published its first comprehensive report on the performance of facial recognition algorithms specifically across race, gender and other demographic groups. Importantly, the report found that the leading top-tier facial recognition technologies had “undetectable” differences in accuracy across racial groups, after rigorous tests against millions of images. Many of the same suppliers are also relied upon for the most well-known U.S. government applications, including the FBI Criminal Justice Information Services Division and U.S. Customs and Border Protection’s (CBP’s) Traveler Verification Service.
Lower-performing algorithms among the nearly 200 tested by NIST, many of them in various stages of R&D rather than operationally deployed commercial products, did show measurable differences of several percentage points in accuracy across demographics. Critics completely ignored results that showed demographic differences are a solved problem for the highest-performing algorithms, seizing upon the difference with the lowest performers and claiming the report showed “African American people were up to 100 times more likely to be misidentified than white men.”
It has further become clear that outliers from fraudulent Somalian data are responsible for much of the reported difference. This part of the evaluation relied on foreign visa application data, and it was later learned that data from Somalia included rampant visa fraud. Higher levels of fraud in the data set mean that images of the same individual are erroneously labeled as belonging to different people. As a result, the report found a much higher match error for Somalian persons, when in reality the algorithms were properly identifying identity fraud wherein the same person is listed under multiple names. This information helps explain why, outside data from Somalia, nearly all other country-to-country comparisons across algorithms yielded much higher accuracy rates in the report (see SIA’s analysis) and why data from Somalia is no longer included in NIST’s ongoing evaluation (see below).
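The mechanism described above, where correct matches are scored as errors because the labels are wrong, can be sketched with simple arithmetic. All numbers below are hypothetical and chosen only to show the direction and rough scale of the effect. They are not drawn from the NIST report, and `measured_fmr` is an illustrative helper.

```python
def measured_fmr(n_labeled_impostor, true_fmr, n_mislabeled_same_person, tar=1.0):
    """False match rate as an evaluator would measure it from the labels.

    n_labeled_impostor: comparisons labeled as "different people"
    true_fmr: the algorithm's real false match rate on truly different people
    n_mislabeled_same_person: comparisons labeled "different" that are
        actually the same person (for example, one person under two names)
    tar: share of same-person comparisons the algorithm correctly matches
    """
    n_truly_different = n_labeled_impostor - n_mislabeled_same_person
    genuine_errors = true_fmr * n_truly_different   # real algorithm mistakes
    flagged_fraud = tar * n_mislabeled_same_person  # correct matches counted as errors
    return (genuine_errors + flagged_fraud) / n_labeled_impostor


# Hypothetical scenario: one million comparisons labeled "different people".
clean = measured_fmr(1_000_000, true_fmr=0.0001, n_mislabeled_same_person=0)
noisy = measured_fmr(1_000_000, true_fmr=0.0001, n_mislabeled_same_person=5_000)

print(f"clean labels: {clean:.6f}")
print(f"0.5% mislabeled: {noisy:.6f}")
print(f"apparent inflation: {noisy / clean:.0f}x")
```

In this assumed scenario, mislabeling just 0.5% of the comparisons makes the measured error rate roughly 50 times higher, even though the algorithm itself is unchanged.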
Ongoing NIST Facial Recognition Vendor Test Program
The 2019 NIST demographic report provided a moment-in-time snapshot of facial recognition algorithm performance, now two years old. What does scientific research say about the performance of facial recognition technology today?
NIST’s FRVT Ongoing series releases up-to-date analysis on a monthly basis, which surprisingly contradicts the 2019 demographic report. In fact, accuracy among subdemographics is very closely balanced, and if anything, the white male subdemographic shows the lowest accuracy, not the highest.
According to data from the most recent evaluation from June 28, each of the top 150 algorithms is over 99% accurate across Black male, white male, Black female and white female demographics. For the top 20 algorithms, accuracy of the highest performing demographic versus the lowest varies only between 99.7% and 99.8%. Unexpectedly, white male is the lowest performing of the four demographic groups for the top 20 algorithms. For 17 of these algorithms, accuracy rates for white female, Black male and Black female are nearly identical at 99.8%, while the algorithms are least accurate for the white male demographic at 99.7%. (See data beginning with figure 105 on page 154. For simplicity, accuracy is stated here as the true accept rate (TAR) at a set 0.01% false accept rate (FAR), the scientific measurement of biometric performance on the ability of the software to successfully match photos. Note that TAR is the complement of the false nonmatch rate (TAR = 1 − FNMR), while FAR corresponds to the false match rate.)
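For readers unfamiliar with the metric, the following is a minimal sketch of how a true accept rate at a fixed false accept rate can be computed from comparison scores. The scores are toy values invented for illustration, and `tar_at_far` is a hypothetical helper rather than NIST’s evaluation code. A real test at a 0.01% FAR would need far more than the ten impostor comparisons used here.

```python
def tar_at_far(genuine_scores, impostor_scores, target_far):
    """Return (threshold, TAR, FAR) at the most lenient threshold meeting target_far.

    A comparison is accepted when its score is >= threshold.
    genuine_scores: scores for pairs of photos of the same person
    impostor_scores: scores for pairs of photos of different people
    Returns None if no observed score works as a threshold.
    """
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    for t in thresholds:  # FAR can only fall as the threshold rises
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        if far <= target_far:
            tar = sum(s >= t for s in genuine_scores) / len(genuine_scores)
            return t, tar, far
    return None


# Toy scores, invented for illustration only.
genuine = [0.91, 0.88, 0.95, 0.60, 0.97]
impostor = [0.10, 0.20, 0.35, 0.62, 0.15, 0.05, 0.30, 0.25, 0.40, 0.12]

lenient = tar_at_far(genuine, impostor, target_far=0.10)
strict = tar_at_far(genuine, impostor, target_far=0.0)

for label, (threshold, tar, far) in (("FAR <= 10%", lenient), ("FAR = 0%", strict)):
    # The false nonmatch rate is the complement of TAR.
    print(f"{label}: threshold={threshold}, TAR={tar:.2f}, FNMR={1 - tar:.2f}, FAR={far:.2f}")
```

The sketch shows the trade-off behind the reported figures: tightening the allowed false accept rate raises the threshold, which can only hold the true accept rate steady or lower it. Comparing algorithms at the same fixed FAR is what makes their TAR values comparable across demographic groups.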
Furthermore, FRVT Ongoing uses mugshot data from US law enforcement records, which has firmly established ground truth (accurately labeled data), in contrast to the 2019 demographic report’s reliance on foreign governments to supply visa application data, which as described in the case of Somalia can be unreliable. In other words, NIST FRVT Ongoing’s finding of an absence of demographic bias is both more up to date and based on more accurate data than the 2019 demographic report.
NIST research has documented massive improvements in overall accuracy in recent years, noting even in 2018 that the software tested was at least 20 times more accurate than it was in 2014, and in 2019 finding “close to perfect” performance by high-performing algorithms with “miss rates” against a database of 12 million images averaging 0.1%. On this measurement, the accuracy of facial recognition is reaching that of automated fingerprint comparison, which is generally viewed as the gold standard for identification.
Next Steps to Addressing Accuracy Concerns
While no method of scientifically testing the accuracy of facial recognition algorithms is without limitations, so far the science shows that to the extent accuracy might vary across demographic groups (i.e., “bias”), the highest-performing algorithms do not have such an issue. At the same time, it is also clear that much more thorough, frequent scientific research, testing and evaluation of facial recognition technologies is necessary to both validate accuracy gains and provide tools to developers to ensure performance is consistent.
The market for facial recognition technology is global and extremely competitive, with suppliers continually working to provide technology that is as effective and accurate as possible across all types of uses, deployment settings and demographic characteristics. A product cannot be competitive if it is developed using nondiverse data and its accuracy is inconsistent. Algorithms used in border applications are a good example. For an application to be effective, the technology must be accurate for travelers from anywhere in the world and any racial background or demographic. CBP currently uses the technology for identity verification at 172 airports around the world, including at exit in 32 U.S. airports. To date, more than 77 million travelers have participated in the biometric facial comparison process at air, land and sea ports of entry.
Policymaking should focus on ensuring that facial recognition technology continues its rapid improvement, that only the most accurate technology is used in key applications and that it is used in bounded, appropriate ways that benefit society. SIA and stakeholders across multiple industries have urged Congress and the Biden administration to provide additional resources to NIST testing and evaluation programs, allow expanded testing activities NIST has identified to more thoroughly and regularly evaluate performance of the technology across demographic variables, test the full range of available algorithms, coordinate with federal agencies that deploy facial recognition in the field and assess identification processes incorporating both facial recognition technology and trained human review.
Even though most states and Congress have so far rejected bans on facial recognition as extreme, such policies are beginning to have a real-world impact on public safety in several jurisdictions that limit its use in law enforcement investigations. Steps can be taken to ensure the technology is used accurately, ethically and responsibly without limiting beneficial and widely supported applications. SIA has developed policy principles that guide the commercial sector, government agencies and law enforcement on how to use facial recognition in a responsible and ethical manner, released comprehensive public polling on facial recognition use across specific applications and published information about beneficial uses of the technology.