
AI MATTERS, VOLUME 5, ISSUE 2 5(2) 2019

The Scales of (Algorithmic) Justice: Tradeoffs and Remedies
Matthew Sun (Stanford University; [email protected])
Marissa Gerchick (Stanford University; [email protected])
DOI: 10.1145/3340470.3340478
Copyright © 2019 by the author(s).

Introduction

Every day, governmental and federally funded agencies — including criminal courts, welfare agencies, and educational institutions — make decisions about resource allocation using automated decision-making tools (Lecher, 2018; Fishel, Flack, & DeMatteo, 2018). Important factors surrounding the use of these tools are embedded both in their design and in the policies and practices of the various agencies that implement them. As the use of such tools becomes more common, a number of questions have arisen about whether using these tools is fair, or in some cases, even legal (K.W. v. Armstrong, 2015; ACLU, Outten & Golden LLP, and the Communications Workers of America, 2019).

In this paper, we explore the viability of potential legal challenges to the use of algorithmic decision-making tools by the government or federally funded agencies. First, we explore the use of risk assessments at the pre-trial stage in the American criminal justice system through the lens of equal protection law. Next, we explore the various requirements to mount a valid discrimination claim — and the ways in which the use of an algorithm might complicate those requirements — under Title VI of the Civil Rights Act of 1964. Finally, we suggest the adoption of policies and guidelines that may help these governmental and federally funded agencies mitigate the legal (and related social) concerns associated with using algorithms to aid decision-making. These policies draw on recent lawsuits relating to algorithms and on policies enacted in the EU by the General Data Protection Regulation (GDPR) (2016).

Algorithms and Equal Protection

One case of algorithmic decision-making in the public domain that has been subjected to increased scrutiny in recent years is the use of risk assessments in the criminal justice system. Here, we focus on the use of criminal risk assessment at the pre-trial stage. The goal of risk assessment tools (RATs) at the pre-trial stage is typically to estimate a defendant's likelihood of engaging in a particular future action (for example, committing a new crime or failing to appear in court) based on their similarity to defendants who have committed those actions in the past (Summers & Willis, 2010). This similarity is typically determined using factors regarding a defendant's criminal history but may also include information about a defendant's personal and social history such as their age, housing and employment status, and in some cases, their gender (Summers & Willis, 2010; State v. Loomis, 2016). Risk assessments are not themselves decision-makers regarding detention; rather, they are tools used by a human decision-maker - typically a judge or magistrate (Desmarais & Lowder, 2019).

In this section, we explore legal challenges pertaining to risk assessments on the basis that their use, under some circumstances, may violate constitutional protections. In particular, the Fifth Amendment guarantees equal protection under due process of law and applies to the federal government (U.S. Const., amend. V.), while the Fourteenth Amendment guarantees equal protection and due process of law and applies to the states (U.S. Const., amend. XIV.). Our analysis focuses on the application of equal protection law to the use of algorithmic risk assessments. Specifically, we discuss policies around the use of gender and proxies for race in risk assessments and how each might interact with equal protection of the law.

When an individual or entity believes that their right to equal protection has been violated by a governmental policy - such as the use of a risk assessment algorithm at the pretrial stage - they may challenge such a policy by, first, proving that the policy does indeed discriminate in a way that is or was harmful to the individual (Legal Information Institute, 2018a). The court evaluating the matter would then analyze the policy in question through one of four possible lenses - strict scrutiny, intermediate scrutiny, rational basis scrutiny, or a combination of the prior three - depending on the characteristic (race, national origin, gender, etc.) in question (Legal Information Institute, 2018a).

One such notable challenge, which we reference in the subsequent discussion, was State of Wisconsin v. Loomis (2016), in which Eric Loomis challenged the use of the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) risk assessment to inform a judge's decision about how long his prison sentence would be. Loomis challenged the use of COMPAS on the grounds that it violated his constitutional right to due process because the tool itself was proprietary (in particular, Loomis knew the factors used on the assessment but did not know how each of those factors was weighted and translated into a score, and thus could not challenge its scientific validity), and because the tool used gender as a factor in the assessment (State v. Loomis, 2016).

Factor 1: Use of gender

Though many risk assessments used at the pretrial stage in the United States do not include gender as a factor in the calculation of risk scores (Latessa, Smith, Lemke, Makarios, & Lowenkamp, 2009; VanNostrand et al., 2009), some pretrial risk assessments do consider gender, like COMPAS did in the case of Eric Loomis (State v. Loomis, 2016). Moreover, evidence indicates that risk assessments may not be equally predictive across genders, and may overestimate the recidivism risk of women compared to men (Skeem, Monahan, & Lowenkamp, 2016). Such evidence suggests the counterintuitive idea that including gender in the calculation of risk scores may be more equitable than excluding it. To illustrate the complexities of this point, we consider two hypothetical scenarios regarding risk assessments and gender.

Consider a hypothetical risk assessment X that includes gender in its calculation of risk scores; assume X has been challenged on the basis that its use of gender violates equal protection. Equal protection claims involving gender classifications are subject to intermediate scrutiny, a test established by the Supreme Court in Craig v. Boren (1976). To pass intermediate scrutiny, the policy in question must "advance an important government interest" by means that are "substantially related to that interest" (Legal Information Institute, 2018b; Craig v. Boren, 1976). The defendant (the jurisdiction that uses X to inform pretrial release decisions) might argue that, because judges rely on the accuracy of risk scores when making decisions about who to release and because these risk scores are meant to inform their decision-making, the use of gender in X advances an important government interest - ensuring public safety through release determinations. The defendant might also argue that, given the evidence on differential predictive power by gender, the use of gender is indeed a means that is "substantially related" to public safety.

In the case of Loomis, the court determined the use of gender was permissible because it improved accuracy, a non-discriminatory purpose (State v. Loomis, 2016). Yet some argue that such evidence regarding the differential predictive power by gender is too general. Legal scholar Sonja Starr has argued that because the Supreme Court has rejected the use of broad statistical generalizations about groups to justify discriminatory classifications, the use of gender in risk assessment (specifically at sentencing) is unconstitutional (Starr, 2014). In the case of X, the court would have to consider, given the relevant evidence, if it is actually the case that using gender as a factor is substantially related to public safety, weighing the tension between the group classifications in X and the principle of individualized decision-making in the criminal justice system.

Now consider risk assessment Y, which doesn't include gender in its calculation of risk scores, and suppose that a jurisdiction that uses Y has analyzed its own data and found that Y is better at predicting recidivism for men than it is at predicting recidivism for women. In this case, the policy in question is facially neutral (the use of Y doesn't appear to be discriminatory towards women and doesn't specifically include gender in its calculations), but nonetheless has a disparate impact because it rates women as higher risk than they actually are. If the use of Y were challenged under equal protection, the challenger would have to show intent - in particular, that the governmental body using Y intended to discriminate against women by using Y. In Personnel Administrator of Massachusetts v. Feeney (1979), the Supreme Court was faced with the question of whether a facially neutral policy that had a disparate impact on women was a violation of equal protection. A key question was whether the "foreseeability" of the policy's disparate impact was sufficient proof of discriminatory intent; the court held that it was not (Weinzweig, 1983). Thus, if the ruling from Feeney were applied to the hypothetical case regarding Y, awareness of Y's differential predictive power for men and women may not necessarily qualify as proof of intent to discriminate, and the equal protection claim against Y may fall short.
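
To make the hypothetical concrete, the sketch below shows one way a jurisdiction might check whether a tool like Y is differentially predictive across genders: compute the discriminative power (AUC) of the risk score separately for men and women. Everything here is a hypothetical placeholder, including the synthetic data and the noise model that builds in the pattern described above; it is not a description of any real assessment.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    # Synthetic stand-in for the jurisdiction's records: a risk score produced
    # by "Y", an observed outcome, and each defendant's gender. The noise model
    # simply makes the score track outcomes less closely for women.
    rng = np.random.default_rng(0)
    n = 5_000
    gender = rng.choice(["F", "M"], n)
    true_risk = rng.uniform(0, 1, n)
    noise_sd = np.where(gender == "F", 0.4, 0.1)
    score = np.clip(true_risk + rng.normal(0.0, noise_sd), 0, 1)
    recidivated = rng.binomial(1, true_risk)

    def predictive_power_by_gender(y_true, risk_score, gender):
        """AUC of the risk score computed separately within each gender group."""
        return {g: roc_auc_score(y_true[gender == g], risk_score[gender == g])
                for g in np.unique(gender)}

    print(predictive_power_by_gender(recidivated, score, gender))
    # The AUC for "M" comes out higher than for "F" in this synthetic setup,
    # the kind of differential predictive power described above.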

Factor 2: Use of proxies for race

Now consider a hypothetical risk assessment Z that uses factors such as the stability of a defendant's housing or their employment status - in practice, many risk assessments do consider these factors, as they are correlated with recidivism risk (Summers & Willis, 2010). However, these factors may serve as proxies for race (Barocas & Selbst, 2016; Corbett-Davies & Goel, 2018). Though classifications involving race or national origin are typically subject to strict scrutiny, absent an explicit discriminatory classification, both disparate impact and discriminatory intent are required to even trigger a scrutiny test (as they would be in the hypothetical case of Y, described above) (Arlington Heights v. Metropolitan Housing Dev. Corp., 1977). Thus, for Z's use to be challenged because of its use of proxies for race, one would need to show both that Z has a disparate impact (for example, that though scores inform decision-making for all people, Z is less accurate for minorities than for White people, which may or may not be true in the case of this hypothetical) and that Z was designed or used to be discriminatory against the minority group(s) in question. Demonstrating this intent may prove challenging because of the correlation between these socioeconomic factors and recidivism risk; nonetheless, the tension between statistical generalizations about groups of people and the right to an individualized decision for each defendant is ever present.

More broadly, legal challenges to the use of RATs under constitutional law speak to an underlying theme of the use of algorithms more generally: the use of these tools does not fit neatly into established legal standards (Barocas & Selbst, 2016), and tradeoffs will be present, whether mathematical, social, both, or otherwise (Corbett-Davies & Goel, 2018). Moreover, in the presence of facially neutral RATs, understanding intent is crucial to understanding if the law has been violated. In the remedies section, we propose inquiries around RAT implementation that may help clarify the intent of policymakers and agencies who adopt these tools and inform the public about the agencies' decision-making rationale in the presence of tradeoffs.

Algorithms and Civil Rights Law

Beyond the constitutional arena, disparate impact theory has another, distinct form in civil rights law. Famously, Title VII of the Civil Rights Act of 1964 explicitly bars employment practices that would generate a disparate impact, defined by the following conditions: 1) the policy creates an adverse effect that falls disproportionately upon a particular protected class, 2) the specific policy in place is not a "business necessity," and 3) there exists an alternative policy that would not result in disproportionate harms (42 U.S.C. § 2000e et seq.). In Griggs v. Duke Power Co. (1971), the Supreme Court found that Duke Power's requirement of a high school diploma for its higher paid jobs was illegal under Title VII of the Civil Rights Act of 1964 because it disproportionately barred minority groups from those positions and did not have any demonstrable relation to performance on the job.

Beyond Title VII and employment practices, the Court has ruled in multiple cases involving federal statutes with disparate impact provisions, such as Lau v. Nichols (1974) and Alexander v. Choate (1985), that policies which create adverse disparate impact are in violation of the law, regardless of the intent of those policies or whether the policies are applied equally to all groups. Such policies that create a disparate impact constitute a violation of Title VI of the Civil Rights Act of 1964, which was enacted at the same time as Title VII (42 U.S.C. § 2000d). We now shift focus to Title VI because Title VI stipulates that all programs or activities that receive federal funding may not perpetrate or perpetuate discrimination on the grounds of race, color, or national origin, while Title VII only concerns employment (42 U.S.C. § 2000d). However, we note that the U.S. Department of Justice has recently stated that Title VI "follows...generally...the Title VII standard of proof for disparate impact"; thus, cases that concern Title VII "may shed light on the Title VI analysis in a given situation" (U.S. Department of Justice, 2019).

Twenty-six federal agencies have Title VI regulations that address the disparate impact standard, including the USDA, the Department of Health and Human Services, and the Department of Education (U.S. Department of Justice, 2019). These federal agencies provide funding to a massive array of public programs and the social safety net, including public schools, Medicaid, and Medicare. In Lau v. Nichols (1974), for example, the Court found that the San Francisco Unified School District was in violation of Title VI because it received federal funding yet imposed a disparate impact on non-English-speaking students, many of whom were not offered supplemental language instruction or were placed into special education classes.

This regulatory and legal landscape sets the stage for the application of disparate impact theory under civil rights law as an important possible remedy for discrimination in algorithmic decision-making. As state and local governments increasingly turn towards automated tools to lower costs, ease administrative burdens, and deliver benefits, we are likely to observe cases where algorithms, especially when deployed without comprehensive oversight and auditing processes in place, create unequal outcomes. In her book Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor, Professor Virginia Eubanks examines a statistical tool used by the Allegheny County Office of Children, Youth, and Families that processes data from public programs to predict the likelihood that child abuse is taking place in individual households across the county (Eubanks, 2018). Because the frequency of calls previously made on a family is an input to the algorithm, Eubanks argues that the tool may systematically discriminate against Black families, since Black families are far more likely to be called on by mandatory reporters or anonymous callers (Misra, 2018). The Office of Children, Youth, and Families is overseen by the Allegheny County Department of Human Services, which receives federal funding and as a result may be subject to regulation under Title VI (Allegheny County, 2019).

In these cases and many others, there is often no obvious evidence of discriminatory intent; to the contrary, algorithms are commonly deployed in the hopes of mitigating human biases (Lewis, 2018). In Allegheny County, officials stressed that the predictive risk-modeling tool would guide, not replace, human decision-making (Hurley, 2018; Giammarise, 2017). Yet, we often see that algorithms may still produce significant adverse impact on populations when analyzed on the basis of race or gender. As a result, groups or individuals may naturally seek to challenge the use of such algorithms in programs receiving federal funding under Title VI. According to a Justice Department legal manual on Title VI, three conditions are required to constitute a violation of Title VI: 1) statistical evidence of disparate adverse impact on a race, color, or national origin group, 2) the lack of a substantial legitimate justification for the policy, and 3) the presence of a less discriminatory alternative that would achieve the same objective but with less of a discriminatory effect (42 U.S.C. § 2000d).

In the following sections, we explore how disparate impact claims against the usage of algorithms might fail to succeed in court for three separate reasons. These challenges can be summarized as the difficulty of establishing a less discriminatory alternative, the use of predictive accuracy as a "substantial legitimate justification" for the policy, and the possibility that the only way to ameliorate disparate impact would be to treat different groups differently, thus triggering a disparate treatment legal challenge. We explore the current standard for how a complainant (i.e., plaintiff) must prove disparate impact under Title VI, and how a recipient (i.e., defendant) might ultimately circumvent their claims.

Challenge 1: Proving the presence of a less discriminatory alternative

The phrase "less discriminatory alternative" implies that there exists a way to compare a set of policies and determine which is the least discriminatory. However, when it comes to algorithmic decision-making, the definition of "fairness" (in other words, the absence of discrimination) is hotly debated (Gajane & Pechenizkiy, 2017). For example, the notion of "classification parity" is defined as the requirement that certain measures of predictive performance, such as the false positive rate, precision, and proportion of decisions that are positive, be equal across protected groups (Corbett-Davies & Goel, 2018). To illustrate, in order to satisfy false positive classification parity, the Allegheny County child neglect prediction algorithm must make an incorrect positive prediction (i.e., predict the presence of child abuse in a family where none is occurring) at the same rate for both White and Black families. Another commonly referenced notion of fairness is "calibration," which requires that outcomes be independent of protected class status after controlling for estimated risk (Corbett-Davies & Goel, 2018). If the aforementioned algorithm were to satisfy calibration, child abuse must be found to actually occur at similar rates in White and Black families predicted to have a 10% risk of child neglect.
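
To make the contrast between these two notions concrete, the sketch below computes a group-wise false positive rate (one form of classification parity) and a group-wise outcome rate among people with similar scores (a rough calibration check) on synthetic data. The groups, scores, and threshold are hypothetical and chosen only to make the definitions operational; they do not model the Allegheny County tool.

    import numpy as np

    # Illustrative synthetic data: a protected attribute, a risk score, an
    # outcome drawn from that score, and a decision made by thresholding it.
    rng = np.random.default_rng(1)
    n = 10_000
    group = rng.choice(["A", "B"], n)
    score = rng.uniform(0, 1, n)
    outcome = rng.binomial(1, score)
    decision = (score >= 0.5).astype(int)

    def false_positive_rate(y_true, y_pred, mask):
        """Share of true negatives in a group that received a positive decision."""
        negatives = (y_true == 0) & mask
        return ((y_pred == 1) & negatives).sum() / negatives.sum()

    # Classification parity (here, equal false positive rates across groups):
    print("FPR, group A:", false_positive_rate(outcome, decision, group == "A"))
    print("FPR, group B:", false_positive_rate(outcome, decision, group == "B"))

    # Calibration: among people with roughly the same score, does the outcome
    # occur at roughly the same rate in each group?
    band = (score > 0.05) & (score < 0.15)   # people scored near 10% risk
    print("outcome rate near 10%, group A:", outcome[band & (group == "A")].mean())
    print("outcome rate near 10%, group B:", outcome[band & (group == "B")].mean())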

These definitions may sound like they measure roughly similar phenomena, but recent research on algorithmic fairness shows that they are often in competition, producing provable mathematical tradeoffs among each other (Corbett-Davies & Goel, 2018). Optimizing calibration, for example, may result in reductions in classification parity. ProPublica's analysis of the use of COMPAS at the pretrial stage in Broward County, Florida revealed that the algorithm yielded much higher false positive rates for Black defendants than it did for White ones (Angwin, Larson, Mattu, & Kirchner, 2016), but at the same time, individuals given the same COMPAS risk score recidivated at the same rate (Corbett-Davies, Pierson, Feller, Goel, & Huq, 2017). In other words, the algorithm was calibrated, but was more likely to incorrectly classify Black defendants as "high risk" for recidivism than White defendants. To further complicate the notion of discrimination, the algorithm used in Allegheny County to predict risk of child neglect was miscalibrated in a way that disfavored White children: White children who received the same risk score for neglect as Black children were actually less likely to be experiencing maltreatment (Chouldechova, Benavides-Prado, Fialko, & Vaithianathan, 2018). In this case, Eubanks' critiques of the algorithm's inputs and other researchers' empirically measured calibration result in directly opposing views of which racial group is experiencing discrimination.

Without a single, legally codified definition of fairness, we see the first obstacle to a successful disparate impact claim: a recipient can argue that no less discriminatory alternative exists, since any alternative will likely involve tradeoffs across different measures of fairness. Moreover, we suggest that it is insufficient to choose one measure of fairness as the priority in all cases, since the societal costs associated with different fairness measures vary across specific applications (Corbett-Davies & Goel, 2018). For example, one might argue that the societal and/or moral cost of incorrectly detaining a Black individual who will not recidivate is far greater than the cost of incorrectly flagging a Black household for child abuse. Another person might take the opposite position, but in either case, blindly prioritizing false positive parity across both tasks would fail to recognize the unique costs associated with each one.

There also exist practical legal challenges and ambiguity regarding the existence of a less discriminatory alternative. In the realm of Title VII, scholars disagree about whether "refusal" to adopt a less discriminatory procedure means that the employer cannot be held liable until it has actively investigated such an alternative and subsequently rejected it (Barocas & Selbst, 2016). This debate raises the question of whether employers should be held responsible to perform a costly, exhaustive search of all potential alternatives, or whether the cost of doing such a search would functionally mean that less discriminatory alternatives do not exist. According to the U.S. Department of Justice's guidance regarding Title VI, the burden is on the complainant to identify less discriminatory alternatives (U.S. Department of Justice, 2019). This may pose a significant challenge to complainants, as they may not have access to the documents and data needed to show which alternatives would be equally effective in practice.

Challenge 2: Substantial legitimate justification

The second failure mode for a disparate impact claim is that the recipient has articulated a "substantial legitimate justification" for the challenged policy (42 U.S.C. § 2000d). As the Justice Department discloses in its Title VI legal manual, "the precise nature of the justification inquiry in Title VI cases is somewhat less clear in application" (U.S. Department of Justice, 2019). For example, the EPA stated in its 2000 Draft Guidance for Investigating Title VI Administrative Complaints that the "provision of public health or environmental benefits...to the affected population" was an "acceptable justification" (Draft Title VI Guidance, 2000). This document was compiled after a 60-day period of 7 public listening sessions held at the request of state and local officials seeking clarification in an effort to avoid Title VI violations (Mank, 2000). In contrast, Title VII substitutes the "legitimate justification" requirement with a "business necessity" stipulation (42 U.S.C. § 2000e et seq.). Because Title VI covers a broad scope of federally funded programs, "legitimate justification" must be defined on a case-by-case basis, whereas "business necessity" has a narrower meaning in case law due to Title VII's specific focus on hiring practices (U.S. Department of Justice, 2019).

In the case of programmatic decision-making, discrimination may occur when practitioners do not properly audit their algorithm before and while it is deployed. Such an audit could take many forms, such as running a randomized controlled trial before permanently implementing an algorithm or releasing public reports every year regarding how well the algorithm is performing. (For the purposes of the following discussion, we assume that the task at hand is one of binary/multiclass classification, also known as a "screening procedure.") In the field of machine learning, algorithms are commonly trained by iteratively improving performance on a given dataset, as measured by average classification accuracy (Alpaydin, 2009). If average classification accuracy is not disaggregated across protected groups present in the dataset, disparities in the algorithm's performance may only be discovered once the algorithm is already deployed for real-world use (Buolamwini & Gebru, 2018), which could result in a subsequent disparate impact claim. In this sequence of events, the potentially offending entity was optimizing for overall accuracy and failed to take the possibility of disparate impact into account.
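
The kind of audit described here can be as simple as reporting accuracy per protected group alongside the headline average. The sketch below uses a hypothetical toy table and invented column names to show how a single aggregate figure can hide a large gap between groups.

    import pandas as pd

    def disaggregated_accuracy(df, label_col, pred_col, group_col):
        """Overall accuracy plus accuracy within each protected group."""
        correct = df[label_col] == df[pred_col]
        return correct.mean(), correct.groupby(df[group_col]).mean()

    # Toy audit data: the classifier does noticeably worse on group B.
    df = pd.DataFrame({
        "label": [1, 0, 1, 0, 1, 0, 1, 0],
        "pred":  [1, 0, 0, 0, 1, 1, 1, 1],
        "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    })
    overall, by_group = disaggregated_accuracy(df, "label", "pred", "group")
    print("overall accuracy:", overall)  # the single number that hides the gap
    print(by_group)                      # the per-group breakdown an audit would surface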

This scenario raises the question of whether the desire to optimize raw predictive accuracy counts as a "substantial legitimate justification" for an algorithm whose outputs are biased. It seems plausible that any recipient could argue that predictive accuracy is a legitimate justification: after all, optimizing accuracy maximizes the total number of decisions made correctly, provided that the demographic makeup of the dataset resembles that of the real-world population. Optimizing for any other metric, such as an arbitrary fairness measure, may lead to an algorithm with lower overall predictive accuracy (Zliobaite, 2015; Kleinberg, Mullainathan, & Raghavan, 2016). A recipient of a disparate impact claim could argue that maximizing accuracy leads to higher efficiency and lower costs for cash-strapped government agencies. In the Allegheny County example, having an algorithm accurately flag families for risk of child neglect reduced the time required to manually screen applications, saving time and labor. Because "substantial legitimate justification" is relatively ambiguous and case-specific, it may be difficult for a complainant to prove that maximizing classification accuracy is not a legitimate justification.

Challenge 3: A disparate impact and disparate treatment Catch-22

It's important to note that optimizing accuracy and fairness measures is not always a zero-sum game. In the aforementioned research about gender in criminal risk assessment, including gender as a variable in the dataset improved calibration and predictive accuracy because women with criminal histories similar to men's recidivate at lower rates (Skeem et al., 2016). (Notably, gender is not a protected attribute under disparate impact clauses in civil rights law.) Similarly, in other cases, we may be able to improve predictive accuracy and produce gains in fairness measure(s) if some predictive latent variable is identified and included in the dataset (Jung, Corbett-Davies, Shroff, & Goel, 2018).

35

AI MATTERS, VOLUME 5, ISSUE 2 5(2) 2019

Consider the case of a hypothetical algorithm that estimates recidivism risk and takes race as an input, but does not take criminal history as an input. Assume in this scenario that criminal history is more predictive of recidivism than race. If Black people are disproportionately likely to have prior convictions - perhaps due to disparate policing practices - then the algorithm will "penalize" all Black people by giving them higher risk scores, even ones without prior convictions. If criminal history is added to the dataset and the algorithm is retrained, the algorithm's accuracy will increase due to the addition of a predictive variable. In addition, the algorithm's performance on fairness measures may increase as well, since Black people without criminal histories will no longer receive a penalty for their racial status.
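
A small simulation can make this hypothetical concrete. In the sketch below, all of the data is synthetic and the parameters are invented purely for illustration: recidivism is driven by prior convictions, priors are correlated with race through uneven policing, and a model trained on race alone is compared with one that also sees criminal history.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # Synthetic world matching the hypothetical above (all numbers invented).
    rng = np.random.default_rng(2)
    n = 20_000
    black = rng.binomial(1, 0.5, n)                           # 1 = Black, 0 = White
    prior = rng.binomial(1, np.where(black == 1, 0.5, 0.25))  # uneven policing -> more recorded priors
    recid = rng.binomial(1, 0.1 + 0.6 * prior)                # recidivism driven by priors, not race

    race_only = black.reshape(-1, 1)
    with_history = np.column_stack([black, prior])

    m1 = LogisticRegression().fit(race_only, recid)
    m2 = LogisticRegression().fit(with_history, recid)

    print("accuracy, race only:   ", accuracy_score(recid, m1.predict(race_only)))
    print("accuracy, with history:", accuracy_score(recid, m2.predict(with_history)))

    # Average risk assigned to Black defendants *without* prior convictions:
    idx = (black == 1) & (prior == 0)
    print("risk, race only:   ", m1.predict_proba(race_only[idx])[:, 1].mean())
    print("risk, with history:", m2.predict_proba(with_history[idx])[:, 1].mean())
    # Adding criminal history both raises accuracy and removes the penalty
    # attached to race for defendants with no priors.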

It may be the case, however, that the latent variable whose inclusion would improve fairness and accuracy is the protected attribute itself (Jung et al., 2018). Including gender as an input to the algorithm would resolve the unequal outcomes in which women are unfairly penalized, but at the same time, explicitly altering decisions based on an individual's gender is a clear example of disparate treatment. The same would be true with regard to attributes protected under Title VI, such as race, color, and national origin (42 U.S.C. § 2000d). Disparate treatment, in which policies explicitly treat members of different protected groups differently, is prohibited by Title VI, as well as many other civil rights laws (U.S. Department of Justice, 2019). Disparate treatment cases are arguably easier to prove, since discrimination is explicitly codified in a recipient's policies, while disparate impact cases rely on measures of a policy's outcomes de facto (Selmi, 2005). The fact that both disparate treatment and disparate impact violate civil rights statutes may create a Catch-22 for entities seeking to resolve disparate impact in algorithmic decision-making.

Indeed, Kroll et al. (2016) note this tension as manifested in the Supreme Court's decision in a 2009 case involving Title VII, Ricci v. DeStefano (2009). In the case, the New Haven Civil Service Board (CSB) refused to certify the results of a facially neutral test for firefighter promotions out of disparate impact concerns, noting that the pass rate for minorities was half that for White candidates. As Kroll et al. (2016) note, the Court's decision to rule against the CSB "demonstrates the tension between disparate treatment and disparate impact," since a neutral policy can create disparate outcomes, but mitigating the disparate impact would require discriminatory treatment of different groups.

Remedies

As we have seen from the above analysis, there is reason to believe that today's concerns regarding algorithmic bias will not be resolved in the courts alone, despite the high number of pending court cases regarding the use of algorithms. In the constitutional realm, absent a suspect classification, both disparate impact and discriminatory intent are needed to prove a violation of the law. In addition, the current requirements to make a successful claim of disparate impact under civil rights law are vague with regards to defining what a discriminatory outcome is, which may allow recipients of complaints to leverage whichever mathematical constructs of fairness best support the use of their algorithm.

If we cannot expect to find remedies from the judiciary, where should citizens turn for relief? To address the above concerns, we propose a remedy in the form of a unified, collaborative effort between agencies and legislatures, both at the federal and state levels. We detail what such an effort would look like below, using an international regulation to inform our proposals.

The European Union's General Data Protection Regulation (GDPR) offers a compelling case for broad legal regulations coupled with significant enforcement power. The GDPR provides strong protections for individual privacy by allowing governmental agencies to pursue fines and investigations into private companies for data mismanagement and privacy breaches (Steinhardt, 2018). With regards to automated decision-making, the GDPR (2016) makes mention of a "right to explanation" for users who seek explanation for decisions made about them (e.g., loan denials) (Goodman & Flaxman, 2017). One of the European Commission's senior advisory bodies on data protection released a set of guidelines regarding automated decision-making, which included requirements for companies to provide explanations for how users' personal data was used by the algorithm (Casey, Farhangi, & Vogl, 2018). The same body even included a recommendation for companies to introduce "procedures and measures to prevent...discrimination" and to perform "frequent assessments...to check for any bias" (17/EN. WP 251, 2017).

The fact that the mandates behind the GDPR have been enforced in practice leads us to suggest an approach in the U.S. that similarly combines comprehensive legislation with new enforcement powers for government agencies (Lawson, 2019). Of course, attitudes and policies regarding the regulation of private companies differ in the U.S. and the EU (Hawkins, 2019). Thus, our proposal would not seek to impose regulations on all private companies across the U.S., but rather on public entities that are already subject to significant government oversight, such as federal agencies or federally funded programs. Indirectly, this implicates the private companies that such agencies may contract with to provide algorithmic tools or services.

The remedies we suggest apply to both of the main use cases we previously described; for federally funded agencies, these remedies may be enacted through legislation or executive rule-making. Similarly, these remedies could also be applied at the state and local level. In both cases, we recommend the creation or significant expansion of agencies focused specifically on the technical oversight and evaluation of algorithmic tools. One existing body that might take up this burden is the newly created Science, Technology Assessment, and Analytics team at the U.S. Government Accountability Office (U.S. Government Accountability Office, 2019). While courts have been reluctant to conduct a "searching analysis of alternatives," federal agencies are "subject matter experts charged with Title VI enforcement duties" and "are well-equipped to...evaluate carefully potential less discriminatory alternatives" (U.S. Department of Justice, 2019).

Our remedy additionally attempts to recognize and address the significant gap in current civil rights legislation with regards to definitions of discriminatory intent and disparate impact — which can generate a Catch-22 of sorts, even for well-meaning actors. Existing civil rights legislation largely focuses on barring discriminatory intent that results in differential treatment on the basis of protected attributes, such as race. Today, however, we see that in order to remedy unintended discrimination in algorithmic decision-making, we may have to take such protected attributes into account: essentially, using differential treatment to ameliorate disparate outcomes. Federal and state legislation must acknowledge this nuance, allowing practitioners to use protected attribute data to promote the most fair outcomes, where the relevance of such data and a suitable notion of fairness are determined on a case-by-case basis. For example, under a bail reform law in New Jersey, agencies may collect information about a defendant's race and gender for potential use in a risk assessment calculation, subject to the condition that decisions are not discriminatory along race or gender lines (NJ Rev Stat § 2A:162-25, 2014).

Legislation (or other regulation) should stipulate that public agencies that are going to adopt algorithms to help make decisions must submit the following information to a relevant oversight agency (at the federal level, the office described earlier, and at the state level, some state or local agency with relevant expertise) prior to the algorithm's adoption:

• What decision will the algorithm be used to make or help make? How was that decision or type of decision made before the use of the algorithm?

• What are the reasons to implement such an algorithm? Is the algorithm less expensive, or will it increase efficiency? Is the intent to make the decision-making process more objective?

• What are the particular use cases and use context of the algorithm? How will the algorithm's outputs be interpreted? Will a human decision-maker be involved? Who is the population that the algorithm may be used on? Are there any exceptions to this policy?

• How will the algorithm be evaluated and, if necessary, revised? Has funding been allocated for regular oversight? Who will be performing the evaluations and how? Is the text (or source code) or training data of the algorithm publicly available?

• Were alternatives considered? What other options were considered, and why was this one chosen? What were the tradeoffs between the different choices?

The submission of this information to a governmental body and the public before an algorithm is employed in practice could provide greater clarity to both the public and regulators regarding discriminatory intent and the potential for discriminatory outcomes. Furthermore, by actively requiring actors to come up with a plan to monitor the algorithm, consider alternatives, and think critically about the algorithm in the context of human systems, this policy may decrease the likelihood of algorithms producing unintended negative consequences in practice.

Acknowledgments

We would like to thank Keith Schwarz for helpful feedback.

References

17/EN. WP 251. (2017). Guidelines on Automated individual decision-making and Profiling for the purposes of Regulation 2016/679.
ACLU, Outten & Golden LLP, and the Communications Workers of America. (2019). Facebook EEOC complaints. https://www.aclu.org/cases/facebook-eeoc-complaints. (Online; accessed June 1, 2019)
Alexander v. Choate, 469 U.S. 287 (1985)
Allegheny County. (2019). DHS funding. https://www.county.allegheny.pa.us/Human-Services/About/Funding-Sources.aspx. (Online; accessed May 18, 2019)
Alpaydin, E. (2009). Introduction to machine learning. MIT Press.
Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016). Machine bias. ProPublica, May, 23.
Arlington Heights v. Metropolitan Housing Dev. Corp., 429 U.S. 252 (1977)
Barocas, S., & Selbst, A. D. (2016). Big data's disparate impact. Calif. L. Rev., 104, 671.
Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency (pp. 77–91).
Casey, B., Farhangi, A., & Vogl, R. (2018). Rethinking explainable machines: The GDPR's "right to explanation" debate and the rise of algorithmic audits in enterprise.
Chouldechova, A., Benavides-Prado, D., Fialko, O., & Vaithianathan, R. (2018). A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions. In Conference on Fairness, Accountability and Transparency (pp. 134–148).
Civil Rights Act of 1964 Title VI, 78 Stat. 252, 42 U.S.C. § 2000d.
Civil Rights Act of 1964 Title VII, 42 U.S.C. § 2000e et seq.
Corbett-Davies, S., & Goel, S. (2018). The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv preprint arXiv:1808.00023.
Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., & Huq, A. (2017). Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 797–806).
Craig v. Boren, 429 U.S. 190 (1976)
Desmarais, S. L., & Lowder, E. M. (2019). Pretrial risk assessment tools: A primer for judges, prosecutors, and defense attorneys. MacArthur Foundation Safety and Justice Challenge.
Draft Title VI Guidance for EPA Assistance Recipients Administering Environmental Permitting Programs (Draft Recipient Guidance) and Draft Revised Guidance for Investigating Title VI Administrative Complaints Challenging Permits (Draft Revised Investigation Guidance); Notice, 65 Fed. Reg. 124 (June 27, 2000). Federal Register: The Daily Journal of the United States.
Eubanks, V. (2018). Automating inequality: How high-tech tools profile, police, and punish the poor. St. Martin's Press.
Fishel, S., Flack, D., & DeMatteo, D. (2018). Computer risk algorithms and judicial decision-making. Monitor on Psychology.
Gajane, P., & Pechenizkiy, M. (2017). On formalizing fairness in prediction with machine learning. arXiv preprint arXiv:1710.03184.
Giammarise, K. (2017). Allegheny County DHS using algorithm to assist in child welfare screening. https://www.post-gazette.com/local/region/2017/04/09/Allegheny-County-using-algorithm-to-assist-in-child-welfare-screening/stories/201701290002. (Online; accessed May 18, 2019)
Goodman, B., & Flaxman, S. (2017). European Union regulations on algorithmic decision-making and a "right to explanation". AI Magazine, 38(3), 50–57.
Griggs v. Duke Power Co., 401 U.S. 424 (1971)
Hawkins, D. (2019). The Cybersecurity 202: Why a privacy law like GDPR would be a tough sell in the U.S. https://www.washingtonpost.com/news/powerpost/paloma/the-cybersecurity-202/2018/05/25/the-cybersecurity-202-why-a-privacy-law-like-gdpr-would-be-a-tough-sell-in-the-u-s/5b07038b1b326b492dd07e83/?utm_term=.1cc41e57f9cf. (Online; accessed May 18, 2019)
Hurley, D. (2018). Can an algorithm tell when kids are in danger? New York Times, 2.
Jung, J., Corbett-Davies, S., Shroff, R., & Goel, S. (2018). Omitted and included variable bias in tests for disparate impact. arXiv preprint arXiv:1809.05651.
Kleinberg, J., Mullainathan, S., & Raghavan, M. (2016). Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807.
Kroll, J. A., Barocas, S., Felten, E. W., Reidenberg, J. R., Robinson, D. G., & Yu, H. (2016). Accountable algorithms. U. Pa. L. Rev., 165, 633.
K.W. v. Armstrong, No. 14-35296 (9th Cir. 2015)
Latessa, E., Smith, P., Lemke, R., Makarios, M., & Lowenkamp, C. (2009). Creation and validation of the Ohio Risk Assessment System: Final report. Cincinnati, OH: University of Cincinnati.
Lau v. Nichols, 414 U.S. 563 (1974)
Lawson, R. P. (2019). GDPR enforcement actions, fines pile up. https://www.manatt.com/Insights/Newsletters/Advertising-Law/GDPR-Enforcement-Actions-Fines-Pile-Up. (Online; accessed May 18, 2019)
Lecher, C. (2018). What happens when an algorithm cuts your health care. The Verge.
Legal Information Institute. (2018a). Equal protection. https://www.law.cornell.edu/wex/equal_protection. (Online; accessed May 18, 2019)
Legal Information Institute. (2018b). Intermediate scrutiny. https://www.law.cornell.edu/wex/intermediate_scrutiny. (Online; accessed June 1, 2019)
Lewis, N. (2018). Will AI remove hiring bias? https://www.shrm.org/resourcesandtools/hr-topics/talent-acquisition/pages/will-ai-remove-hiring-bias-hr-technology.aspx. (Online; accessed May 18, 2019)
Mank, B. C. (2000). The draft recipient guidance and the draft revised investigation guidance: Too much discretion for EPA and a more difficult standard for complainants? Environmental Law Reporter, 30.
Misra, T. (2018). When criminalizing the poor goes high-tech. https://www.citylab.com/equity/2018/02/the-rise-of-digital-poorhouses/552161/?platform=hootsuite. (Online; accessed May 18, 2019)
NJ Rev Stat § 2A:162-25 (2014)
O.J. (L 119) (2016). Reg (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Dir 95/46/EC (General Data Protection Regulation).
Personnel Adm'r of Massachusetts v. Feeney, 442 U.S. 256 (1979)
Reg (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Dir 95/46/EC (General Data Protection Regulation). (2016).
Ricci v. DeStefano, 557 U.S. 557 (2009)
Selmi, M. (2005). Was the disparate impact theory a mistake? UCLA L. Rev., 53, 701.
Skeem, J., Monahan, J., & Lowenkamp, C. (2016). Gender, risk assessment, and sanctioning: The cost of treating women like men. Law and Human Behavior, 40(5), 580.
Starr, S. B. (2014). Evidence-based sentencing and the scientific rationalization of discrimination. Stan. L. Rev., 66, 803.
State v. Loomis, 881 N.W.2d 749 (2016)
Steinhardt, E. (2018). European regulators are intensifying GDPR enforcement. https://www.insideprivacy.com/eu-data-protection/european-regulators-are-intensifying-gdpr-enforcement/. (Online; accessed May 18, 2019)
Summers, C., & Willis, T. (2010). Pretrial risk assessment research summary. Washington, DC: Bureau of Justice Assistance.
United States Department of Justice. (2019). Title VI Legal Manual (Updated).
U.S. Government Accountability Office. (2019). Our new Science, Technology Assessment, and Analytics team. https://blog.gao.gov/2019/01/29/our-new-science-technology-assessment-and-analytics-team/. (Online; accessed May 18, 2019)
U.S. Const. amend. V.
U.S. Const. amend. XIV.
VanNostrand, M., et al. (2009). Pretrial risk assessment in Virginia.
Weinzweig, M. J. (1983). Discriminatory impact and intent under the equal protection clause: The Supreme Court and the mind-body problem. Law & Ineq., 1, 277.
Zliobaite, I. (2015). On the relation between accuracy and fairness in binary classification. arXiv preprint arXiv:1505.05723.

Marissa Gerchick is a rising senior at Stanford University studying Mathematical and Computational Science. She is interested in using data-driven tools to improve the American criminal justice system.

Matthew Sun is a rising senior at Stanford double majoring in Computer Science and Public Policy. He leads a student group called CS+Social Good and is interested in applied AI research for socially relevant issues.
