Algorithm Fairness
(sketch)
(Jan. 4, 2021; rev. March 3, 2023)
This page is a sketch for a page that will discuss various issues regarding perceptions that the use of algorithms to predict outcomes is unfair to certain demographic groups. Its creation at this time is prompted by a December 8, 2020 Crime Report article titled “Pretrial Risk Assessment Tools More Accurate Than ‘Human Judgments Alone’: Experts.”
That article discusses a November or December 2020 “Open Letter to the Pretrial Justice Institute” drafted or endorsed by a group of twenty scholars, a number of whom have written about algorithmic fairness issues. The Open Letter discusses a November 2020 report of the Pretrial Justice Institute titled “The Case Against Pretrial Risk Instruments” that had argued against the use of recidivism algorithms for pretrial release on the ground that the algorithms were unfair to black defendants. The Open Letter sensibly argues that decisions informed by recidivism algorithms are sounder than those that ignore such information, but it reflects the usual failure to understand the ways policies affect measures of racial disparity. The Open Letter also discusses the May 23, 2016 ProPublica article titled “How We Analyzed the COMPAS Recidivism Algorithm” (data from which are used on the Recidivism Illustration page to show that the more lenient a pretrial release policy, the greater will tend to be relative racial differences in rates of denial of release and the smaller will tend to be relative racial differences in rates of being granted release).
Perceptions about algorithm fairness or unfairness usually do not involve what would commonly be termed the overprediction or underprediction of a favorable or adverse outcome for one group compared with another. When one group (Group A) has a higher likelihood of experiencing outcome X than another group (Group B), an imperfect predictor of outcome X will tend to underpredict the outcome for Group A and overpredict the outcome for Group B. That is, for example, an employment test on which whites outperform blacks will commonly overpredict the performance of blacks and underpredict the performance of whites; a recidivism algorithm will commonly underpredict the recidivism of blacks and overpredict the recidivism of whites.
Perceptions about algorithm unfairness, however, generally lie in the fact that, notwithstanding the over- and underprediction patterns just mentioned, (a) among persons who experience outcome X after being evaluated by the predictor, a higher proportion of Group B than Group A will have been identified by the predictor as unlikely to experience outcome X (so-called false negatives[i]) and (b) among persons who do not experience outcome X after being evaluated by the predictor, a higher proportion of Group A than Group B will have been identified by the predictor as likely to experience the outcome (so-called false positives). The issue, which is currently much discussed with regard to the fairness of algorithms used to make decisions about arrested or convicted persons, is the same as that much discussed with regard to employment tests in Fairness in Employment Testing (Committee on the General Aptitude Test Battery, Commission on Behavioral and Social Sciences and Education, National Research Council, John A. Hartigan & Alexander K. Wigdor (eds.) (1989)) and in various other places in the 1980s and early 1990s.[ii] While I have not read a great deal of the literature on algorithmic fairness, that which I have read seems to reflect a universal unawareness of the extent to which the issues it addresses were covered decades ago in the employment testing context.[iii]
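The following is a minimal simulation sketch of the patterns just described (in Python, with all distributions, parameter values, and the cutoff chosen arbitrarily for illustration; nothing in it is drawn from any actual instrument or data set). It creates two groups whose underlying risk distributions differ, a noisy score, a prediction formed from pooled score deciles, and a single cutoff, and it then displays the tendency toward underprediction for the higher-risk group and overprediction for the lower-risk group, alongside the higher false positive rate for the higher-risk group and the higher false negative rate for the lower-risk group.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Group A has a higher underlying likelihood of outcome X than Group B.
group = rng.integers(0, 2, n)                                # 1 = Group A, 0 = Group B
true_risk = rng.normal(np.where(group == 1, 0.5, -0.5), 1.0)

# Outcome X occurs with probability rising in the true risk.
outcome = rng.random(n) < 1 / (1 + np.exp(-true_risk))

# The predictor sees only a noisy score, not the true risk.
score = true_risk + rng.normal(0.0, 1.0, n)

# Pooled prediction: the overall outcome rate within each pooled score decile.
decile = np.searchsorted(np.quantile(score, np.linspace(0.1, 0.9, 9)), score)
pooled_rate = np.array([outcome[decile == d].mean() for d in range(10)])
predicted = pooled_rate[decile]

for g, name in ((1, "Group A"), (0, "Group B")):
    m = group == g
    # Over-/underprediction: pooled prediction versus actual outcome rate in the group.
    print(name, "predicted rate:", round(predicted[m].mean(), 3),
          "actual rate:", round(outcome[m].mean(), 3))
    # Classification at an arbitrary cutoff: the top three pooled deciles are "positive".
    positive = decile[m] >= 7
    print(name, "false positive rate:", round(positive[~outcome[m]].mean(), 3),
          "false negative rate:", round((~positive)[outcome[m]].mean(), 3))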
Whether outcome X is the favorable or the corresponding adverse outcome in a particular context, as well as who are deemed false positives or false negatives, is entirely arbitrary. Thus, in the case of employment tests where whites have higher scores than blacks, successful performance of the job is commonly regarded as outcome X. Perceived test unfairness will then commonly be described in terms of higher false negative rates for blacks (i.e., higher rates of test failure among blacks than whites who performed well on the job) and higher false positive rates for whites (i.e., higher pass rates among whites than blacks who did not perform well on the job). Since the test is no more predicting successful performance of the job than it is predicting unsuccessful performance of the job, the situation could as well be evaluated from the perspective of higher false negatives for whites and higher false positives for blacks (that is, with test failure being regarded as a prediction of unsuccessful job performance).
In the case of algorithms used to predict recidivism, recidivism is commonly treated as X. Thus, in situations where black defendants have higher recidivism risk scores than white defendants, perceived algorithm unfairness will commonly be cast in terms of higher false positives for blacks and higher false negatives for whites. Commonly the racial differences in false positives and false negatives will be cast in relative terms.
With regard to both employment tests and recidivism algorithms, the characterization of the issue is often misleading. In the case of employment tests, the matter has at times been characterized in terms of underprediction of successful performance for blacks and overprediction for whites, when in fact the opposite is the case.
In the case of recidivism algorithms, the May 23, 2016 ProPublica article titled “How We Analyzed the COMPAS Recidivism Algorithm” stated that “black defendants were far more likely than white defendants to be incorrectly judged to be at a higher risk of recidivism, while white defendants were more likely than black defendants to be incorrectly flagged as low risk.” Such statements, even if semantically accurate, must be interpreted with an understanding that, among all defendants, black defendants were in fact less likely than white defendants to be incorrectly identified as likely to recidivate and more likely than white defendants to be incorrectly identified as unlikely to recidivate. It is only among defendants who did not recidivate that blacks were more likely than whites to have been incorrectly identified as highly likely to recidivate, and only among defendants who did recidivate that whites were more likely than blacks to have been incorrectly identified as highly unlikely to recidivate.
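The following worked example uses purely hypothetical counts (they are not figures from the ProPublica data and are chosen only to make the conditioning point concrete), with Group A standing in for the group with the higher recidivism rate and Group B for the other group. It shows that a predictor can simultaneously produce a higher rate of incorrect high-risk flags for Group A among non-recidivists and a higher rate of incorrect low-risk flags for Group B among recidivists, while the comparisons computed over all defendants run in a different direction.

counts = {
    # (group, flagged_high_risk, recidivated): number of defendants (hypothetical)
    ("Group A", True,  True): 600, ("Group A", True,  False): 120,
    ("Group A", False, True): 200, ("Group A", False, False):  80,
    ("Group B", True,  True): 220, ("Group B", True,  False): 210,
    ("Group B", False, True): 180, ("Group B", False, False): 390,
}

for g in ("Group A", "Group B"):
    tp, fp = counts[(g, True, True)], counts[(g, True, False)]
    fn, tn = counts[(g, False, True)], counts[(g, False, False)]
    total = tp + fp + fn + tn
    print(g)
    print("  flagged high among those who did not recidivate:", round(fp / (fp + tn), 3))
    print("  flagged low among those who did recidivate:     ", round(fn / (fn + tp), 3))
    print("  incorrectly flagged high, among all defendants: ", round(fp / total, 3))
    print("  incorrectly flagged low, among all defendants:  ", round(fn / total, 3))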
This page will eventually address several issues regarding perceptions about algorithm fairness. One issue will involve the failure to understand the way cutoffs affect the size of relative differences in false positives and relative differences in false negatives. Observers commonly speak about reducing disparities caused by a predictor (sometimes cast in terms of reducing the unfairness of the predictor), as in the Open Letter. But those doing so universally fail to understand the way that altering a cutoff affects measures of disparity.
In particular, persons discussing the impact of policies on measures of racial disparity in criminal justice outcomes, including probably all signatories to the Open Letter, act according to the belief that reducing adverse outcomes will tend to reduce relative differences in rates of experiencing the outcomes. In fact, exactly the opposite is the case. While reducing an adverse outcome – by lowering the cutoff for the favorable outcome or otherwise – will tend to reduce relative differences in rates of avoiding the outcome (i.e., experiencing the corresponding favorable outcome), it will tend to increase relative differences in the adverse outcome itself, as discussed, for example, in “Usual, But Wholly Misunderstood, Effects of Policies on Measures of Racial Disparity Now Being Seen in Ferguson and the UK and Soon to Be Seen in Baltimore,” Federalist Society Blog (Dec. 4, 2019); “United States Exports Its Most Profound Ignorance About Racial Disparities to the United Kingdom,” Federalist Society Blog (Nov. 2, 2017); and other references discussed on the Recidivism Illustration page, as well as illustrated on the page itself.
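The arithmetic behind this tendency can be sketched with a simple numerical illustration (the half-standard-deviation gap between two normally distributed groups and the particular cutoffs are arbitrary assumptions made only for this sketch): as the cutoff defining the adverse outcome is lowered, the ratio of the two groups' adverse-outcome rates grows while the ratio of their favorable-outcome rates shrinks toward one.

from statistics import NormalDist

group_a = NormalDist(mu=-0.5, sigma=1.0)   # lower-scoring group
group_b = NormalDist(mu=0.0, sigma=1.0)    # higher-scoring group

for cutoff in (0.5, 0.0, -0.5, -1.0, -1.5):
    # Adverse outcome = falling below the cutoff; favorable outcome = meeting it.
    fail_a, fail_b = group_a.cdf(cutoff), group_b.cdf(cutoff)
    pass_a, pass_b = 1 - fail_a, 1 - fail_b
    print(f"cutoff {cutoff:+.1f}: adverse-rate ratio {fail_a / fail_b:.2f}, "
          f"favorable-rate ratio {pass_b / pass_a:.2f}")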
But just as observers universally fail to understand the way altering cutoffs will tend to affect relative differences in favorable and corresponding adverse outcomes pursuant to the predictor, they fail to understand the way cutoffs affect the size of relative differences in false positives or false negatives.
This page may also address the way the validity of the predictor affects relative differences in false positives and relative differences in false negatives. That is, the more valid the predictor, the smaller will be the numbers of false positives and false negatives. But the effect on relative differences in false positives and false negatives is another matter.
With regard to the first issue, I may have understood these patterns better at the time when I examined the subject with regard to the perceived unfairness of employment tests 30-plus years ago. But presently I am inclined to think that lowering a cutoff for the prediction of outcome X (by increasing the number of positives and thus the number of false positives, and reducing the number of negatives and thus the number of false negatives) will tend to reduce the relative difference in false positives but increase the relative difference in false negatives. In the recidivism context, this means that the more relaxed the standards for release (i.e., the higher the standard for incarceration), the higher will tend to be the relative difference in false positives and the smaller will tend to be the relative difference in false negatives.
The data underlying the ProPublica article, which would allow the generation of its various two-by-two tables according to different cut points, ought to allow one to determine whether the above surmise is correct or not.
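Below is a rough sketch of how such a check might be carried out with the publicly posted ProPublica data. The file name, column names, and race categories used here (compas-scores-two-years.csv; race, decile_score, two_year_recid; “African-American,” “Caucasian”) reflect my understanding of that data set and should be verified against the posted materials before anything is made of the output; the sketch simply recomputes false positive and false negative rates by race at a range of cut points and reports the ratios.

import pandas as pd

df = pd.read_csv("compas-scores-two-years.csv")
df = df[df["race"].isin(["African-American", "Caucasian"])]

for cut in range(2, 10):                            # treat decile_score >= cut as "high risk"
    rates = {}
    for race, grp in df.groupby("race"):
        high = grp["decile_score"] >= cut
        recid = grp["two_year_recid"] == 1
        rates[race] = (
            (high & ~recid).sum() / (~recid).sum(),  # false positive rate
            (~high & recid).sum() / recid.sum(),     # false negative rate
        )
    fpr_b, fnr_b = rates["African-American"]
    fpr_w, fnr_w = rates["Caucasian"]
    print(f"cut {cut}: FP-rate ratio {fpr_b / fpr_w:.2f}, FN-rate ratio {fnr_w / fnr_b:.2f}")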
With regard to the second issue, I am also inclined to think that the more accurate the predictor (and hence, at any given cutoff, the smaller the numbers of false positives and false negatives), the greater will tend to be both the relative difference in false positives and the relative difference in false negatives. Many statisticians should be capable of examining the accuracy of these surmises even if they do not yet understand the ways that altering the prevalence of an outcome tends to affect measures of racial disparity.
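A simulation along the following lines (again with arbitrary assumed distributions and a fixed, arbitrary cutoff) could be used to examine this second surmise; it varies how noisy the score is (that is, how valid the predictor is) and reports the ratio of the two groups' false positive rates and the ratio of their false negative rates at each noise level. Nothing about the direction of the results is asserted here; the point is only that the surmise is readily examinable.

import numpy as np

rng = np.random.default_rng(1)
n = 400_000
group = rng.integers(0, 2, n)                        # 1 = higher-risk group, 0 = other group
true_risk = rng.normal(np.where(group == 1, 0.5, -0.5), 1.0)
outcome = rng.random(n) < 1 / (1 + np.exp(-true_risk))

for noise in (2.0, 1.0, 0.5, 0.25):                  # smaller noise = more valid predictor
    score = true_risk + rng.normal(0.0, noise, n)
    positive = score > np.quantile(score, 0.7)       # top 30% flagged, a fixed cutoff
    ratios = []
    for err_mask, base_mask in ((positive, ~outcome), (~positive, outcome)):
        # Error rate for each group, conditioned on the relevant actual outcome.
        r1 = (err_mask & base_mask & (group == 1)).sum() / (base_mask & (group == 1)).sum()
        r0 = (err_mask & base_mask & (group == 0)).sum() / (base_mask & (group == 0)).sum()
        ratios.append(r1 / r0 if r1 >= r0 else r0 / r1)   # ratio of higher rate to lower rate
    print(f"noise {noise}: FP-rate ratio {ratios[0]:.2f}, FN-rate ratio {ratios[1]:.2f}")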
[i] I use the terms false positives and false negatives because they are employed in the literature. The instruments at issue actually predict varying likelihoods of experiencing an outcome and its opposite, and such predictions may be entirely correct. A chosen cut point determines what is deemed a positive or a negative and hence what is deemed a false positive or a false negative.
[iii] The above description of perceptions of algorithm fairness should be qualified somewhat in that recently it appears that some discussion of algorithm fairness focuses solely on the disparate impact of a predictor. An income and credit score requirement for a loan might be regarded as unfair simply because it disqualifies a larger proportion of black loan applicants than white loan applicants while correspondingly qualifying a larger proportion of white loan applicants than black loan applicants. It is true, however, that such requirements will tend to exhibit the same patterns described above, i.e., (a) overprediction of successful loan performance for blacks and underprediction for whites (with corresponding underprediction of unsuccessful loan performance for blacks and overprediction for whites), (b) higher false negative rates for blacks than whites (i.e., denial of a larger proportion of blacks than whites among applicants who would have performed successfully), and (c) higher false positive rates for whites than blacks (i.e., approval of a larger proportion of whites than blacks among applicants who would not perform successfully), with (b) and (c) being regarded as the indicators of algorithm unfairness.