Algorithmic systems are increasingly pervasive in our everyday lives, from screening job applicants to recommending content tailored to our interests to figuring out the fastest route for our commutes to work. However, with their widespread adoption comes the issue of algorithmic bias. Algorithmic bias, a longstanding issue in AI development, is the systematic distortion in the data used to build and train AI systems, or in the systems' development, that causes unjust outcomes for certain groups of people. Instances of algorithmic bias abound, from recruitment tools and ad-targeting systems to facial recognition technology and sentencing algorithms. This bias disproportionately impacts marginalized communities, such as Black, Indigenous, and other people of color, LGBTQIA+ communities, and women. These biases not only lead to discriminatory outcomes for users of these systems but can also perpetuate existing structural inequities. Data justice - which asserts that we have the right to choose if, when, how, under what conditions, and to what ends we are represented in datasets - can provide a meaningful framework for detecting and addressing algorithmic discrimination.
The Current State of Algorithmic Fairness Assessment
Organizations often collect demographic information from users to address bias in AI systems and assess system performance across identity groups. Consider a scenario where a company implements an algorithmic system to screen job resumes, aiming to identify the strongest candidates. To detect any bias, the company collects demographic data from job applicants to assess the system’s performance across different groups of people. The company could then use these results to inform a fairness intervention if they discover algorithmic bias in the system. This might include adjusting components of the system, retraining the system, further investigating the sources of bias, or choosing to discard the system altogether.
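To make this concrete, here is a minimal sketch of what such a disaggregated assessment might look like: comparing the screening tool's selection rates across self-reported demographic groups. The record format, field names, and choice of metric are illustrative assumptions for this example, not part of any specific company's process or of the Guidelines described below.

```python
from collections import defaultdict

def selection_rates(records, group_key="gender", decision_key="screened_in"):
    """Share of positive screening decisions per self-reported group."""
    counts = defaultdict(lambda: {"selected": 0, "total": 0})
    for row in records:
        group = row.get(group_key) or "undisclosed"  # respect non-disclosure
        counts[group]["total"] += 1
        counts[group]["selected"] += int(bool(row.get(decision_key)))
    return {g: c["selected"] / c["total"] for g, c in counts.items()}

def disparate_impact_ratios(rates, reference_group):
    """Each group's selection rate relative to a reference group's rate."""
    ref = rates[reference_group]
    return {g: rate / ref for g, rate in rates.items()}

# Hypothetical applicant records with optional, self-reported demographics.
applicants = [
    {"gender": "woman", "screened_in": True},
    {"gender": "woman", "screened_in": False},
    {"gender": "man", "screened_in": True},
    {"gender": None, "screened_in": True},  # applicant chose not to disclose
]

rates = selection_rates(applicants)
print(rates)                                  # selection rate per group
print(disparate_impact_ratios(rates, "man"))  # ratios relative to the 'man' group
```

In practice, a ratio well below 1 for any group (the "four-fifths rule" is one common heuristic in US employment contexts) would flag the system for the kind of closer investigation or intervention described above, rather than serve as a verdict on its own.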
However, collecting sensitive demographic data from users can cause additional harm, particularly to groups already impacted by algorithmic discrimination. The company's demographic data collection process could exclude trans or gender non-conforming applicants by only providing binary gender options, hiding any biased outcomes these applicants might face. The company could also wrongfully repurpose the demographic dataset collected for the fairness assessment to inform job ad-targeting efforts, even though participants only consented to the use of their data for the defined assessment. Finally, the company could fail to securely store the dataset and expose sensitive information (such as data related to sexuality) to leaks, putting applicants at risk of harm or violence. These harms highlight a fundamental tension in the development of these systems: the apparent need to collect demographic data to address algorithmic discrimination conflicts with the imperative to prevent the harms that can stem from that very collection.
Given the lack of formal regulation in this space, organizations are often left to create their own internal fairness testing practices and policies, leading to inconsistent and frequently inequitable approaches to bias testing. These internal practices and policies are where I've focused my work. I found that teams primarily turn to quantitative bias testing using demographic data (similar to that outlined in the previous example). But while this data is collected for a good reason (to prevent bias), it can end up creating additional harms, such as the expansion of surveillance infrastructure and the misrepresentation of identity groups. These harms usually impact the same communities most likely to experience algorithmic discrimination.
Taking Action to Ensure We Build Fairer AI Systems
Clearly, there is a need to establish a normative framework for collecting the sensitive data that underpins the ever-growing use of AI. To address this challenge, my team and I have developed the Participatory & Inclusive Demographic Data Guidelines. The Guidelines aim to provide AI developers and teams within technology companies, as well as other data practitioners, with guidance on collecting and using demographic data for fairness assessments in ways that advance the needs of data subjects and communities, particularly those most at risk of harm from algorithmic systems.
Our Guidelines are organized around the demographic data lifecycle. For each stage, we identify the key risks faced by data subjects and communities (especially marginalized groups), baseline requirements and recommended practices that organizations should undertake to prevent these risks, and guiding questions that organizations can use to achieve the recommended practices.
Central to the realization of fair and accountable AI systems is the concept of data justice, which asserts that people have a right to choose if, when, how, under what circumstances, and to what ends they are represented in a dataset. To uphold data justice, we outlined four fundamental principles:
Prioritize the Right to Self-Identification: Organizations should empower data subjects and communities to choose how their identities are represented during data collection.
Co-Define Fairness: Organizations should work with data subjects and communities to understand their expectations of ‘fairness’ when collecting demographic data.
Implement Affirmative, Informed, Accessible, & Ongoing Consent: Organizations must design consent processes that are clear, approachable, and accessible to data subjects, particularly those most at risk of harm from the algorithmic system.
Promote Equity-Based Analysis: Organizations should focus on the needs and risks of the groups most at risk of harm from the algorithmic system throughout the demographic data lifecycle.
We envision a world where everyone has agency over their representation in a dataset, including the datasets used to address algorithmic discrimination. Data can provide a richer and more inclusive picture of the world and help build relevant systems and services for everyone. However, we need the right guardrails in the development of these tools to make that possible. We are now conducting a public comment period to ensure that this resource reflects the needs of the diverse stakeholders impacted by demographic data collection practices. And we want to hear from you! Please submit a comment below to inform the Guidelines.