This is an excerpt from my upcoming paper "From Open Data to Data Justice," which I'll present at the Midwest Political Science Association Conference in Chicago on April 13. Please join me if you're there.
Update: The full paper is available at both the MPSA conference program link above (which will only be available to members after the conference) and my SSRN archive, which is publicly available and will have the most up-to-date version of the paper.
Data is not reality. It is rather a construct, an operationalization of an actor’s concept and reality, interpreting between the physical world and the intellectual structures by which actors understand that world. GIS shapefiles, for instance, operationalize a relationship between physically existing land and legally existing property; interaction with a census taker operationalizes a relationship between bodies and citizens. A key implication of the constructivist understanding of data is that, for all of the celebration of (and weeping and gnashing of teeth over) the purported ubiquity of data collection and data as the “detritus” of human life in contemporary affluent societies, data does not, in fact, simply happen. The constructed nature of data makes it quite possible for injustices to be embedded in the data itself. Whether by design or as unintended consequences, the process of constructing data builds social values and patterns of privilege into the data. Where those values and privileges are unjust, the injustice is then a characteristic of the data itself; no amount of openness can remedy such injustices, just as no amount of statistical processing can undo inaccuracies in the original data. “Garbage in, garbage out” is a central concept in data ethics.
Datized moments occur most often in the interaction of an individual with a bureaucratic organization such as the state or a business. But people and groups differ in their propensity to interact with such organizations. This difference provides an important point by which privilege can enter into data. Data over-represents some people, and where that over-representation parallels existing structures of social privilege, it over-represents those already privileged and under-represents those less likely to be part of data-producing interactions.
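The arithmetic behind this over-representation can be made concrete with a minimal sketch. Everything in it is a hypothetical illustration rather than data from the paper: the two groups, their sizes, their interaction rates, and the function name `observed_shares` are all assumptions chosen only to show how unequal propensities to interact change who appears in the resulting records.

```python
def observed_shares(population, interaction_rate):
    """Return each group's expected share of the records a collector would hold.

    population: group name -> number of people (hypothetical figures)
    interaction_rate: group name -> probability that a member takes part
        in a data-producing interaction (hypothetical figures)
    """
    expected_records = {
        group: population[group] * interaction_rate[group]
        for group in population
    }
    total_records = sum(expected_records.values())
    return {group: n / total_records for group, n in expected_records.items()}


# Hypothetical example: two groups of equal size, unequal propensity to interact.
population = {"more_privileged": 500, "less_privileged": 500}
interaction_rate = {"more_privileged": 0.9, "less_privileged": 0.6}

shares = observed_shares(population, interaction_rate)

# Compare each group's share of the data with its share of the population.
total_population = sum(population.values())
representation_ratio = {
    group: shares[group] / (population[group] / total_population)
    for group in population
}

for group in population:
    print(group, round(shares[group], 3), round(representation_ratio[group], 3))
```

Under these assumed numbers, each group is half of the population, yet the group that interacts more often supplies about 60 percent of the records and the other about 40 percent. Nothing in the records themselves reveals the gap; it becomes visible only when the data is compared with an independent count of the population.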
Interactions with the state are rife with disparities that reflect social privilege. One well-studied example is the undercount of the decennial United States Census. Since the problem of undercounting was first quantified in the mid-twentieth century, black and Hispanic households have been undercounted at higher rates than non-black households. The causes of this undercount are myriad:
Households are not missed in the census because they are black or Hispanic. They are missed where the Census Bureau’s address file has errors; where the household is made up of unrelated persons; where household members are seldom at home; where there is a low sense of civic responsibility and perhaps an active distrust of the government; where occupants have lived but a short time and will move again; where English is not spoken; where community ties are not strong. (Prewitt 2010, 245)
Two commonalities in these explanations are striking: the extent to which these causes are barriers to interaction with census takers, and the extent to which they are correlated with racial and class privilege. The latter causes the undercount to disproportionately affect disadvantaged groups (hence, Prewitt argues, the focus on race in debates over census methodology between 1980 and 2000), while the former prevents those groups from being represented accurately in Census data. Similar problems exist in collecting any data on groups such as the homeless. Groups might also be disproportionately willing to participate in some interactions over others, such as differences in thresholds for reporting building code violations between the affluent and poor.
Such privileges are not confined to interactions with the state. Residential segregation especially is often tied to forms of institutional discrimination that would influence how often individuals interact with bureaucracies. Zenk et al. (2005) found that low-income, predominantly African American neighborhoods in Detroit were, on average, 1.1 miles further from a supermarket than predominantly white neighborhoods with similar incomes, with consequently increased dependence on smaller food stores such as convenience stores or groceries. Similarly, Cohen-Cole (2011) argues that consumer credit discrimination based on the racial composition of applicants’ neighborhoods is linked to increased use of payday loans. In both cases, the use of less bureaucratized businesses by groups already suffering from discrimination in the form of de facto residential segregation (either as the legacy of formal segregation or because of ongoing discrimination) results in data that is statistically biased against such populations and reinforces whites’ privileged position. Businesses can analyze the needs of the (disproportionately white) customers with whom they interact and adapt accordingly; benefits thus accrue to the beneficiaries of social privilege.
Transforming information about a datized moment into data is equally problematic. Only some of the information about that moment will be datized. Which information is datized is not a natural consequence of the interaction but a design choice on the part of the data architects that reflects their purposes, resources, and values. An institutional survey director noted to me that survey data at the institution is subject to state open records laws and sometimes requested by the public and state legislators. As a result, the survey director encouraged the practice of not collecting data that the institution would not be comfortable making public. In this case the concern was privacy, but this reasoning is at least as likely when more self-interested motives are present. Regardless of the motivation, though, such decisions are value-laden; thus the data built on such decisions will embody those values and transmit them in the process of using the resulting data.
Less conscious assumptions, such as those embedded in worldviews shaped by social privilege, will also shape such decisions, and are likely to be less amenable to challenge to the extent that a lack of diversity among the data collectors renders them invisible. Higher education “net price calculators” are a case in point. Such tools, which the federal government requires all institutions receiving Title IV aid to produce, are designed to help students and their families estimate the likely cost of attending an institution given the prevalence of “high-tuition, high-aid” business models. This assumes that the net price is what is important to students. But Sara Goldrick-Rab argues that the gap in applications to elite colleges between high-achieving, high-income and high-achieving, low-income students reported by Hoxby and Avery is rooted in “sticker shock” at the high gross price of such institutions among low-income families in spite of the institutions’ often much lower net prices. Their disregard of net price is in part a lack of information, but more significantly a consequence of such families’ lack of trust in institutions generally and substantially higher risk to such families if educational institutions fail to maintain the initial promises of aid, conditions that make the net price of the institutions less credible: “Being told that a college is likely to give you aid is not the same thing as getting the aid, [emphasis in original]” she writes. Such students choose to apply at less expensive (and consequently less selective) institutions as they present less risk to themselves and their families.
If Goldrick-Rab is correct, the credibility that the middle class finds in state and social institutions that have generally protected their interests should be seen as underlying the decision to collect and report average aid amounts that do not vary by income: middle class families can credibly take average aid as typical of people like them; low-income families cannot. One might expect the same to be true of first-generation students. With family members unfamiliar with the operations of universities, they will often be unaware of issues such as net price, or may not understand the financial aid process at all. Yet this background knowledge, like the credibility of a measure, is assumed in the selection of data to be collected. Those privileged with such knowledge find their privileges reinforced by this data; those who are not so privileged are further disadvantaged when they cannot see the data as meaningful.
Adding to this the question of how that information is stored increases the complexity of the issue. Key features in the problematic Bhoomi experience with open data were not only the selection of only certain types of documentation for inclusion in the land title data but also the decision to store the resulting data in a relational database system. These aspects of the system design effectively precluded informal knowledge from being part of the open data system; such knowledge, which was the basis of the existing land claims of Dalits, cannot be queried by the systems used to . The two features both inform and reinforce each other: excluding narratives and other unstructured data obviates the need for systems that can handle unstructured data such as those using text-analytics or Unstructured Information Management Architecture (UIMA), while the choice of a relational database precludes the use of narrative information. Donovan (2012) cites this as an instance of James Scott’s (1998) “seeing like a state” in which the local government sought to simplify society by making it legible. The open data system incorporated this value in its choice of what to datize about the moment in which land was transferred. This incorporated a value structure into the data, one that is clearly not neutral in the competition for power.
Because of the myriad ways that social privilege can become embedded in data sets, open data cannot be expected to universally promote justice. It can just as easily marginalize groups that are not part of the data: people whose lack of privilege excludes them from the kinds of interactions that produce data and makes their viewpoints invisible to those who collect data. Opening datasets composed of such data simply propagates the injustices that came into the data as it was collected. Whatever steps are taken to promote fairness in using data that is at its root unjust, the result will almost inevitably be unjust as well. Data is very much a case of “Injustice in, injustice out.”