25 February 2014

Data as Reality Made Legible

A fragment written for an upcoming paper on the construction of student data in higher education.

The ubiquity of data in contemporary society hides its peculiarity. Data is a very specific form of information, one in which the subject is broken down atomistically, measured precisely (in the sense of being measured to quite specific standards that may or may not involve a high level of quantitative precision), and represented consistently so that it can be compared to and aggregated with other cases. That this form of knowledge is more common in highly structured institutions and rose to ubiquity with the modern, bureaucratic state and the capitalist enterprise should surprise no one. Creating data should be regarded as a social process in which reality is made legible to the authorities of an institutional structure.


10 January 2014

On (Failed) Satire and Shitty Frat Boy Humor

I have a satire problem. Not a problem with satire, mind you. My Twitter avatar has a poster from Dr. Strangelove in the background, in German, no less. Clearly I appreciate good satire. It’s actually the opposite problem: I see satire whether it is there or not.

Part of that is simply due to the fact that I want to see satire. I really do believe that it is one of the most effective tools against the powerful. So much of power is the belief among the powerless in power’s legitimacy. Satire undermines that legitimacy, and lays bare the exercise of authority and force.

Contrast, for example, Dr. Strangelove with its contemporary, Fail-Safe. As frightening as the scenario is, as brilliant as Henry Fonda is as the President, as compelling a case for nuclear disarmament as it made, Fail-Safe still legitimates the President, the military leadership, and intellectuals as the ones to decide on the policy issues. That legitimates the decision itself even when those leaders reach the opposite conclusion. “Gentlemen, Gentlemen! You can’t fight in here! This is the War Room!” undermined any claim the Washington establishment had that its decision was the right one.

What passes for satire gets dicey, though, because the subject of the satire and its object are potentially different. When the subject of a joke is the oppressed, the challenge is to identify whether the object is the oppressor or the oppressed. In the former case, satire retains its power for the weak, as in Mel Brooks’ treatment of racism in Blazing Saddles. In the latter case, the joke is not a satirical tool against the powerful but rather a way of perpetuating domination. Identifying whether a joke is satirical or oppressive is a true challenge, important as it is difficult.

Last night, for example, Rhonda Ragsdale retweeted:
First, a bit of advice: if @profragsdale is calling something out, she’s probably right; save yourself the trouble, take her word for it, and learn something.

16 September 2013

Rendering Students Legible: Translation Processes in Private, Institutional, and Government Higher Educational Data Systems

This is an extended version of the proposal I submitted to the 2014 WPSA conference today. It builds on, and contributes to, the general ideas of my information justice paper from the 2013 MPSA conference.
To say that data is constructed seems trivially true; data architecture is a basic competency of data scientists and database design a core concern of information technology staffs. But embedded in this conventional view is an understanding of data consistent with scientific realism: the data that is available to us may be limited, but it nonetheless objectively represents the reality of a datized moment, interaction, or condition. In this view, the choices made in data architecture have at most minor substantive influence on the data itself.

This paper subjects educational data to a constructivist analysis that goes much further than the mechanics of database design. Students’ interactions with the institution, the state, and private actors present a problem of legibility for those actors, which is solved by datizing those interactions through a series of translations in the data process. The information that student databases contain is structured primarily by transformations in the data process and not the datized moment itself. I show how data is substantively constructed through its collection and management by conducting structural analyses of the data systems commonly used at Utah Valley University: Canvas, Banner, the Utah System of Higher Education reporting process, and IPEDS. I describe a series of characteristic transformations that take place in the collection, storage, and retrieval of data in these systems:
  1. from relevance to existence,
  2. from contingent to essential,
  3. from narrative to nominal,
  4. from complex to categorical (or plural to monistic), and
  5. from diversity to normalcy.
These transformations reduce the many ways of understanding reality to a single interpretation embodied in the data. Though reality both provides inputs into the transformational process and constrains the ways that the process can transform those inputs into outputs, the choices made in developing data processes make impossible a database in which there is a one-to-one correspondence between reality and data. The information output is underdetermined by the datized moment itself, resulting in a Rashomon-like one-to-many correspondence that is reduced to a single output only with the conclusion of the data process.

This is more than merely suggesting that there are errors or biases in the data. In a realist view of data, errors and biases can be corrected by validating the data against itself or the reality it purports to represent. But the self-correcting process of scientific realism cannot do so within the kinds of transformations described in this paper. The transformations are consistent with reality because they follow the rules chosen for a specific data process. Actions are consistent with the conclusions drawn from that data because the data process has legitimized that data as the only acceptable representation of reality; all else can be dismissed as anecdote. Constructed data can only be challenged in confrontation with other constructions that may have been possible given alternative data processes.

Policymakers can gain much from understanding the transformations that govern the data that they use. Understanding them helps, first and foremost, determine the extent to which a data point--or combination of points--is an appropriate operationalization of a concept of concern, for example, when analyzing along a combined race-gender classification is more useful than analyzing by race and then gender sequentially. Such understanding also makes us aware of blind spots in our data where deviation or perceived irrelevance has been excluded. Finally, understanding the constructive nature of data is an antidote to a number of scientistic fallacies that can undermine data analysis and action.

But the deeper concern is for the injustices that can enter higher education through these translations. These translations exercise and distribute power between students and institutions, between institutions and the state, and among institutions by creating privileged representations and embedding both new and existing such representations in the data that is used in decisionmaking.

07 June 2013

Constraining the NSA with Deweyan "Expansive Privacy"

I can't say that I'm surprised about the NSA revelations. Pissed, yes, but when they have to build a new power plant in your area to power the NSA data center that I go past on the way to work every day, you know there's a lot of data being collected.

The constitutional implications of this are undoubtedly alarming. Some are arguing that this is a massive constitutional overreach, a total disregard of the Fourth Amendment. I'm worried that they're wrong, though. I'm worried that NSA has simply embraced the logical implications of relying on pre-Internet privacy concepts to govern data, with the consequence that they've broken the Fourth Amendment's protections for the political process altogether. But I think John Dewey can fix it. In part because Dewey can fix anything, but mostly because his concept of the public-private distinction adapts well to the kinds of communication at the heart of the NSA mess.

Note: What follows is really speculative and not really researched. I'm relying on memory of past work, especially on the Dewey stuff. If you see anything really inane in this please bring it to my attention now before I shoot myself in the ass in a venue that is less speculative.

The Fourth Amendment establishes the fundamental constitutional constraints on the collection of data by the government, establishing that "The right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated." In practice this has been held primarily to apply to criminal justice matters; it was not explicitly an issue, for instance, in the controversy over the data collected by the US Census Bureau even when that collection was challenged as intrusive. With regard to searches, the operational standard established in Katz v. United States (1967) is that a search has occurred when there is a reasonable expectation of privacy with regard to the object of the search.

Therein lies the problem with regard to NSA collection of both telephone metadata and PRISM data. There is no reasonable expectation of privacy with regard to telephone metadata because that metadata must be shared with the service provider to make the telephone call. The same is true of packet metadata in Internet communications. In the case of PRISM data, the expectation of privacy is even less reasonable. This is information that we are deliberately sharing with others, at the least intentionally sharing with the service provider (as opposed to incidentally, as in the case of metadata) and quite often sharing it with the service provider for the purpose of sharing with other users of the service. And in both cases, a reasonably well informed person should know that the service providers are using the data for their own purposes as well as ours. An expectation of privacy is (at least in the empirical sense, i.e., that the data will be treated as private regardless of one's expectation of whether it ought to be) entirely unreasonable. Far from an abrogation of constitutional principles, the NSA revelations should be seen as the logical conclusion of our existing understanding of what constitutes an unreasonable search.

Yet something doesn't sit right about this. Senate Majority Leader Harry Reid's advice that "Everyone should just calm down and understand this isn't anything that is brand new" was treated not as the stately voice of reason but as the imperious voice of deception, and rightly so. The constant criticism of Facebook for violating its users' privacy is but one of the more prominent examples of a broad concern with abuse of the data users give service providers. While it is analytically unreasonable to see a Facebook status update that is restricted to one's friends as private, I think it unrealistic to say that society is unprepared to see such an expectation as reasonable, and it is society's willingness to accept the reasonableness of the expectation rather than a lawyer's or academic's analysis that is the standard established in Katz.

Even if society were unwilling to accept this expectation of privacy as reasonable (and the criticism of millennials as prone to oversharing via social media suggests that at least some members of society are), taking the Katz standard to its apparent conclusion makes the Fourth Amendment essentially useless. The overarching purpose of our protections against unreasonable searches is to protect political participation from abuse of the criminal process by the state. If a police officer must show probable cause before searching the home of a government critic, then the government can't use the police to silence criticism by fishing for trivial crimes. But if the major means of communication and document storage are no longer private, the Fourth Amendment can't do that.

So what is needed is some way of rethinking when we have been searched that is meaningful for electronic communication and data storage. One consideration that I've seen mentioned (but not substantiated) in this debate is the idea that a search doesn't take place until the data is queried rather than when it is collected, implied by the procedures outlined in Director of National Intelligence James Clapper's statement. I don't think this is adequate. A constitutional provision prohibiting unreasonable searches works to protect participation not only by keeping the police from acting beyond their authority but also by making sure the people know this. When potential participants know that the data is there waiting to be queried, that's likely to have a chilling effect on participation (especially when they are also aware that the court protecting their data from being queried didn't reject a single request last year). The data collection establishes a disciplinary process that keeps citizens within the bounds of criticism that the state is willing to permit.

My thought is instead to look to John Dewey's understanding of the public-private separation. Traditionally that's been seen as a categorical distinction: either something is entirely public or entirely private, with the categorization based on the subject matter. Dewey, though, rethought that in The Public and Its Problems. For him, the private was the realm of action in which only the participants experienced the consequences of the action; once the consequences went beyond the participants the matter was public: the distinction was not substantive but pragmatic (in the technical sense of connected to actions and their consequences).

That breaks down the categorical aspects of the distinction as well. Actors can bring the affected third-parties into the action, for example by allowing them to participate in its development or by compensating them for the effects of their action. This in some sense privatizes the action by eliminating third-parties; everyone affected is a participant, and no one outside the action has any standing to seek the intervention of the state (assuming, of course, that the incorporated parties have no reason to challenge the terms of their inclusion). This essentially creates a kind of "limited publicity": matters in which some but not all members of society have a right to participate based on the consequences of the action on them. And the counterpart of limited publicity is a notion of "expanded privacy." Once all of those members are participating fairly, the matter is private with regard to other members of society. Those other members would have no right to participate in the action. Action is thus public with regard to those who suffer its consequences and private beyond that point.

I think this gives us leverage on why the data collected by NSA appears simultaneously public and private. Ultimately my status update is a communication between me and the friends I include in its privacy settings. (No, boss, you don't see all of my posts even though you friended me. Live with it.) Facebook delivers that communication and as such is part of the process; my update is public with regard to Facebook. But beyond Facebook and my friends list, that communication is private. I have a reasonable expectation that it will not, as Helen Nissenbaum argues, be used out of the context in which it was communicated, for example as an endorsement of some advertiser's product. Such a realm of privacy would almost always exclude the state, protecting citizens from the mess that the return of the imperial presidency has created.

05 June 2013

"Charge of the AP readers"

Next week I'll be scoring the free-response questions on the AP US Government exam. A friend who shall remain nameless penned this tribute to the roughly 600 people who will spend eight hours a day for six days reading roughly one million essays.

Half a league, half a league,
Half a league onward,
All in the valley of Death
Rode the six hundred.
"Forward, Readers!
"Charge for the exams!" he said:
Into the valley of Death
Rode the six hundred.

"Forward, Readers!"
Was there a teacher dismay'd?
Not tho' the instructor knew
Students had blunder'd:
Theirs not to make reply,
Theirs not to reason why,
Theirs but to read and die:
Into the valley of Death
Rode the six hundred.

Exams to right of them,
Exams to left of them,
Exams in front of them
Written and drawn;
Storm'd at with zeros and dashes,
Boldly they rode and well,
Into the jaws of Death,
Into the mouth of Hell
Rode the six hundred.

Flash'd all their pencils bare,
Flash'd as they scored from chair,
Sabring the students there,
Charging an army, while
All the world wonder'd:
Plunged toward the mountain o’exams
Right thro' the line they broke;
New seniors and college freshmen
Reel'd from the pencil strokes
Shatter'd and sunder'd.
Then they rode back, but not
Not the six hundred.

Exams to right of them,
Exams to left of them,
Exams behind them
Written and drawn;
Storm'd at with zeros and dashes,
While TL and Reader fell,
They that had read so well
Came thro' the jaws of Death
Back from the mouth of Hell,
All that was left of them,
Left of six hundred.

When can their glory fade?
O the wild scoring they made!
All the world wondered.
Honor the charge they made,
Honor the AP readers,
Noble six hundred.

25 April 2013

Competency-Based Grading in Introductory Political Science Courses

Last night’s discussion on #fycchat was, appropriately for this time of year, about grading. I briefly mentioned, in response to a thread between Lee Skallerup and Jessica Nastral about grading anxiety, that I have experimented with a system of competency-based grading in my introductory political science courses. That generated some interest, so I’ll elaborate here.

The approach is an alternative to the traditional accumulation-of-points approach. That, it seems to me, assumes that the difference between an A student and a C student is knowing 20% more content. That, to me, never really made sense; it seems more like an A student should be able to do things a C student can’t. Accumulation of points reduces learning to mastery of the lowest levels of Bloom’s cognitive domain: knowledge and comprehension. Especially when I have more knowledge in my pocket than my college professors had in their heads collectively, learning should be about moving students to higher levels of the cognitive domain, teaching them to apply, analyze, synthesize, and evaluate.

The competency-based approach that I use starts from the course objectives rather than the assignments (which, if assignments are about assessing learning, every course should do). The objectives are framed in two dimensions: course content and course skills. The course skills are based on being able to use content at increasingly higher levels of Bloom’s cognitive domain. From my Introduction to Comparative Politics Syllabus:
Core Concepts. Upon successful completion of this course, the student will be able to demonstrate comprehension of and the ability to apply, analyze, and evaluate the following:
  • The basic methods of reasoning and analysis in comparative politics
  • The ideas through which comparative politics understands states, societies, and economies across regime types. 
  • The practices and structures that differentiate democratic and non-democratic regimes. 
  • The processes of political development and revolution.
  • The politics and recent political history of one Arab country in the Middle East or North Africa. 
Course Competencies. Students who satisfactorily complete this course will demonstrate the following skills with regard to the core concepts studied:
  1. Professionalism in the performance of their duties. 
  2. Comprehension of the core concepts of the course. 
  3. The ability to apply those core concepts such that they can understand, give explanations for, and develop responses to political practices, situations, and outcomes in national politics across different types of political systems. 
  4. The ability to analyze, synthesize, and evaluate those core concepts, both in themselves and in practice, such that they can add original material to those concepts.
I then design assignments to assess a specific competency. Comprehension is assessed with simple, open-book reading quizzes; I’m not really testing whether the students know content off the top of their heads but whether they can understand content that they find in whatever resources they’re using. That especially makes sense in comparative politics; the odds that students will ever need to remember the legislative process in France are slim but the odds that they’ll need to look up the legislative process in a foreign country and understand what they find are more substantial. The other assignments test progressively more demanding competencies. All are graded on a satisfactory/unsatisfactory/failing basis.
Assignments. All students will complete the following assignments:
  • Professionalism. Students must complete all assignments in good faith, on time, and in compliance with the ethical standards of scholarship in order to demonstrate mastery of competency 1. Submission of work that does not demonstrate a good faith effort to complete the assignment as required or that includes undocumented outside sources (whether or not in violation of academic conduct policies) in an assignment will constitute failure to demonstrate mastery of this competency. 
  • Quizzes. For each unit of the course, there will be an online quiz of 10 questions. The quiz is based strictly on the readings for the course. It may be completed at any time before the date specified below, and is open-book. A satisfactory score is eight correct answers. Satisfactory completion of quizzes demonstrates mastery of competency 2. 
  • Essay Exams. Students will complete two out-of-class essay exams. Each essay will require students to explain a concept studied in the course and apply that concept to explain or predict the outcome of a political case. The concept and the case will be defined in the question, and material about the case will accompany the question. A satisfactory essay will adequately explain the concept using course material and apply it to make an effective explanation of the case. A failing essay is one that does not reflect a good faith effort to complete the requirements of the assignment on time. Satisfactory completion of essays demonstrates mastery of competency 3. 
  • Country Study. Students will complete one country study as part of the larger class project examining the Arab Spring. The project will require students to explain a concept studied in the course, identify a hypothesis following from the concept regarding how the Arab Spring would be expected to progress in their chosen country of expertise, and determine the extent to which the course of the Arab Spring in that country supports the hypothesis. The paper will require outside research. Satisfactory completion of the country study demonstrates mastery of competency 3. 
  • Group Paper. Building on the country studies, students will, in groups, prepare a paper and class presentation developing a general theory explaining why the Arab Spring took different courses in different countries and testing that theory with respect to their countries of expertise. The paper is expected to be a single, coherent essay and not a collection of separate pieces. Students will receive a common grade for their entire group. Satisfactory completion of the group paper demonstrates mastery of competency 4. 
This translates into a straightforward grading system. Satisfactory completion of assignments allows a clear determination of achievement at each course competency. Each successively more demanding course competency is associated with a higher grade. Course grades thus indicate the ability to use the material in higher levels of the cognitive domain:
Course Grades. Grades will be assigned based on demonstrated mastery of competencies as follows: 
A. Student has demonstrated mastery of all competencies by, in addition to meeting all requirements for a B grade, receiving a satisfactory grade on the group paper and presentation. 
B. Student has demonstrated mastery of competencies 1-3 by, in addition to meeting all requirements for a C grade, receiving a satisfactory grade on both essay exams and the country study. 
C. Student has demonstrated mastery of competencies 1 and 2 by, in addition to meeting all requirements for a D grade, passing all quizzes. 
D. Student has demonstrated mastery of competency 1 and minimal mastery of competency 2 by passing six quizzes and receiving at least an unsatisfactory grade on all written assignments.
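
The grade rules above amount to a ladder: each grade requires everything below it plus one more competency. A minimal sketch of that logic in Python follows. The names (Mark, Record, course_grade) are hypothetical illustrations rather than anything from the syllabus, and returning None for a student who does not meet the D requirements is an assumption, since the excerpt does not say what grade falls below a D.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Mark(Enum):
    SATISFACTORY = "S"
    UNSATISFACTORY = "U"
    FAILING = "F"


@dataclass
class Record:
    """Hypothetical per-student record; field names are illustrative."""
    professional: bool      # competency 1: good faith, on time, ethical
    quiz_scores: List[int]  # correct answers out of 10, one quiz per unit
    essays: List[Mark]      # the two essay exams
    country_study: Mark
    group_paper: Mark


QUIZ_PASS = 8          # "A satisfactory score is eight correct answers"
MIN_QUIZZES_FOR_D = 6  # a D requires passing six quizzes


def course_grade(r: Record) -> Optional[str]:
    """Return "A".."D", or None below D (an assumption, see lead-in)."""
    written = r.essays + [r.country_study, r.group_paper]
    passed = sum(score >= QUIZ_PASS for score in r.quiz_scores)

    # D: professionalism, six passed quizzes, and no failing written work
    # ("at least an unsatisfactory grade on all written assignments").
    meets_d = (
        r.professional
        and passed >= MIN_QUIZZES_FOR_D
        and all(m is not Mark.FAILING for m in written)
    )
    if not meets_d:
        return None
    # C: D requirements plus passing every quiz.
    if passed < len(r.quiz_scores):
        return "D"
    # B: C requirements plus satisfactory essays and country study.
    applied = r.essays + [r.country_study]
    if any(m is not Mark.SATISFACTORY for m in applied):
        return "C"
    # A: B requirements plus a satisfactory group paper.
    if r.group_paper is not Mark.SATISFACTORY:
        return "B"
    return "A"
```

Because each step presupposes the one below it, a satisfactory group paper cannot lift a student past a C when one essay is unsatisfactory. That is the sense in which the grade reports the highest competency fully demonstrated rather than a running total.
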
So far I’ve had positive feedback on this system, though I have no systematic evidence that students find it more useful. There is some confusion at first due to unfamiliarity, I think, but as students catch on they like the idea that their grades actually mean something concrete and don’t hinge on marginal differences in points. They’ve also said that they focus more on the big picture of both readings and assignments rather than on details that might shift their grade a few points. It does seem to me that students have done a better job of writing to the question rather than on the topic generally when I’ve used this; I don’t know if that’s because they are focused on meeting the standard rather than maximizing points by putting everything they know about the topic on the page, but that seems a reasonable hypothesis.

Some students have said that they didn’t put as much effort into assignments dealing with higher competencies because they’d be happy with an unsatisfactory score and a B or C in the course. But that’s fine with me. Students need to learn to prioritize their efforts, and that prioritizing comes with accepting less impressive outcomes on lower priorities. I think this helps them do that.

At the same time it makes my life easier when grading. I don’t need to think about whether an assignment is an 85 or an 88, only whether it constitutes a good-faith effort and meets the standard I defined in the syllabus. I can grade much faster that way (especially at the end of the term when students aren’t concerned with comments) and I can direct my comments to the interesting issues rather than to justifying every point not awarded. I also don’t have to worry about haggling for a couple of marginal points: if the assignment is unsatisfactory, there’s a pretty clear reason for it.

I also allow revision and resubmission of assignments. Partially that’s because every assignment would be make-or-break if I didn’t, but mostly because I believe that revision is the best tool for learning from an assignment. The competency-based system, with clear standards for each grade, makes that more workable. I can focus my detailed comments on the students who will actually put them to good use, discussing the assignments personally with the students who intend to revise and directing them to the standard rather than the minutiae. The revision process gives all students an incentive to take the comments seriously since they can make a major improvement in their grades on that assignment rather than making a marginal improvement or waiting for the next assignment and trying to generalize comments from previous work that they may not have really understood.

One thing that I think has to be done to make this work is to have rigorous standards. This could very much lend itself to grade inflation if your standard for an A is something that you think everyone should meet. I haven’t tried this in an upper-level course yet, but this would be especially so there. I could even be talked into dropping everything down a grade for courses at that level: comprehension alone is a D, application is necessary for a C, analysis and synthesis gets a B, and a serious critical evaluation of ideas is necessary for an A. That said, I think the points approach doesn’t avoid this problem; it only hides it behind the idea that each point not awarded is a deduction from 100% due to some problem, making mere satisfaction rather than excellence the standard.

Of course I only have anecdotal evidence (or, if you prefer, locally-sourced, artisanal data) that this works. I’d love to see others’ experiences with anything like this, and especially some actual research on it (though that presents a nightmare of a control problem, to be sure). I have heard of it being used in some other disciplines, primarily ones where there are relatively clear professional competencies such as education, accounting, or nursing. But I think this has good potential in the social sciences and humanities as well. Let me know if you attempt something like this.

21 March 2013

Injustice In, Injustice Out: Social Privilege in the Creation of Data

This is an excerpt from my upcoming paper "From Open Data to Data Justice," which I'll present at the Midwest Political Science Association Conference in Chicago on April 13. Please join me if you're there.
Update: The full paper is available at both the MPSA conference program link above (which will only be available to members after the conference) and my SSRN archive, which is publicly available and will have the most up-to-date version of the paper.
Data is not reality. It is rather a construct, an operationalization of an actor’s concept and reality, interpreting between the physical world and the intellectual structures by which actors understand that world. GIS shapefiles, for instance, operationalize a relationship between physically existing land and legally existing property; interaction with a census taker operationalizes a relationship between bodies and citizens. A key implication of the constructivist understanding of data is that, for all of the celebration of (and weeping and gnashing of teeth over) the purported ubiquity of data collection and data as the “detritus” of human life in contemporary affluent societies, data does not, in fact, simply happen. The constructed nature of data makes it quite possible for injustices to be embedded in the data itself. Whether by design or as unintended consequences, the process of constructing data builds social values and patterns of privilege into the data. Where those values and privileges are unjust, the injustice is then a characteristic of the data itself; no amount of openness can remedy such injustices, just as no amount of statistical processing can undo inaccuracies in the original data. “Garbage in, garbage out” is a central concept in data ethics.

Datized moments occur most often in the interaction of an individual with a bureaucratic organization such as the state or a business. But people and groups differ in their propensity to interact with such organizations. This difference provides an important point by which privilege can enter into data. Data over-represents some, and where those over-representations parallel existing structures of social privilege, it over-represents those already privileged and under-represents those less likely to be part of data-producing interactions.

Interactions with the state are rife with disparities that reflect social privilege. One well-studied example is the undercount of the decennial United States Census. Since the problem of undercounting was first quantified in the mid-Twentieth Century, black and Hispanic households have been undercounted at higher rates than non-black households. The causes of this undercount are myriad:

Households are not missed in the census because they are black or Hispanic. They are missed where the Census Bureau’s address file has errors; where the household is made up of unrelated persons; where household members are seldom at home; where there is a low sense of civic responsibility and perhaps an active distrust of the government; where occupants have lived but a short time and will move again; where English is not spoken; where community ties are not strong. (Prewitt 2010, 245)

Two commonalities in these explanations are striking: the extent to which these causes are barriers to interaction with census takers, and the extent to which they are correlated with racial and class privilege. The latter causes the undercount to disproportionately affect disadvantaged groups (hence, Prewitt argues, the focus on race in debates over census methodology between 1980 and 2000), while the former prevents those groups from being represented accurately in Census data. Similar problems exist in collecting any data on groups such as the homeless. Groups might also differ in their willingness to participate in particular interactions; the affluent and the poor, for example, have different thresholds for reporting building code violations.

Such privileges are not confined to interactions with the state. Residential segregation especially is often tied to forms of institutional discrimination that would influence how often individuals interact with bureaucracies. Zenk et al. (2005) found that low-income, predominantly African American neighborhoods in Detroit were, on average, 1.1 miles further from a supermarket than predominantly white neighborhoods with similar incomes, with consequently increased dependence on smaller food stores such as convenience stores or groceries. Similarly, Cohen-Cole (2011) argues that consumer credit discrimination based on the racial composition of applicants’ neighborhoods is linked to increased use of payday loans. In both cases, the use of less bureaucratized businesses by groups already suffering from discrimination in the form of de facto residential segregation (either as the legacy of formal segregation or because of ongoing discrimination) results in data that is statistically biased against such populations and reinforces whites’ privileged position. Businesses can analyze the needs of the (disproportionately white) customers with whom they interact and adapt accordingly; benefits thus accrue to the beneficiaries of social privilege.

Transforming information about a datized moment into data is equally problematic. Only some of the information about that moment will be datized. Which information is datized is not a natural consequence of the interaction but a design choice on the part of the data architects that reflects their purposes, resources, and values. An institutional survey director noted to me that survey data at the institution is subject to state open records laws and is sometimes requested by the public and state legislators. As a result, the survey director encouraged the practice of not collecting data that the institution would not be comfortable making public. In this case the concern was privacy, but this reasoning is at least as likely when more self-interested motives are present. Regardless of the motivation, though, such decisions are value-laden; thus the data built on such decisions will embody those values and transmit them in the process of using the resulting data.

Less conscious assumptions, such as those that are part of worldviews shaped by social privilege, will also shape such decisions, and will likely be less amenable to challenge to the extent that a lack of diversity among the data collectors renders them invisible. Higher education “net price calculators” are a case in point. Such tools, which the federal government requires all institutions receiving Title IV aid to produce, are designed to help students and their families estimate the likely cost of attending an institution given the prevalence of “high-tuition, high-aid” business models. This assumes that the net price is what is important to students. But Sara Goldrick-Rab argues that the gap in applications to elite colleges between high-achieving, high-income and high-achieving, low-income students reported by Hoxby and Avery is rooted in “sticker shock” at the high gross price of such institutions among low-income families in spite of the institutions’ often much lower net prices. Their disregard of net price is in part a lack of information, but more significantly a consequence of such families’ lack of trust in institutions generally and the substantially higher risk to such families if educational institutions fail to maintain the initial promises of aid, conditions that make the net price of the institutions less credible: “Being told that a college is likely to give you aid is not the same thing as getting the aid, [emphasis in original]” she writes. Such students choose to apply at less expensive (and consequently less selective) institutions as they present less risk to themselves and their families.

If Goldrick-Rab is correct, the credibility that the middle class finds in state and social institutions that have generally protected their interests should be seen as underlying the decision to collect and report average aid amounts that do not vary by income: middle class families can credibly take average aid as typical of people like them; low-income families cannot. One might expect the same to be true of first-generation students. With family members unfamiliar with the operations of universities, they will often be unaware of issues such as net price, or may not understand the financial aid process at all. Yet this background knowledge, like the credibility of a measure, is assumed in the selection of data to be collected. Those privileged with such knowledge find their privileges reinforced by this data; those who are not so privileged are further disadvantaged when they cannot see the data as meaningful.

Adding to this the question of how that information is stored increases the complexity of the issue. Key features in the problematic Bhoomi experience with open data were not only the selection of only certain types of documentation for inclusion in the land title data but also the decision to store the resulting data in a relational database system. These aspects of the system design effectively precluded informal knowledge from being part of the open data system; such knowledge, which was the basis of the existing land claims of Dalits, cannot be queried by the systems used to . The two features both inform and reinforce each other: excluding narratives and other unstructured data obviates the need for systems that can handle unstructured data such as those using text-analytics or Unstructured Information Management Architecture (UIMA), while the choice of a relational database precludes the use of narrative information. Donovan (2012) cites this as an instance of James Scott’s (1998) “seeing like a state” in which the local government sought to simplify society by making it legible. The open data system incorporated this value in its choice of what to datize about the moment in which land was transferred. This incorporated a value structure into the data, one that is clearly not neutral in the competition for power.

Because of the myriad ways that social privilege can become embedded in data sets, open data cannot be expected to universally promote justice. It can just as easily marginalize groups that are not part of the data: people whose lack of privilege excludes them from the kinds of interactions that produce data and makes their viewpoints invisible to those who collect data. Opening datasets composed of such data simply propagates the injustices that came into the data as it was collected. Whatever steps are taken to promote fairness in using data that is at its root unjust, the result will almost inevitably be unjust as well. Data is very much a case of “Injustice in, injustice out.”