U of Texas will stop using controversial algorithm to evaluate Ph.D. applicants

In 2013, the University of Texas at Austin’s computer science department began using a machine-learning system called GRADE to help make decisions about who gets into its Ph.D. program — and who doesn’t. This year, the department abandoned it.

Before the announcement, which the department released in the form of a tweet reply, few had even heard of the program. Now, its critics — concerned about diversity, equity and fairness in admissions — say it should never have been used in the first place.

“Humans code these systems. Humans are encoding their own biases into these algorithms,” said Yasmeen Musthafa, a Ph.D. student in plasma physics at the University of California, Irvine, who rang alarm bells about the system on Twitter. “What would UT Austin CS department have looked like without GRADE? We’ll never know.”

GRADE (which stands for GRaduate ADmissions Evaluator) was created by a UT faculty member and UT graduate student in computer science, originally to help the graduate admissions committee in the department save time. GRADE predicts how likely the admissions committee is to approve an applicant and expresses that prediction as a numerical score out of five. The system also explains which factors most influenced its prediction.

The UT researchers who made GRADE trained it on a database of past admissions decisions. The system uses patterns from those decisions to calculate its scores for candidates.

For example, letters of recommendation containing the words “best,” “award,” “research” or “Ph.D.” are predictive of admission — and can lead to a higher score — while letters containing the words “good,” “class,” “programming” or “technology” are predictive of rejection. A higher grade point average means an applicant is more likely to be accepted, as does the name of an elite college or university on the résumé. Within the system, institutions were encoded into the categories “elite,” “good” and “other,” based on a survey of UT computer science faculty.
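The way word-level signals like these can feed a score is easier to see in a toy example. The sketch below is not GRADE's actual model. The word weights, tier bonuses, GPA coefficient and the function name `toy_score` are all invented for illustration, and a logistic curve stretched onto a 0-to-5 scale stands in for whatever the real system computes.

```python
import math
import re

# Hypothetical weights, invented for illustration only. They follow the
# direction of the signals described in the article, not real magnitudes.
WORD_WEIGHTS = {
    "best": 0.8, "award": 0.6, "research": 0.5, "ph.d": 0.4,
    "good": -0.5, "class": -0.4, "programming": -0.3, "technology": -0.3,
}
# Hypothetical bonuses for the three institution categories.
TIER_WEIGHTS = {"elite": 1.0, "good": 0.3, "other": 0.0}
GPA_WEIGHT = 0.5  # hypothetical weight per GPA point above 3.0


def toy_score(letter: str, gpa: float, tier: str) -> float:
    """Return a score between 0 and 5 for one applicant."""
    # Bag of words: only the presence of a word counts, not order or context.
    raw = re.findall(r"[a-z.]+", letter.lower())
    tokens = {t.strip(".") for t in raw}
    z = sum(w for word, w in WORD_WEIGHTS.items() if word in tokens)
    z += GPA_WEIGHT * (gpa - 3.0)
    z += TIER_WEIGHTS[tier]
    # Squash the raw total into the range (0, 5) with a logistic curve.
    return 5.0 / (1.0 + math.exp(-z))


strong = toy_score(
    "The best student I have advised; her research won an award.", 3.8, "elite"
)
weak = toy_score("A good student in my programming class.", 3.8, "other")
print(round(strong, 2), round(weak, 2))
```

Both letters are positive in tone, yet the second one scores lower because its words happen to be on the negative list and its author's institution falls in the lowest tier. That is the kind of effect critics of the bag-of-words approach point to.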

Every application GRADE scored during the seven years it was in use was still reviewed by at least one human committee member, UT Austin has said, but sometimes only one. Before GRADE, faculty members made multiple review passes over the pool. The system saved the committee time, according to its developers, by allowing faculty to focus on applicants on the cusp of admission or rejection and review applicants in descending order of quality.

GRADE did appear to save the committee time. In the 2012 and 2013 application seasons, developers said in a paper about their work, it reduced the number of full reviews per candidate by 71 percent and cut the total time reviewing files by 74 percent. (One full review typically takes 10 to 30 minutes.) Between 2000 and 2012, applications to the computer science Ph.D. program grew from about 250 to nearly 650, though the number of faculty able to review those applications remained mostly constant. Since 2012, the number of applications has grown to more than 1,200.

The university’s use of the technology escaped attention for a number of years, until this month, when the physics department at the University of Maryland at College Park held a colloquium talk with the two creators of GRADE.

The talk gained attention on Twitter as graduate students accused GRADE’s creators of further disadvantaging underrepresented groups in the computer science admissions process.

“We put letters of recommendation in to try to lift people up who have maybe not great GPAs. We put a personal statement in the graduate application process to try to give marginalized folks a chance to have their voice heard,” said Musthafa, who is also a member of the Physics and Astronomy Anti-Racism Coalition. “The worst part about GRADE is that it throws that out completely.”

Advocates have long been concerned about the potential for human biases to be baked into or exacerbated by machine-learning algorithms. Algorithms are trained on data. When it comes to people, what those data look like is a result of historical inequity. Preferences for one type of person over another are often the result of conscious or unconscious bias.

That hasn’t stopped institutions from using machine-learning systems in hiring, policing and prison sentencing for a number of years now, often to great controversy.

“Every process is going to make some mistakes. The question is, where are those mistakes likely to be made and who is likely to suffer as a result of them?” said Manish Raghavan, a computer science Ph.D. candidate at Cornell University who has researched and written about bias in algorithms. “Likely those from underrepresented groups or people who don’t have the resources to be attending elite institutions.”

Though many women and people who are Black and Latinx have had successful careers in computer science, those groups are underrepresented in the field at large. In 2017, whites, Asians and nonresident aliens received 84 percent of degrees awarded for computer science in the United States.

At UT, nearly 80 percent of undergraduates in computer science in 2017 were men.

Raghavan said he was surprised that there appeared to be no effort to audit the impacts of GRADE, such as how scores differ across demographic groups.
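An audit of the kind Raghavan describes could begin with something as simple as comparing score summaries by group. The sketch below uses made-up records, a hypothetical helper named `audit_by_group` and an arbitrary cutoff of 3.0. It does not draw on any GRADE data.

```python
from collections import defaultdict
from statistics import mean


def audit_by_group(records, threshold=3.0):
    """Summarize scores per group: count, mean, and share at or above threshold."""
    by_group = defaultdict(list)
    for group, score in records:
        by_group[group].append(score)
    return {
        group: {
            "n": len(scores),
            "mean": mean(scores),
            "share_above": sum(s >= threshold for s in scores) / len(scores),
        }
        for group, scores in by_group.items()
    }


# Invented (group, score) pairs, for illustration only.
records = [
    ("A", 4.2), ("A", 3.6), ("A", 2.9),
    ("B", 3.1), ("B", 2.4), ("B", 2.2),
]
summary = audit_by_group(records)
for group, stats in sorted(summary.items()):
    print(group, stats)
```

A gap between groups in a table like this would not by itself prove bias, but it would show where a closer look is needed.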

GRADE’s creators have said that the system is only programmed to replicate what the admissions committee was doing prior to 2013, not to make better decisions than humans could. The system isn’t programmed to use race or gender to make its predictions, they’ve said. In fact, when given those features as options to help make its predictions, it chooses to give them zero weight. GRADE’s creators have said this is evidence that the committee’s decisions are gender and race neutral.

Detractors have countered this, arguing that race and gender can be encoded into other features of the application that the system uses. Women’s colleges and historically Black universities may be undervalued by the algorithm, they’ve said. Letters of recommendation are known to reflect gender bias, as recommenders are more likely to describe female students as “caring” rather than “assertive” or “trailblazing.”
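The proxy argument can be shown with a toy case. In the sketch below, which uses invented applicants and hypothetical point values, the scoring function never reads the group label. The groups still end up with different average scores, because group membership is correlated with institution tier in the made-up data.

```python
from statistics import mean

# Hypothetical points per institution category, for illustration only.
TIER_POINTS = {"elite": 4.0, "good": 3.0, "other": 2.0}


def group_blind_score(applicant):
    # The "group" field is never read here; only the tier is used.
    return TIER_POINTS[applicant["tier"]]


# Invented applicants. Group Y is concentrated in lower tiers by construction.
applicants = [
    {"group": "X", "tier": "elite"},
    {"group": "X", "tier": "elite"},
    {"group": "X", "tier": "good"},
    {"group": "Y", "tier": "good"},
    {"group": "Y", "tier": "other"},
    {"group": "Y", "tier": "other"},
]


def mean_score(group):
    return mean(group_blind_score(a) for a in applicants if a["group"] == group)


gap = mean_score("X") - mean_score("Y")
print(round(gap, 2))
```

Giving a sensitive attribute zero weight therefore says little about whether outcomes differ by group. The difference can travel through any feature that tracks the attribute.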

In the Maryland talk, faculty raised their own concerns. What a committee is looking for might change each year. Letters of recommendation and personal statements should be thoughtfully considered, not turned into a bag of words, they said.

“I’m kind of shocked you did this experiment on your students,” Steve Rolston, chair of the physics department at Maryland, said during the talk. “You seem to have built a model that builds in whatever bias your committee had in 2013 and you’ve been using it ever since.”

In an interview, Rolston said graduate admissions can certainly be a challenge. His department receives over 800 graduate applications per year, which takes a good deal of time for faculty to evaluate. But, he said, his department would never use a tool like this.

“If I ask you to do a classifier of images and you’re looking for dogs, I can check afterwards that, yes, it did correctly identify dogs,” he said. “But when I’m asking for decisions about people, whether it’s graduate admissions, or hiring or prison sentencing, there’s no obvious correct answer. You train it, but you don’t know what the result is really telling you.”

Rolston said having at least one faculty member review each application was not a convincing safeguard.

“If I give you a file and say, ‘Well, the algorithm said this person shouldn’t be accepted,’ that will inevitably bias the way you look at it,” he said.

UT Austin has said GRADE was used to organize admissions decisions, rather than make them.

“It was never used to make decisions to admit or reject prospective students, as at least one faculty member directly evaluates applicants at each stage of the review process,” a spokesperson for the Graduate School said via email.

Despite the criticism around diversity and equity, UT Austin has said GRADE is being phased out because it is too difficult to maintain.

“Changes in the data and software environment made the system increasingly difficult to maintain, and its use was discontinued,” the spokesperson said via email. “The Graduate School works with graduate programs and faculty members across campus to promote holistic application review and reduce bias in admissions decisions.”

For Musthafa, the fact that GRADE may be gone for good does not change the existing inequity in graduate admissions.

“The entire system is steeped in racism, sexism and ableism,” they said. “How many years of POC computer science students got denied [because of this]?”

Addressing that inequity — as well as the competitiveness that led to the creation of GRADE — may mean expanding committees, paying people for their time and giving Black and Latinx graduate students a voice in those decisions, they said. But automating cannot be part of that decision making.

“If we automate this to any extent, it’s just going to lock people out of academia,” Musthafa said. “The racism of today is being immortalized in the algorithms of tomorrow.”