This is Atlantic Intelligence, an eight-week series in which The Atlantic’s leading thinkers on AI will help you understand the complexity and opportunities of this groundbreaking technology. Sign up here.
The bedrock of the AI revolution is the internet, or more specifically, the ever-expanding bounty of data that the web makes available to train algorithms. ChatGPT, Midjourney, and other generative-AI models “learn” by detecting patterns in massive amounts of text, images, and videos scraped from the internet. The process entails hoovering up huge quantities of books, art, memes, and, inevitably, the troves of racist, sexist, and illicit material distributed across the web.
Earlier this week, Stanford researchers found a particularly alarming example of that toxicity: The largest publicly available image data set used to train AIs, LAION-5B, reportedly contains more than 1,000 images depicting the sexual abuse of children, out of more than 5 billion in total. A spokesperson for the data set’s creator, the nonprofit Large-scale Artificial Intelligence Open Network, told me in a written statement that it has a “zero tolerance policy for illegal content” and has temporarily halted the distribution of LAION-5B while it evaluates the report’s findings, although this and earlier versions of the data set have already been used to train prominent AI models.
Because they are free to download, the LAION data sets have been a key resource for start-ups and academics developing AI. It’s notable that researchers have the ability to peer into these data sets to find such awful material at all: There’s no way to know what content is harbored in similar but proprietary data sets from OpenAI, Google, Meta, and other tech companies. One of those researchers is Abeba Birhane, who has been scrutinizing the LAION data sets since the first version’s release, in 2021. Within six weeks of that release, Birhane, a senior fellow at Mozilla who was then studying at University College Dublin, published a paper detailing her findings of sexist, pornographic, and explicit rape imagery in the data. “I’m really not surprised that they found child-sexual-abuse material” in the newest data set, Birhane, who studies algorithmic justice, told me yesterday.
Birhane and I discussed where the problematic content in giant data sets comes from, the dangers it presents, and why the work of detecting this material grows more challenging by the day. Read our conversation, edited for length and clarity, below.
— Matteo Wong, assistant editor
More Challenging by the Day
Matteo Wong: In 2021, you studied the LAION data set, which contained 400 million captioned images, and found evidence of sexual violence and other harmful material. What motivated that work?
Abeba Birhane: Because data sets are getting bigger and bigger, 400 million image-and-text pairs is no longer large. But two years ago, it was advertised as the biggest open-source multimodal data set. When I saw it being announced, I was very curious, and I took a peek. The more I looked into the data set, the more I saw really disturbing stuff.
We found there was a lot of misogyny. For example, when you queried the data set with any benign word remotely related to womanhood, such as mama, auntie, or beautiful, it returned a huge proportion of pornography. We also found images of rape; that was really emotionally heavy and intense work, because we were looking at images that are really disturbing. Alongside that audit, we also put forward a lot of questions about what the data-curation community and the larger machine-learning community should do about it. We also later found that, as the size of the LAION data sets increased, so did the hateful content. By implication, so does any other problematic content.
Wong: This week, the biggest LAION data set was removed because of the finding that it contains child-sexual-abuse material. In the context of your earlier research, how do you view this finding?
Birhane: It did not surprise us. These are the issues that we have been highlighting since the first release of the data set. We need a lot more work on data-set auditing, so when I saw the Stanford report, I thought it was a welcome addition to the body of work that has been investigating these issues.
Wong: Research by yourself and others has continuously found some really abhorrent and often illegal material in these data sets. This may seem obvious, but why is that dangerous?
Birhane: Data sets are the backbone of any machine-learning system. AI didn’t come into vogue over the past 20 years only because of new theories or new methods. AI became ubiquitous mainly because of the internet, because that allowed for mass harvesting of large-scale data sets. If your data contains illegal stuff or problematic representations, then your model will necessarily inherit those issues, and its outputs will reflect them.
But if we take another step back, to some extent it’s also disappointing to see data sets like LAION being removed. The LAION data set came into existence because its creators wanted to replicate the data sets inside big corporations—for example, what the data sets used at OpenAI might look like.
Wong: Does this research suggest that tech companies, if they’re using similar methods to collect their data sets, might harbor similar problems?
Birhane: It’s very, very likely, given the findings of previous research. Scale comes at the cost of quality.
Wong: You’ve written about research you couldn’t do on these giant data sets because of the resources necessary. Does scale also come at the cost of auditability? That is, does it become less possible to understand what’s inside these data sets as they become larger?
Birhane: There is a huge asymmetry in terms of resource allocation, where it’s much easier to build stuff but a lot more taxing in terms of intellectual labor, emotional labor, and computational resources when it comes to cleaning up what’s already been assembled. If you look at the history of data-set creation and curation, say 15 to 20 years ago, the data sets were much smaller in scale, but a lot of human attention went into detoxifying them. Now all that human attention to data sets has really disappeared, because these days a lot of data sourcing has been automated. That makes it cost-effective if you want to build a data set, but the flip side is that, because data sets are much larger now, they require a lot of resources, including computational resources, and it’s much more difficult to detoxify them and investigate them.
Wong: Data sets are getting bigger and harder to audit, but more and more people are using AI built on that data. What kind of support would you want to see for your work going forward?
Birhane: I would like to see a push for open-sourcing data sets—not just model architectures, but data itself. As horrible as open-source data sets are, if we don’t know how horrible they are, we can’t make them better.
P.S.
Struggling to find your travel-information and gift-receipt emails during the holidays? You’re not alone. Designing an algorithm to search your inbox is paradoxically much harder than making one to search the entire internet. My colleague Caroline Mimbs Nyce explored why in a recent article.
— Matteo