Abu Ghraib and Witches: What Might Go Wrong If We Let AI Predict Violence from Images

What do pictures of the Abu Ghraib prison in Iraq and witches have in common? There are many ways to approach this question, but so far AI can only provide us with an unsatisfactory one.

By Emma Rengers, PhD Candidate, University of Birmingham
CAIDE Summer Research Academy

Content warning: This article contains discussion of a graphic image depicting violence and abuse; this may be disturbing for some people.

As my work concerns computer vision technologies, I do not usually ponder the connections between Abu Ghraib and witches. This connection, surprisingly, was given to me by Google’s Cloud Vision API. The API allows users to input their own images. It then uses computer vision (a subfield of artificial intelligence) to predict the nature of the image, and what is pictured in the image. This includes an estimate of whether the image is likely to contain “violence.” We can imagine many applications for such estimates, including content moderation and law enforcement applications. Yet, we should have serious doubts about the possibility and the desirability of systems that claim to reliably detect violence from images.

I was curious about how well this Cloud Vision API could recognise violence from images, so I did some (admittedly somewhat unmethodical) tests. Surprisingly, even the most obviously violent images I input in the API (images of war crimes, torture, riots, militaristic propaganda from fascist regimes and terrorist groups, etc.) were not classified as “violent.”

The most striking example was the famous picture from the then US-run Abu Ghraib prison in Iraq (2004), picturing a hooded detainee standing on top of a box, with electrical wires hooked up to his fingers. This is arguably one of the most famous images of modern torture. Those who know the context immediately recognise it, and even those who do not know exactly where the picture is from can guess that something bad is happening.

Google's computer vision tool did not see it the same way; it classified the image with alarmingly incorrect results. It associated the image of the hooded, tortured detainee with “witch hat” and “costume accessory.” It also predicted that the image was unlikely to contain violence. This is particularly problematic because an image that is so horrifying to a human was completely misclassified by an algorithm.
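To make concrete what kind of output is at stake, here is a minimal Python sketch of how such a result might be read. It assumes the JSON shape of the Cloud Vision REST response (`labelAnnotations` and `safeSearchAnnotation`) and its ordered likelihood buckets. The sample response is a hand-written placeholder, not the actual output for the Abu Ghraib image: the scores are invented, and `UNLIKELY` is an assumption standing in for the "unlikely" verdict described above. The helper names `summarise` and `flagged_as_violent`, and the `LIKELY` threshold, are hypothetical.

```python
import json

# Likelihood buckets, ordered from least to most likely.
LIKELIHOOD_ORDER = [
    "UNKNOWN", "VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY",
]

# Hand-written placeholder in the shape of one annotation response.
# The labels come from the article; the scores are invented for illustration.
SAMPLE_RESPONSE = json.loads("""
{
  "labelAnnotations": [
    {"description": "Witch hat", "score": 0.0},
    {"description": "Costume accessory", "score": 0.0}
  ],
  "safeSearchAnnotation": {"violence": "UNLIKELY"}
}
""")


def summarise(response):
    """Return the predicted labels and the violence likelihood bucket."""
    labels = [a["description"] for a in response.get("labelAnnotations", [])]
    violence = response.get("safeSearchAnnotation", {}).get("violence", "UNKNOWN")
    return labels, violence


def flagged_as_violent(violence, threshold="LIKELY"):
    """Hypothetical moderation rule: flag only at or above the threshold bucket."""
    return LIKELIHOOD_ORDER.index(violence) >= LIKELIHOOD_ORDER.index(threshold)


labels, violence = summarise(SAMPLE_RESPONSE)
print(labels, violence, flagged_as_violent(violence))
```

Under a rule like this, a downstream content moderation or law enforcement system would never see the image at all: the bucket, not the picture, decides what gets reviewed.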

What went wrong? AI enthusiasts might wonder whether Google’s computer vision tool just was not trained with enough data, or the right kind of data. Could this type of blunder have been avoided by a simple improvement of the training data? I believe not, as we have several reasons to be sceptical of such a solution.

Firstly, computer vision tools do not learn any substantive concepts or narratives. Computer vision works by mathematical and statistical analysis of the pixel values in an image. These calculations do not help the computer understand what is going on in the image.
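The point can be illustrated with a deliberately crude Python sketch. It is not how Google's tool works (real systems use far larger learned models); the features, weights, and the names `pixel_features` and `toy_score` are all assumptions made up for illustration. What it shares with real systems is the shape of the input: a grid of numbers, and nothing else.

```python
from statistics import mean, pstdev


def pixel_features(pixels):
    """Reduce a grayscale pixel grid (values 0-255) to two summary statistics."""
    flat = [value for row in pixels for value in row]
    brightness = mean(flat) / 255
    contrast = pstdev(flat) / 255
    return brightness, contrast


def toy_score(pixels, weights=(0.5, 1.5), bias=-0.4):
    """Hypothetical 'violence score': a weighted sum of pixel statistics.

    The function signature has no place for who is pictured, where, or why.
    """
    brightness, contrast = pixel_features(pixels)
    return weights[0] * brightness + weights[1] * contrast + bias


# The same pixel grid with two different (invented) contexts attached.
grid = [[0, 255], [255, 0]]
record_a = {"pixels": grid, "context": "stage costume at a theatre"}
record_b = {"pixels": grid, "context": "detainee in a military prison"}

# Only the pixels are ever passed in, so the two records cannot be told apart.
print(toy_score(record_a["pixels"]) == toy_score(record_b["pixels"]))
```

However sophisticated the statistics become, the context field in this sketch stays outside the calculation unless someone turns it into data as well.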

Now what does recognising violence require? Violence, as we tend to understand it, is contextual. Kicking might be violent behaviour in a classroom, but not in a football game. Smashing windows would be seen as violent behaviour in most situations, but not in some emergency situations. What makes an act “violent” might depend on the alleged perpetrator’s intentions, adverse effects on potential victims, the extent to which the perpetrator is acting rationally, the power position of the perpetrator, and many more factors.[1] In other words, recognising violence requires social understanding.

Creating reliable data on all these factors would not only require incredibly pervasive and granular surveillance of all sorts of social situations; it would also require many value judgements (what is an “adverse effect”? What is “rational behaviour”?). Improved data on the factors we use to recognise violence would therefore be both undesirable and inescapably subjective.

Secondly, even with perfect training data, computer vision systems might still overlook certain forms of violence. Computer vision recognises violence on the basis of purely visual cues. However, not all violence is directly visible. A social system that has successfully gained control over its subjects does not need to look violent. Consider a person in slavery, depicted picking cotton on a plantation. The image may not contain shackles or other signs of immediate violence, but the situation is nevertheless a direct result of an extremely violent institution. How does computer vision spot the difference between a person in slavery and a free person? Violence can be structural: embedded in our political practices, our beliefs, and our public and private spaces. Computer vision tools are not equipped to deal with “invisible” forms of violence. We need to reflect on what types of violence computer vision tools make visible, and what types of violence they may render “invisible” by classifying them as “unlikely” to be violent.

Now let us get back to Abu Ghraib and witches. The miscategorisation – describing the detainee as a person with a “witch hat” and a “cloak” – was incredibly misguided. Yet, paradoxically, it can also lead to understanding. There are indeed parallels between the way in which detainees have been treated by the US military in Abu Ghraib and Guantanamo Bay, and the earlier witch hunts in the Christian West. Both involved the torturing of bodies which were seen as representing a threat to civilised society – an almost ritual purification of society for the sake of the preservation of Western civilisation (see Keen, 2006, pp. 93-97). Seeing these kinds of analogies, or understanding one type of violence through the context of an older type of violence, is something we can do as humans when we are confronted with pictures like these. Contexts can help us “see” differently. However, it is important to understand that the computer's “wisdom” here is completely accidental and inadvertent – it is dependent on humans drawing those parallels, in order to understand the weight and violence of the image and events at Abu Ghraib.

It is imperative that we take seriously the concept of violence as the politically, socially, and historically situated phenomenon that it is. Therefore, we must be careful when delegating the detection of violence to computers. If we are not, the answer to the question “what do pictures of the Abu Ghraib prison in Iraq and witches have in common?” will remain an unsatisfactory “witch hat” and “costume accessory.”

Emma is a PhD researcher at the School of Law and the School of Computer Science at the University of Birmingham (UK). Her research focuses on the legal, ethical, and political implications of computer vision technologies.

[1] Violence “remains a complex, unwieldy and highly contested concept that is, frankly, not well understood” (Vorobej, 2016, p. 1). The exact conditions that make an act “violent” therefore depend on the school of thought one adheres to. Recognising the fulfilment of the conditions, however, remains an exercise in social understanding.

References

Keen, David (2006). “War Without End? Magic, Propaganda and the Hidden Functions of Counter-Terror” Journal of International Development 18: 87-104. https://doi.org/10.1002/jid.1264.

Vorobej, Mark (2016). The Concept of Violence. Routledge.

4 Feb 2021