Issues in Computer Vision Data Collection

I finished my thesis! Check it out here. Below is an excerpt of the introduction.

On the Shoulders of ImageNet

Data underpins deep learning. Deep learning, by way of convolutional neural networks (CNNs), dominates modern computer vision research and applications. Since Krizhevsky et al. demonstrated in 2012 that high-capacity CNNs trained with large amounts of data on graphics processing units result in powerful image classification systems [29], the field of computer vision has widely adopted this paradigm. Deep learning has been applied with great success to medical image segmentation [44], human pose estimation [51], face recognition [50], and many more tasks, establishing the CNN as the preeminent method in computer vision. As progress in the research community has historically been measured by accuracy on benchmark datasets such as MNIST [4], ImageNet [9], and COCO [31], researchers are motivated to collect more data, use more computing power, and train higher-capacity CNNs to further the state of the art in their respective domains. Individuals are not alone in this pursuit, however, as the sharing of data is a tenet of the research community. As computer vision systems have advanced from handcrafted feature-based methods to supervised deep learning, and now to weak supervision [48, 34] and self-training [55], increasingly large datasets are in demand. To this end, data collection practices in the computer vision community have shifted dramatically in the past thirty years.

In the 1990s, datasets were collected by academics in laboratory settings [41, 33] and made available through partnerships with government agencies [4, 42]. As the consumer internet boomed at the turn of the century, online search engines and social media websites provided a new means of collection. Computer vision researchers moved online in the 2000s to collect images, manually annotating datasets such as Caltech-101 [13] and PASCAL VOC [11], but these datasets, on the order of tens of thousands of examples, pushed the limit of in-house annotation. Fortunately for researchers, Amazon Mechanical Turk (AMT), a website for hiring remote crowdworkers to perform short, on-demand tasks, launched in 2005 and provided a solution. From 2007 to 2010, researchers at Stanford and Princeton used the crowdsourcing platform to task 49k “turkers” from 167 countries with annotating images to create the canonical ImageNet dataset of 14M images in 22k classes [12]. The associated ImageNet Large Scale Visual Recognition Challenge (ILSVRC), created from a 1k-class subset of the larger dataset, was held annually from 2010 to 2017. The research community coalesced around the ILSVRC as it presented an image classification problem an order of magnitude more difficult than its predecessors. Landmark work from Krizhevsky et al. [29] in the 2012 event was the catalyst for significant academic, industry, and state interest and investment in deep learning and artificial intelligence.

ImageNet and its challenge have been featured in the New York Times [35], cited more than 38k times [9, 45], and described as “the data that transformed AI research — and possibly the world.” [16] ImageNet’s success entrenched the use of crowdsourced annotations in the data collection pipeline, effectively solving the problem of large-scale data collection for the research community. This technique was subsequently used in collecting the object detection, segmentation and captioning dataset COCO [31], the human action classification dataset Kinetics [26], and the densely-annotated scene understanding dataset Visual Genome [28], among other widely-used datasets. The shift to web-scraped data and crowdsourced annotations, however, has not been without consequence.


Thoughts on ImageNet Roulette

This post originated as a tweet thread – I’m reposting here for posterity.

A couple of thoughts on this piece from Wired discussing the awesome ImageNet Roulette by @katecrawford and @trevorpaglen, mentioning some of my research analyzing ImageNet (and including a quote from me!)

First, I don’t think changing social norms between the development of WordNet in the 80s and today explains many of the issues with the taxonomy of ImageNet. As Kate and Trevor explain, many classes are not merely outdated, but fundamentally offensive or immaterial (how do you visualize a “hypocrite”?), leading to the perpetuation of stereotypes and phrenological ideas.

This was true in the 80s and is true today. In the rush to crowdsource labels and get AI “to work”, these issues were overlooked. We’ve all seen many examples recently of technologists not considering the negative implications of their inventions – this is but another example.

Second, efforts by the ImageNet creators to debias the “person” category are a good step, but removing the data from public access in the interim, without transparency, risks this important issue being lost to history.

We are doomed to repeat these mistakes if we remove the “bad parts” from this story – I would hope the original data remains accessible in the future for researchers to study.

And again on debiasing the “person” class… what about the rest of ImageNet? As noted in the article, the ImageNet Challenge subset (1k classes, conventionally used for pre-training) contains only 3 classes specifically of people: “scuba diver”, “groom” & “baseball player”…

However, in my studies, I’ve found that a person appears in closer to **27%** of all images in the ImageNet Challenge subset.

Here are the top categories containing people, found by running an object detection model on the data (% of images in a category containing >= 1 person):

Lots of fish classes… people like photos of them holding fish! (typically older white men, that is)
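The per-category measurement described above could be sketched roughly as follows. The detector here is a stub returning precomputed labels for illustration – in practice it would be a pre-trained object detection model (e.g. one producing COCO-style labels), and the image ids and categories are invented:

```python
# Sketch: % of images per category containing at least one detected person.
# detect_labels is a stand-in for a real object detection model.
def detect_labels(image_id):
    mock_detections = {
        "img_001": ["person", "tench"],
        "img_002": ["tench"],
        "img_003": ["person", "person", "coho"],
        "img_004": ["coho"],
    }
    return mock_detections.get(image_id, [])

def person_rate_by_category(dataset):
    """dataset: mapping of category name -> list of image ids.
    Returns the % of images in each category with >= 1 detected person."""
    rates = {}
    for category, image_ids in dataset.items():
        with_person = sum(1 for i in image_ids if "person" in detect_labels(i))
        rates[category] = 100.0 * with_person / len(image_ids)
    return rates

dataset = {"tench": ["img_001", "img_002"], "coho": ["img_003", "img_004"]}
print(person_rate_by_category(dataset))  # {'tench': 50.0, 'coho': 50.0}
```

Aggregating these per-category rates over the full 1k-class subset is what yields the overall figure of roughly 27% of images containing a person.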

As Cisse et al. note, underrepresentation of black persons in ImageNet in classes other than “basketball” may lead to models learning a narrow and biased association.

So while problematic classes are not included in the subset of ImageNet used by practitioners, biases remain. What does this mean for ML development when the standard process is to start with a model pre-trained on this data?

Maybe nothing, as downstream tasks update these weights and potentially biased feature representations reduce to edge and shape detectors. But maybe something. I think it’s an area worth exploring.

TLDR: Even if problematic ImageNet classes are seldom used in practice, the many social & political implications of classification that this work brings to light are incredibly valuable as we move forward in the field.


Two Faces of Facial Recognition

Facial Recognition technology is quietly and quickly being deployed in a variety of applications in the private and public sectors without regulation. This poses two important ethical questions: what happens if the technology doesn’t work well, and what happens if it works too well?

In the first case, research has demonstrated that historically marginalized, minority populations are disparately impacted by errors of Facial Recognition technology. In Gender Shades, Buolamwini and Gebru demonstrated that commercial computer vision applications work very well for lighter-skinned males, but have much higher error rates for darker-skinned females. In a follow-up study one year later entitled Actionable Auditing, Raji and Buolamwini analyzed the impact of publicly naming and disclosing the performance results of biased AI systems in Gender Shades. They found that the named companies all reduced accuracy disparities between males and females and between darker and lighter-skinned subgroups (some better than others), but accuracy disparities for companies not named in Gender Shades remained high. The very same systems investigated by these researchers are currently being sold to private companies, governments and law enforcement agencies as the backbone of their Facial Recognition systems, and their disparate errors have already had real-world impacts. In April 2019, a student in New York City sued Apple for $1 billion, claiming a Facial Recognition system used in their retail stores falsely linked him to a series of thefts, leading to his arrest. Suffice it to say, the increased use of Facial Recognition technology has sparked an intense debate, with a call by the ACLU for a moratorium on the use of Facial Recognition technology for immigration and law enforcement purposes until Congress and the public debate what uses of the technology should be permitted. In May 2019, the City of San Francisco approved a ban on its use in law enforcement applications, and in the same month, Amazon shareholders voted on a proposal for the company to stop selling Facial Recognition technology to government agencies (it did not pass).
Within the past three weeks, the United States House Committee on Oversight and Reform has held hearings on Facial Recognition technology, the first on its impact on civil rights and the second on ensuring transparency in government use. Closer to home, the Toronto Star reported in May that Toronto Police Services have been using a Facial Recognition system for the past year to assist in investigations. The continued application of Facial Recognition technology, despite its disparate performance, poses a threat to individuals, especially those historically marginalized, of being wrongly implicated in crimes. The inability to opt out of this not-ready-for-primetime technology is problematic.
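The disparity audit described above boils down to comparing error rates across demographic subgroups. A minimal sketch of that computation, using invented records rather than the actual Gender Shades data:

```python
# Sketch of a subgroup accuracy-disparity audit (simplified version of the
# Gender Shades methodology; the records below are invented for illustration).
def subgroup_error_rates(records):
    """records: list of (subgroup, predicted, actual) classification results.
    Returns per-subgroup error rates and the largest gap between them."""
    totals, errors = {}, {}
    for subgroup, predicted, actual in records:
        totals[subgroup] = totals.get(subgroup, 0) + 1
        if predicted != actual:
            errors[subgroup] = errors.get(subgroup, 0) + 1
    rates = {s: errors.get(s, 0) / totals[s] for s in totals}
    disparity = max(rates.values()) - min(rates.values())
    return rates, disparity

records = [
    ("lighter_male", "male", "male"),
    ("lighter_male", "male", "male"),
    ("darker_female", "male", "female"),   # misclassification
    ("darker_female", "female", "female"),
]
rates, gap = subgroup_error_rates(records)
print(rates, gap)  # {'lighter_male': 0.0, 'darker_female': 0.5} 0.5
```

Actionable Auditing's finding, in these terms, was that the named companies drove this disparity value down after public disclosure, while unnamed companies did not.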

In the second case, the success of Facial Recognition technology in identifying and tracking people in public and private spaces can lead to a loss of privacy that erodes societal norms, the implications of which I believe people are not fully aware. A recent article in the New York Times describing the Communist Party of China’s (CPC) use of Facial Recognition to track its citizens is a stark warning for the rest of the world of the implications of the technology’s widespread use. The article describes how the CPC is using Facial Recognition as the core of its surveillance apparatus to track, surveil and intern individuals of a largely Muslim minority group, the Uighurs. In another NYT piece, Sahil Chinoy chronicled his experience using Amazon’s commercial Facial Recognition service Rekognition to identify and track individuals, using only public sources of information, for a total cost of $60 USD. Using a publicly-available, live-streaming camera of Bryant Park in New York City, along with web-scraped images of employees in neighbouring businesses, the author was able to detect 2,750 faces in a nine-hour period, link several faces to their real-world identities and monitor their movement patterns. This experiment is an illustrative example of the ability of a non-expert, using only publicly available data and a very small budget, to create a functioning Facial Recognition system – the success of governments and large private companies in creating invasive Facial Recognition systems, with experts in computer vision, access to massive datasets and large R&D budgets, is almost too incredible to imagine. Once again, the inability to opt out of this technology is highly problematic, given that it is already in use, but without public disclosure.

As a master’s student at the University of Waterloo studying computer vision, the societal impact of my work is top-of-mind. With the support of the Vector Scholarship in AI, the Alexander Graham Bell Canadian Graduate Scholarship, and that of my supervisors, I am pursuing the issues outlined above in my thesis work and in a project named Tin Foil AI.

My first research paper addressing the issue of disparate performance of Facial Recognition technology looks at the training data that fuels computer vision systems in a broad sense. Auditing ImageNet is the first in a series of works to develop a framework for the automated annotation of demographic attributes in large-scale image datasets, to be presented on June 17 at the Workshop on Fairness, Transparency, Accountability and Ethics in Computer Vision (FATE CV) at CVPR 2019. This project aims to scrutinize training data that computer vision practitioners often abstract away, so that imbalances in intersectional group representation can be quantified and their downstream effects on bias in trained neural networks can be studied.
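The quantification step described above amounts to counting images per intersectional subgroup once demographic annotations exist. A minimal sketch, where the attribute names and values are hypothetical placeholders rather than the actual schema used in the paper:

```python
from collections import Counter

# Sketch: quantify intersectional group representation from per-image
# demographic annotations. Attribute names/values are illustrative only.
def intersectional_counts(annotations, attributes=("gender", "skin_tone")):
    """annotations: list of dicts of demographic attributes per image.
    Returns a Counter keyed by tuples of attribute values."""
    return Counter(tuple(a[attr] for attr in attributes) for a in annotations)

annotations = [
    {"gender": "female", "skin_tone": "darker"},
    {"gender": "male", "skin_tone": "lighter"},
    {"gender": "male", "skin_tone": "lighter"},
]
counts = intersectional_counts(annotations)
print(counts[("male", "lighter")])  # 2
```

Comparing these counts against a reference distribution is what exposes the representation imbalances whose downstream effects the project studies.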

On the other face of Facial Recognition, Tin Foil AI comes into play. I’m currently in the prototype development phase so I can’t speak too much on it right now, but I’m interested in finding a way to opt out of this tech. Check out the project page for more information and to sign up for updates – if you have experience with adversarial attacks or defences and are concerned with unregulated Facial Recognition, please reach out!