I finished my thesis! Check it out here. Below is an excerpt of the introduction.
On the Shoulders of ImageNet
Data underpins deep learning. Deep learning, by way of convolutional neural networks (CNNs), dominates modern computer vision research and applications. Since Krizhevsky et al. demonstrated in 2012 that high-capacity CNNs trained on large amounts of data with graphics processing units yield powerful image classification systems [29], the field of computer vision has widely adopted this paradigm. Deep learning has been applied with great success to medical image segmentation [44], human pose estimation [51], face recognition [50], and many other tasks, establishing the CNN as the preeminent method in computer vision. Because progress in the research community has historically been measured by accuracy on benchmark datasets such as MNIST [4], ImageNet [9], and COCO [31], researchers are motivated to collect more data, use more computing power, and train higher-capacity CNNs to further the state of the art in their respective domains. Individual researchers are not alone in this pursuit, however, as data sharing is a tenet of the research community. As computer vision systems have advanced from handcrafted feature-based methods to supervised deep learning, and now to weak supervision [48, 34] and self-training [55], ever larger datasets are in demand. To meet this demand, data collection practices in the computer vision community have shifted dramatically over the past thirty years.
In the 1990s, datasets were collected by academics in laboratory settings [41, 33] and made available through partnerships with government agencies [4, 42]. As the consumer internet boomed at the turn of the century, online search engines and social media websites provided a new means of collection. Computer vision researchers moved online in the 2000s to collect images, manually annotating datasets such as Caltech-101 [13] and PASCAL VOC [11], but these datasets, on the order of tens of thousands of examples, pushed the limits of in-house annotation. Fortunately for researchers, Amazon Mechanical Turk (AMT), a marketplace for hiring remote crowdworkers to perform short, on-demand tasks, launched in 2005 and provided a solution. From 2007 to 2010, researchers at Stanford and Princeton used the crowdsourcing platform to task 49k “turkers” from 167 countries with annotating images, creating the canonical ImageNet dataset of 14M images across 22k classes [12]. The associated ImageNet Large Scale Visual Recognition Challenge (ILSVRC), built from a 1k-class subset of the full dataset, was held annually from 2010 to 2017. The research community coalesced around the ILSVRC because it presented an image classification problem an order of magnitude more difficult than its predecessors. Landmark work by Krizhevsky et al. [29] in the 2012 event was the catalyst for significant academic, industry, and state interest and investment in deep learning and artificial intelligence.
ImageNet and its challenge have been featured in the New York Times [35], cited more than 38k times [9, 45], and described as “the data that transformed AI research — and possibly the world” [16]. ImageNet’s success entrenched crowdsourced annotation in the data collection pipeline, effectively solving the problem of large-scale data collection for the research community. The technique was subsequently used to collect the object detection, segmentation, and captioning dataset COCO [31], the human action classification dataset Kinetics [26], and the densely annotated scene understanding dataset Visual Genome [28], among other widely used datasets. The shift to web-scraped data and crowdsourced annotations, however, has not been without consequence.