
Issues in Computer Vision Data Collection

I finished my thesis! Check it out here. Below is an excerpt of the introduction.

On the Shoulders of ImageNet

Data underpins deep learning. Deep learning, by way of convolutional neural networks (CNNs), dominates modern computer vision research and applications. Since Krizhevsky et al. demonstrated in 2012 that high-capacity CNNs trained with large amounts of data on graphics processing units yield powerful image classification systems [29], the field of computer vision has widely adopted this paradigm. Deep learning has been applied with great success to medical image segmentation [44], human pose estimation [51], face recognition [50], and many other tasks, establishing the CNN as the preeminent method in computer vision. As progress in the research community has historically been measured by accuracy on benchmark datasets such as MNIST [4], ImageNet [9], and COCO [31], researchers are motivated to collect more data, use more computing power, and train higher-capacity CNNs to further the state of the art in their respective domains. Individuals are not alone in this pursuit, however, as the sharing of data is a tenet of the research community. As computer vision systems have advanced from handcrafted feature-based methods to supervised deep learning, and now to weak supervision [48, 34] and self-training [55], increasingly large datasets are in demand. To this end, data collection practices in the computer vision community have shifted dramatically over the past thirty years.

In the 1990s, datasets were collected by academics in laboratory settings [41, 33] and made available through partnerships with government agencies [4, 42]. As the consumer internet boomed at the turn of the century, online search engines and social media websites provided a new means of collection. Computer vision researchers moved online in the 2000s to collect images, manually annotating datasets such as Caltech-101 [13] and PASCAL VOC [11], but these datasets, on the order of tens of thousands of examples, pushed the limit of in-house annotation. Fortunately for researchers, Amazon Mechanical Turk (AMT), a website for hiring remote crowdworkers to perform short, on-demand tasks, launched in 2005 and provided a solution. From 2007 to 2010, researchers at Stanford and Princeton used the crowdsourcing platform to task 49k “turkers” from 167 countries with annotating images to create the canonical ImageNet dataset of 14M images in 22k classes [12]. The associated ImageNet Large Scale Visual Recognition Challenge (ILSVRC), built from a 1k-class subset of the larger dataset, was held annually from 2010 to 2017. The research community coalesced around the ILSVRC as it presented an image classification problem an order of magnitude more difficult than its predecessors. Landmark work by Krizhevsky et al. [29] in the 2012 event was the catalyst for significant academic, industry, and state interest and investment in deep learning and artificial intelligence.

ImageNet and its challenge have been featured in the New York Times [35], cited more than 38k times [9, 45], and described as “the data that transformed AI research — and possibly the world.” [16] ImageNet’s success entrenched the use of crowdsourced annotations in the data collection pipeline, effectively solving the problem of large-scale data collection for the research community. This technique was subsequently used in collecting the object detection, segmentation and captioning dataset COCO [31], the human action classification dataset Kinetics [26], and the densely-annotated scene understanding dataset Visual Genome [28], among other widely-used datasets. The shift to web-scraped data and crowdsourced annotations, however, has not been without consequence.

The Abstraction of Data in Computer Vision

In the push for larger datasets to satisfy deep learning algorithms, careful consideration of the choices and assumptions underpinning data collection has largely been neglected. This is especially troubling in computer vision research, as many problem areas involve the collection and interpretation of images of human subjects, which brings with it issues of identity, privacy, and connotations of harmful classification systems from the past. The automation of data sourcing and annotation emphasizes efficiency and indiscriminate collection over scrutiny. As Jo and Gebru write, “Taking data in masses, without critiquing its origin, motivation, platform and potential impact results in minimally supervised data collection” [25]. The distributed workforce accessed through AMT and similar platforms is often treated as a homogeneous, interchangeable group of annotators, ignoring cultural differences that can lead different groups to produce different labels. Further, publications announcing datasets seldom provide rationale for the many value-laden decisions made in their composition [15], such as classification taxonomy and hierarchy, data source selection and representation, annotator instructions, compensation and demographics, and more. These decisions embed biases and assumptions into data but are largely ignored, as the focus of the community is on the product of data collection, not the process. While abstraction, the process of reducing complexity by considering something independently of its details [7], is a powerful concept in computer science, abstracting the social context away from data collection removes details that are crucial to understanding how a dataset represents the world.

In a similar vein, published datasets are largely accepted by practitioners for use in theory and application papers without scrutiny. With repeated use, datasets come to be viewed as neutral scientific objects, the many subjective decisions that went into their construction rarely contested by the community [14]. For many in computer vision, the actual images in ImageNet have been abstracted away, replaced with a testing suite that evaluates the performance of an image classification model on the command line. The notion that the 1k classes in the widely used 2012 ILSVRC subset of ImageNet are well selected to act as the gold-standard benchmark for image classification is itself an assumption that goes uncontested.

Issues with Current Data Practices

The lack of rigour in the collection of datasets and the lack of scrutiny in their use lead to far-reaching consequences in computer vision.

Bias

Undesirable biases are patterns or behaviours learned from data that are highly influential in the decisions of a model but not aligned with the values (or idealized values) of the society in which the model operates [47]. Bias in models arises from many different sources in the machine learning development process and can occur with respect to age, gender, race, or the intersections of these and other protected attributes [49]. One source of bias is training data that underrepresents some subset of the population the model will see as input once deployed. Many face recognition datasets have been shown to display this so-called representational bias, leading to poor performance of derived models on Black people, specifically Black women [5, 43, 37]. Such bias is deeply concerning, as face recognition models are actively used by law enforcement agencies across the world, with reports of false positive identifications leading to wrongful arrests, as was the case with Robert Williams, arrested by Detroit police in January 2020 [21]. Likewise, some state-of-the-art object detection systems have been shown to perform worse at identifying pedestrians with darker skin tones [52]. As many autonomous driving companies rely on CNNs for visual understanding of the world, these reports are alarming.
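To make the notion of representational bias concrete, below is a minimal sketch of a disaggregated evaluation in Python: rather than reporting a single accuracy, performance is broken down by subgroup. The model outputs, labels, and group attribute here are hypothetical placeholders, not drawn from any of the audits cited above.

from collections import defaultdict

def accuracy_by_group(predictions, labels, groups):
    # Accumulate correct counts and totals per (hypothetical) demographic group.
    correct, total = defaultdict(int), defaultdict(int)
    for pred, label, group in zip(predictions, labels, groups):
        total[group] += 1
        correct[group] += int(pred == label)
    return {g: correct[g] / total[g] for g in total}

# Toy example: a model can look reasonable in aggregate (50% here) while
# failing one subgroup entirely.
preds  = [1, 1, 0, 1, 0, 0, 1, 0]
labels = [1, 1, 0, 1, 1, 1, 0, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(accuracy_by_group(preds, labels, groups))  # {'A': 1.0, 'B': 0.0}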

Efforts to gather more diverse data to increase representation can prove difficult, however, as entrenched inequalities in society are often present at the source of collection. Historical bias is another means by which undesirable bias can be embedded in a computer vision system; this bias exists even given perfect sampling of the data source, as a consequence of deep-rooted systemic unfairness [49]. Labeled Faces in the Wild [23], for example, is a gold-standard benchmark in face verification [57]. It was sourced from images and captions of notable people in Yahoo! News stories from 2002 to 2004 and was estimated to contain 77.5% male and 83.5% white individuals [17]. This highly skewed representation is the result of a Western-focused media source that carries with it a patriarchy and history of systemic racism that undervalue women and people of colour in leadership positions in business, politics, academia, entertainment, and other newsworthy professions. The decision to select identities in this manner embedded a historical bias in the dataset, and no steps were taken to mitigate it.

Consent

Consent and privacy are notions not well addressed by computer vision practitioners in web-based data collection. In Canada, research involving human subjects is exempt from Research Ethics Board review when it “relies on information that is in the public domain and the individuals to whom the information refers have no reasonable expectation of privacy” [6]. But to what extent do individuals give up their expectation of privacy when they post content online, or when others post content of them without their consent? Critics of current research ethics regulations argue that the advent of big data dramatically changes the research ethics landscape, yet regulations have not been updated to address the new challenges of web-based data collection [38]. Researchers often rationalize the collection of data in face recognition, for example, by restricting datasets to celebrity identities, whom they view as having lower expectations of privacy, but this is not always the case. Some researchers provide a means for individuals, celebrity or not, to opt out of inclusion in face datasets, signalling an appreciation of the non-consensual nature of their collection:

“we are committed to protecting the privacy of individuals who do not wish their photos to be included.”

https://github.com/NVlabs/ffhq-dataset/

“Please contact us if you are a celebrity but do not want to be included in this data set. We will remove related entries by request”

https://web.archive.org/web/20180218212120/http://www.msceleb.org/download/sampleset

However, the onus remains on the individual to uncover their inclusion in such datasets, which are often restricted to approved researchers.

In some jurisdictions, however, individuals have legal protections against the non-consensual analysis of their face. The Biometric Information Privacy Act (BIPA) [24] is an Illinois state law enacted in 2008 that gives residents the right to seek financial compensation from private companies that conduct biometric analysis without obtaining informed consent, specifically naming face scans as a protected biometric. Potential financial liabilities of the popular face recognition dataset MegaFace [27] were recently raised by legal experts in a New York Times exposé [22]. MegaFace, created by researchers at the University of Washington in 2016, was collected from publicly available images of non-celebrities on Flickr. It was taken offline in April 2020. Even as public discourse around data collection and consent increases, computer vision researchers have shown little effort to engage with these issues in their work. As Solon writes in an NBC News report on the ethics of face recognition datasets, “It was difficult to find academics who would speak on the record about the origins of their training datasets; many have advanced their research using collections of images scraped from the web without explicit licensing or informed consent” [46].

Label Taxonomy

Labels in a dataset are often referred to as “ground truth” [45, 31, 28], yet this terminology provides a veil of objectivity for annotations that are stereotypical, subjective, and lack scientific foundations linking them to images.

Datasets frame problems through their classification schema. ImageNet draws its taxonomy from WordNet [39], a lexical database developed in the 1990s at Princeton that organizes sets of synonyms, or “synsets”, into semantically meaningful relationships, each expressing a distinct concept. During ImageNet’s construction, 80k noun synsets were filtered through algorithmic and manual methods to arrive at 22k classes [9]. However, the extent to which each class can be characterized visually varies considerably. While a football player or a scuba diver evokes a clear visual picture, a stakeholder or a hobbyist does not. Worse, some classes in ImageNet, such as debtor, snob, and good person, promote stereotypes and ideas of physiognomy, the pseudoscientific assertion that a person’s essential character can be read from their outer appearance [8]. While the ImageNet authors have made recent attempts to rectify this situation by removing explicitly offensive and non-imageable classes and diversifying others, this work comes more than ten years after the dataset’s release and widespread use in the research community [56].
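To make the WordNet structure underlying ImageNet’s taxonomy more tangible, the short sketch below uses NLTK’s WordNet interface to walk the hypernym (“is-a”) chain of a single noun synset. It is only a way of inspecting the lexical hierarchy ImageNet drew from; it does not reproduce ImageNet’s own filtering of the 80k noun synsets.

# Requires NLTK and its WordNet corpus: pip install nltk
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

# A concrete, visually groundable synset, in contrast to classes like "snob".
synset = wn.synset("scuba_diver.n.01")
print(synset.definition())

# Walk up the hypernym ("is-a") chain toward the root of the noun hierarchy.
chain = [synset]
while chain[-1].hypernyms():
    chain.append(chain[-1].hypernyms()[0])
print(" -> ".join(s.name() for s in chain))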

Annotations such as those in the facial attractiveness dataset SCUT-FBP, which assigns an attractiveness score between one and five to images of 500 Asian women [54], prove problematic as they launder subjectivity through data. While research suggests some elements of faces, such as facial symmetry, are found universally attractive, perhaps as an evolutionary indicator of good health [32], this is far from absolute. The notion of beauty varies across time, geography, culture, and individuals, so any attempt to create annotations that are treated as “ground truth” in perpetuity is fraught. The authors’ attempt to mitigate this subjectivity by averaging results from several annotators speaks to the fundamental uncertainty in the annotation task. Taxonomy issues notwithstanding, the inclusion of only women in the SCUT-FBP dataset promotes objectification, especially considering that only 13% of subjects in the database were photographed by the researchers themselves, with the remainder collected from the web without consent.

Physiognomy appears again in work by Wu and Zhang [53], entitled Automated Inference on Criminality using Face Images, in which face images are annotated as criminals and non-criminals in order to automate their identification with deep learning. An unwritten assumption in this work is that criminality is an innate class of individuals, linked to genetics, that manifests in the face. This line of thinking discounts an entire body of behavioural and social science that examines how socioeconomic status, lived experiences, environment, and other factors may impact criminality [30]. While the technical claims of this study have been refuted [1], bigger questions arise with respect to the motivations and ethical implications of this work. Although the authors claim their work is “only intended for pure academic discussions” and motivated by curiosity about the visual capabilities of machine learning systems, such statements promote a problematic “view from nowhere” that discounts the world in which their research exists and the power imbalances therein, a perspective of scientific objectivity thoroughly critiqued by feminist scholars [18]. As companies such as Faception claim the ability to identify terrorists and pedophiles from face images, research that is earnestly conducted out of curiosity can embolden commercialization and perpetuate harm, which is not evenly distributed in our unjust world, especially when such systems are used by a law enforcement establishment with a history of systemic racism [40, 36]. While an egregious case of neglecting domain-specific literature, this study is emblematic of a larger problem with data annotation: it can assert a visual relationship between an image and its label that is not founded in science.

Thesis Overview

This work aims to explore issues of bias, consent, and label taxonomy in computer vision through novel investigations into widely-used datasets in image classification, face recognition, and facial expression recognition. Through this work, I aim to challenge researchers to reconsider normative data collection and use practices such that computer vision systems can be developed in a more thoughtful and responsible manner.

Motivation

ImageNet [9] ushered in a flood of academic, industry, and state interest in deep learning and artificial intelligence. Despite ImageNet’s significance, in the ten years following its publication at the leading computer vision conference CVPR in 2009, there was never a comprehensive investigation into the demographics of the human subjects contained within the dataset. This is concerning from a pragmatic perspective, as models trained on ImageNet are widely used by computer vision practitioners in transfer learning, the practice of applying knowledge acquired on one task to a different but related problem. If certain groups are underrepresented in ImageNet, downstream models may inherit a biased understanding of the world. The extent to which possible biases are retained in models when trained on new datasets is an open question that cannot be answered until ImageNet is well understood. From a cultural perspective, the lack of scrutiny into ImageNet is a prime example of how datasets go uncontested after publication. ImageNet has been championed as one of the most important breakthroughs in artificial intelligence, and its achievements should indeed be celebrated; however, it appears that either its success or a complacency among researchers led to it not being studied critically for more than ten years, both of which are cause for concern. With this motivation I present Chapter 2 of this thesis, wherein I explore the question of bias in ImageNet by introducing a framework for the audit of large-scale image datasets.
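As a rough illustration of the transfer learning practice mentioned above, the sketch below loads an ImageNet-pretrained backbone from torchvision and replaces its classification head for a new task; the ten-class target task is an arbitrary placeholder. Whatever the pretrained weights encode, representational biases included, is carried into the downstream model.

# A minimal transfer-learning sketch, assuming PyTorch and torchvision are installed.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(pretrained=True)   # weights learned on ImageNet

# Freeze the backbone so only the new classification head is trained.
for param in model.parameters():
    param.requires_grad = False

num_target_classes = 10                    # placeholder for the downstream task
model.fc = nn.Linear(model.fc.in_features, num_target_classes)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# ... the new head is then trained on the downstream dataset as usual ...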

With the advent of web-scraped data, informed consent in the collection of human subjects for face recognition datasets has been largely ignored. As a result, modern datasets contain millions of images and hundreds of thousands of identities. State-of-the-art face recognition systems leverage these large collections of specific individuals’ faces to train CNNs to learn an embedding space that maps an arbitrary individual’s face to a vector representation of their identity. The performance of a face recognition system is directly related to the ability of its embedding space to discriminate between identities, and therefore to the size of its dataset. Recently, there has been significant public scrutiny of the source and privacy implications of large-scale face recognition datasets such as MS-Celeb-1M and MegaFace [19, 46, 22]. In 2005, an image of five-year-old Chloe Papa was uploaded to Flickr by their mother. In 2016, it was scraped and included in MegaFace. In 2019, Papa told the New York Times regarding their inclusion, “It’s gross and uncomfortable, I think artificial intelligence is cool and I want it to be smarter, but generally you ask people to participate in research. I learned that in high school biology” [22]. Many people are uncomfortable with their face being used to advance dual-use technologies such as face recognition that can enable mass surveillance. But is there a demonstrated impact of being included in such datasets? Are those included in the training sets of face recognition systems more likely to be identified? This question has not previously been studied. In Chapter 3 of this thesis, I conduct experiments on a state-of-the-art face recognition system in an attempt to answer this question and further the discussion of privacy and consent in the context of data collection.
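To ground the idea of an embedding space discussed above, here is a minimal sketch of how identification typically reduces to a nearest-neighbour search over embedding vectors. The embedding function is left abstract (random vectors stand in for CNN outputs), and the similarity threshold is an arbitrary placeholder rather than any particular system’s setting.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(probe_embedding, gallery, threshold=0.6):
    # gallery: dict mapping identity name -> embedding vector (np.ndarray)
    best_name, best_score = None, -1.0
    for name, emb in gallery.items():
        score = cosine_similarity(probe_embedding, emb)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None

# Toy example: random vectors stand in for face embeddings from a trained CNN.
rng = np.random.default_rng(0)
gallery = {"person_a": rng.normal(size=128), "person_b": rng.normal(size=128)}
probe = gallery["person_a"] + 0.05 * rng.normal(size=128)  # a new photo of person_a
print(identify(probe, gallery))  # expected: person_a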

Facial expression recognition aims to predict the emotion a person is experiencing by analyzing images of their face. Research in this domain is built upon the work of Paul Ekman, a psychologist who has studied the relationship between emotions and facial expressions for more than 60 years. Ekman contends that a person’s emotional state can be readily inferred from their face due to the universality of six basic emotions, consistent across cultures and individuals [10]. A landmark review published in July 2019, however, says otherwise [2]. The review, spearheaded by psychologist Lisa Feldman Barrett, analyzed over 1,000 research papers that studied healthy adults across cultures, newborns and young children, and people who are congenitally blind to determine the reliability and specificity of facial expressions in identifying emotions. Its findings vehemently refute Ekman’s claims. When interacting with others, we do not rely solely on their face to infer their emotional state; body language, tone of voice, word choice, situational context, our relationship, cultural norms, and other factors all contribute to our ability to do so. In its current problem formulation, facial expression recognition with computer vision abstracts all of this context away, reducing a complex task to a classification problem over static images. While people may smile when happy, applying the label “happy” to a static image of a grinning face does not have a solid scientific foundation. With firms such as HireVue using facial expression recognition to screen candidates in job interviews [20], continued research in its current form can bolster unproven technologies that have considerable impacts on people’s lives. In Chapter 4 of this thesis, I revisit the canonical Japanese Female Facial Expression (JAFFE) dataset, widely used in facial expression recognition research, and analyze its collection and use in the context of the aforementioned review, in the hope of communicating these findings to a larger audience.


References

[1]  Blaise Aguera y Arcas, Margaret Mitchell, and Alexander Todorov. “Physiognomy’s new clothes”. In: Medium (6 May 2017). url: https://medium.com/@blaisea/physiognomys-new-clothes-f2d4b59fdd6a.

[2]  Lisa Feldman Barrett et al. “Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements”. In: Psychological Science in the Public Interest 20.1 (2019), pp. 1–68.

[3]  Ruha Benjamin. “2020 Vision: Reimagining the Default Settings of Technology & Society”. In: The International Conference on Learning Representations, Invited Speaker. 2020.

[4]  Léon Bottou et al. “Comparison of classifier methods: a case study in handwritten digit recognition”. In: Proceedings of the 12th IAPR International Conference on Pattern Recognition, Vol. 3-Conference C: Signal Processing (Cat. No. 94CH3440-5). Vol. 2. IEEE. 1994, pp. 77–82.

[5]  Joy Buolamwini and Timnit Gebru. “Gender shades: Intersectional accuracy disparities in commercial gender classification”. In: Conference on Fairness, Accountability and Transparency. 2018, pp. 77–91.

[6]  Canadian Institutes of Health Research, Natural Sciences and Engineering Research Council of Canada, and Social Sciences and Humanities Research Council. Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans. http://www.pre.ethics.gc.ca/eng/documents/tcps2-2018-en-interactive-final.pdf. Dec. 2018.

[7]  Timothy Colburn and Gary Shute. “Abstraction in computer science”. In: Minds and Machines 17.2 (2007), pp. 169–184.

[8]  Kate Crawford and Trevor Paglen. Excavating AI: The Politics of Training Sets for Machine Learning. https://excavating.ai. Sept. 2019.

[9]  Jia Deng et al. “ImageNet: A large-scale hierarchical image database”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2009, pp. 248–255.

[10]  Paul Ekman. “An argument for basic emotions”. In: Cognition & Emotion 6.3-4 (1992), pp. 169–200.

[11]  Mark Everingham et al. “The pascal visual object classes (VOC) challenge”. In: International Journal of Computer Vision 88.2 (2010), pp. 303–338.

[12]  Li Fei-Fei and Jia Deng. “ImageNet: Where are we going? and where have we been”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2017.

[13]  Li Fei-Fei, Rob Fergus, and Pietro Perona. “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2004, pp. 178–178.

[14]  Timnit Gebru and Emily Denton. Tutorial on Fairness, Accountability, Transparency and Ethics in Computer Vision at CVPR 2020. https://sites.google.com/view/fatecv-tutorial/home. June 2020.

[15]  Timnit Gebru et al. “Datasheets for datasets”. In: arXiv preprint arXiv:1803.09010 (2018).

[16]  Dave Gershgorn. “The data that transformed AI research—and possibly the world”. In: Quartz (2017).

[17]  Hu Han and Anil K. Jain. Age, Gender and Race Estimation from Unconstrained Face Images. Tech. rep. MSU-CSE-14-5. Michigan State University, 2014.

[18]  Donna Haraway. Simians, cyborgs, and women: The reinvention of nature. Routledge, 2013.

[19]  Adam Harvey and Jules LaPlace. MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets. 2019. url: https://megapixels.cc/ (visited on 04/18/2019).

[20]  Drew Harwell. “A face-scanning algorithm increasingly decides whether you deserve the job”. In: The Washington Post (Nov. 2019). url: https://www.washingtonpost.com/technology/2019/10/22/ai-hiring-face-scanning-algorithm-increasingly-decides-whether-you-deserve-job/.

[21]  Kashmir Hill. “Wrongfully Accused by an Algorithm”. In: The New York Times (June 2020). url: https://www.nytimes.com/2020/06/24/technology/facial-recognition-arrest.html.

[22]  Kashmir Hill and Aaron Krolik. “How Photos of Your Kids Are Powering Surveillance Technology”. In: The New York Times (Oct. 2019). url: https://www.nytimes.com/interactive/2019/10/11/technology/flickr-facial-recognition.html.

[23]  Gary B. Huang et al. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. Tech. rep. 07-49. University of Massachusetts, Amherst, Oct. 2007.

[24]  Illinois General Assembly. 740 ILCS 14 / Biometric Information Privacy Act. http://www.ilga.gov/legislation/ilcs/ilcs3.asp?ActID=3004&ChapterID=57. Oct. 2008.

[25]  Eun Seo Jo and Timnit Gebru. “Lessons from archives: strategies for collecting sociocultural data in machine learning”. In: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 2020, pp. 306–316.

[26]  Will Kay et al. “The kinetics human action video dataset”. In: arXiv preprint arXiv:1705.06950 (2017).

[27]  Ira Kemelmacher-Shlizerman et al. “The megaface benchmark: 1 million faces for recognition at scale”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 4873–4882.

[28]  Ranjay Krishna et al. “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations”. In: arXiv preprint arXiv:1602.07332 (2016). url: https://arxiv.org/abs/1602.07332.

[29]  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “ImageNet classification with deep convolutional neural networks”. In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.

[30]  J Robert Lilly, Francis T Cullen, and Richard A Ball. Criminological theory: Context and consequences. Sage publications, 2018.

[31]  Tsung-Yi Lin et al. “Microsoft COCO: Common objects in context”. In: Proceedings of the European Conference on Computer Vision. Springer. 2014, pp. 740–755.

[32]  Anthony C Little, Benedict C Jones, and Lisa M DeBruine. “Facial attractiveness: evolutionary based research”. In: Philosophical Transactions of the Royal Society B: Biological Sciences 366.1571 (2011), pp. 1638–1659.

[33]  Michael J Lyons et al. “The Japanese female facial expression (JAFFE) database”. In: Proceedings of Third International Conference on Automatic Face and Gesture Recognition. 1998, pp. 14–16.

[34]  Dhruv Mahajan et al. “Exploring the limits of weakly supervised pretraining”. In: Proceedings of the European Conference on Computer Vision. 2018, pp. 181–196.

[35]  John Markoff. “Seeking a Better Way to Find Web Images”. In: The New York Times (2012).

[36]  Yunliang Meng. “Racially biased policing and neighborhood characteristics: A Case Study in Toronto, Canada”. In: Cybergeo: European Journal of Geography (2014).

[37]  Michele Merler et al. “Diversity in faces”. In: arXiv preprint arXiv:1901.10436 (2019).

[38]  Jacob Metcalf and Kate Crawford. “Where are human subjects in big data research? The emerging ethics divide”. In: Big Data & Society 3.1 (2016), p. 2053951716650211.

[39]  George A Miller. “WordNet: a lexical database for English”. In: Communications of the ACM 38.11 (1995), pp. 39–41.

[40]  Clayton James Mosher. Discrimination and denial: Systemic racism in Ontario’s legal and criminal justice systems, 1892-1961. University of Toronto Press, 1998.

[41]  Sameer A Nene, Shree K Nayar, and Hiroshi Murase. “Columbia object image library (COIL-20)”. Tech. rep. CUCS-005-96. Columbia University, 1996.

[42]  P Jonathon Phillips et al. “The FERET database and evaluation procedure for face-recognition algorithms”. In: Image and Vision Computing 16.5 (1998), pp. 295–306.

[43]  Inioluwa Deborah Raji and Joy Buolamwini. “Actionable Auditing: Investigating the Impact of Publicly Naming Biased Performance Results of Commercial AI Products”. In: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society. AIES ’19. 2019, pp. 429–435.

[44]  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. “U-net: Convolutional networks for biomedical image segmentation”. In: International Conference on Medical image computing and computer-assisted intervention. Springer. 2015, pp. 234–241.

[45]  Olga Russakovsky et al. “ImageNet large scale visual recognition challenge”. In: International Journal of Computer Vision 115.3 (2015), pp. 211–252.

[46]  Olivia Solon. “Facial recognition’s ‘dirty little secret’: Millions of online photos scraped without consent”. In: NBCNews.com (Mar. 2019). url: https://www.nbcnews.com/tech/internet/facial-recognition-s-dirty-little-secret-millions-online-photos-scraped-n981921.

[47]  Pierre Stock and Moustapha Cisse. “Convnets and ImageNet beyond accuracy: Understanding mistakes and uncovering biases”. In: Proceedings of the European Conference on Computer Vision. 2018, pp. 498–512.

[48]  Chen Sun et al. “Revisiting unreasonable effectiveness of data in deep learning era”. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 843–852.

[49]  Harini Suresh and John V Guttag. “A framework for understanding unintended consequences of machine learning”. In: arXiv preprint arXiv:1901.10002 (2019).

[50]  Yaniv Taigman et al. “Deepface: Closing the gap to human-level performance in face verification”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 1701–1708.

[51]  Alexander Toshev and Christian Szegedy. “Deeppose: Human pose estimation via deep neural networks”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 1653–1660.

[52]  Benjamin Wilson, Judy Hoffman, and Jamie Morgenstern. “Predictive inequity in object detection”. In: arXiv preprint arXiv:1902.11097 (2019).

[53]  Xiaolin Wu and Xi Zhang. “Automated inference on criminality using face images”. In: arXiv preprint arXiv:1611.04135 (2016).

[54]  Duorui Xie et al. “SCUT-FBP: A benchmark dataset for facial beauty perception”. In: 2015 IEEE International Conference on Systems, Man, and Cybernetics. IEEE. 2015, pp. 1821–1826.

[55]  Qizhe Xie et al. “Self-training with noisy student improves ImageNet classification”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2020, pp. 10687–10698.

[56]  Kaiyu Yang et al. “Towards fairer datasets: Filtering and balancing the distribution of the people subtree in the ImageNet hierarchy”. In: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 2020, pp. 547–558.

[57]  Ting Zhang. “Facial expression recognition based on deep learning: a survey”. In: International Conference on Intelligent and Interactive Systems and Applications. Springer. 2017, pp. 345–352.