Here we outline the computational and experimental techniques we use to compare gender bias in online images and texts. We begin by describing the methods of data collection and analysis developed for the observational component of our study. Then we detail the study design deployed in our online search experiment. The preregistration for our online experiment is available at https://osf.io/3jhzx. Note that this study is a successful replication of a previous study with a nearly identical design, except that the original study included neither a control condition nor several versions of the text condition; the preregistration of the previous study is available at https://osf.io/26kbr.
Observational methods
Data collection procedure for online images
Our crowdsourcing methodology consisted of four steps (Extended Data Fig. 1). First, we gathered all social categories in WordNet, a canonical lexical database of English. WordNet contained 3,495 social categories, including occupations (such as ‘physicist’) and generic social roles (such as ‘colleague’). Second, we collected the images associated with each category from both Google and Wikipedia. Third, we used Python’s OpenCV—a popular open-source computer vision library—to extract the faces from each image; its face-detection algorithm automatically isolates each face and extracts a square crop containing the entire face and minimal surrounding context. Using OpenCV to extract faces helped us to ensure that each face in each image was separately classified in a standardized manner, and to avoid subjective biases in coders’ decisions about which face to focus on and categorize in each image. Fourth, we hired 6,392 human coders from MTurk to classify the gender of the faces. Following earlier work, each face was classified by three unique annotators16,17, so that the gender of each face (‘male’ or ‘female’) could be identified from the majority (modal) gender classification across the three coders (we also gave coders the option of labelling the gender of faces as ‘non-binary’, but this option was chosen in only 2% of cases, so we excluded these data from our main analyses and recollected classifications until each face was associated with three unique coders using either the ‘male’ or the ‘female’ label). Although coders were asked to label the gender of the face presented, our measure is agnostic to which features the coders used to determine their gender classifications; they may have used facial features, as well as features relating to the aesthetics of expressed gender such as hair or accessories. Each search was implemented from a fresh Google account with no prior history. Searches were run in August 2020 by ten distinct data servers in New York City. This study was approved by the Institutional Review Board at the University of California, Berkeley, and all participants provided informed consent.
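As an illustration of the face-extraction step, the following minimal sketch uses OpenCV's bundled Haar-cascade face detector; the specific detector, parameters and file names are assumptions for illustration rather than the exact pipeline settings.

```python
import cv2

# Load OpenCV's bundled Haar-cascade frontal-face detector (an illustrative choice;
# the production pipeline may use a different OpenCV detector or parameters).
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

def extract_faces(image_path):
    """Return a list of square face crops detected in the image."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    crops = []
    for (x, y, w, h) in faces:
        side = max(w, h)  # square crop covering the whole face with minimal context
        crops.append(image[y:y + side, x:x + side])
    return crops

# Hypothetical usage: save each detected face from one downloaded search result.
for i, face in enumerate(extract_faces("physicist_001.jpg")):
    cv2.imwrite(f"physicist_001_face{i}.png", face)
```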
To collect images from Google, we followed earlier work by retrieving the top 100 images that appeared when using each of the 3,495 categories to search for images using the public Google Images search engine16,17,18 (Google provides roughly 100 images for its initial search results). To collect images from Wikipedia, we identified the images associated with each social category in the 2021 Wikipedia-based Image Text Dataset (WIT)27. WIT maps all images across Wikipedia to textual descriptions on the basis of the title, content and metadata of the active Wikipedia articles in which they appear. WIT contained images associated with 1,523 social categories from WordNet across all English Wikipedia articles (see Supplementary Information section A.1.1 for details on our Wikipedia analysis). The coders identified 18% of images as not containing a human face; these were removed from our analyses. We also asked all annotators to complete an attention check, which involved choosing the correct answer to the common-sense question “What is the opposite of the word ‘down’?” from the following options: ‘Fish’, ‘Up’, ‘Monk’ and ‘Apple’. We removed the data from all annotators who failed an attention check (15%), and we continued collecting classifications until each image was associated with the judgements of three unique coders, all of whom passed the attention check.
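The aggregation of coder judgements (attention-check filtering followed by a majority vote over three valid classifications per face) can be sketched as follows; the tabular layout and column names are illustrative assumptions, not the pipeline's actual schema.

```python
import pandas as pd

# Hypothetical annotation table: one row per (face, coder) judgement.
annotations = pd.DataFrame({
    "face_id":  ["f1", "f1", "f1", "f2", "f2", "f2"],
    "coder_id": ["c1", "c2", "c3", "c4", "c5", "c6"],
    "passed_attention_check": [True, True, True, True, False, True],
    "gender":   ["male", "male", "female", "female", "female", "male"],
})

# Drop all judgements from coders who failed the attention check.
valid = annotations[annotations["passed_attention_check"]]

# Keep only faces with three valid judgements (others would be recollected),
# then take the modal gender label for each face.
counts = valid.groupby("face_id")["coder_id"].nunique()
complete = valid[valid["face_id"].isin(counts[counts >= 3].index)]
modal_gender = complete.groupby("face_id")["gender"].agg(lambda g: g.mode().iloc[0])
print(modal_gender)
```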
Collecting human judgements of social categories
We hired a separate sample of 2,500 human coders from MTurk to complete a survey study in which they were presented with social categories (five categories per task) and asked to evaluate each category by means of the following question (each category was assessed by 20 unique human coders): “Which gender do you most expect to belong to this category?” Responses were given on a continuous scale using a slider ranging from −1 (female) to 1 (male). All MTurkers were prescreened so that only US-based MTurkers who were fluent in English were invited to participate in this task.
Demographics of human coders
The human coders were all adults based in the USA who were fluent in English. Supplementary Table 1 indicates that our main results are robust to controlling for the demographic composition of our human coders. Among our coders, 44.2% identified as female, 50.6% as male and 3.2% as non-binary; the remainder preferred not to disclose. In terms of age, 42.6% identified as being 18–24 years, 22.9% as 25–34, 32.5% as 35–54, 1.6% as 55–74 and less than 1% as more than 75. In terms of race, 46.8% identified as Caucasian, 11.6% as African American, 17% as Asian, 9% as Hispanic and 10.3% as Native American; the remainder identified as either mixed race or preferred not to disclose. In terms of political ideology, 37.2% identified as conservative, 33.8% as liberal, 20.3% as independent and 3.9% as other; the remainder preferred not to disclose. In terms of annual income, 14.3% reported making less than US$10,000, 33.4% reported US$10,000–50,000, 22.7% reported US$50,000–75,000, 14.9% reported US$75,000–100,000, 10.5% reported US$100,000–150,000, 2.8% reported US$150,000–250,000 and less than 1% reported more than US$250,000; the remainder preferred not to disclose. In terms of the highest level of education acquired by each annotator, 2.7% selected ‘Below High School’, 17.5% selected ‘High School’, 29.2% selected ‘Technical/Community College’, 34.5% selected ‘Undergraduate degree’, 14.8% selected ‘Master’s degree’ and less than 1% selected ‘Doctorate degree’; the remainder preferred not to disclose.
Constructing a gender dimension in word embedding space
Our method for measuring gender associations in text relies on the fact that word embedding models use the frequency of co-occurrence among words in text (for example, whether they occur in the same sentence) to position words in an n-dimensional space, such that words that co-occur together more frequently are represented as closer together in this n-dimensional space. The ‘embedding’ for a given word refers to the specific position of this word in the n-dimensional space constructed by the model. The cosine distance between word embeddings in this vector space provides a robust measure of semantic similarity that is widely used to unpack the cultural meanings associated with categories13,22,31. To construct a gender dimension in word embedding space, we adopt the methodology recently developed by Kozlowski et al.22. In their paper, Kozlowski et al.22 construct a gender dimension in embedding space along which different categories can be positioned (for example, their analysis focuses on types of sport). They start by identifying two clustered regions in word embedding space corresponding to traditional representations of females and males, respectively. Specifically, the female cluster consists of the words ‘woman’, ‘her’, ‘she’, ‘female’ and ‘girl’, and the male cluster consists of the words ‘man’, ‘his’, ‘he’, ‘male’ and ‘boy’. Then, for each of the 3,495 social categories in WordNet, we calculated the average cosine distance between this category and both the female and the male clusters. Each category, therefore, was associated with two numbers: its cosine distance with the female cluster (averaged across its cosine distance with each term in the female cluster), and its cosine distance with the male cluster (averaged across its cosine distance with each term in the male cluster). Taking the difference between a category’s cosine distance with the female and male clusters allowed each category to be positioned along a −1 (female) to 1 (male) scale in embedding space. The category ‘aunt’, for instance, falls close to −1 along this scale, whereas the category ‘uncle’ falls close to 1 along this scale. Of the categories in WordNet, 2,986 of them were associated with embeddings in the 300-dimensional word2vec model of Google News, and could therefore be positioned along this scale. All of our results are robust to using different terms to construct the poles of this gender dimension (Supplementary Fig. 18). However, our main analyses use the same gender clusters as ref. 22.
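A minimal sketch of this positioning procedure using Gensim's pretrained Google News word2vec vectors (300 dimensions) follows. The cosine-similarity formulation below is equivalent, up to sign, to the cosine-distance difference described above; the model identifier is the standard Gensim downloader name, and the printed values are only indicative of sign, not of the exact scores.

```python
import numpy as np
import gensim.downloader as api

# Pretrained 300-dimensional word2vec vectors trained on Google News.
model = api.load("word2vec-google-news-300")

FEMALE = ["woman", "her", "she", "female", "girl"]
MALE = ["man", "his", "he", "male", "boy"]

def gender_association(category):
    """Position a category on a female (-1) to male (+1) dimension.

    Computed as the category's average cosine similarity to the male cluster
    minus its average cosine similarity to the female cluster, which equals
    the difference in average cosine distances (distance = 1 - similarity).
    """
    fem = np.mean([model.similarity(category, w) for w in FEMALE])
    mal = np.mean([model.similarity(category, w) for w in MALE])
    return mal - fem

print(gender_association("aunt"))   # negative: falls towards the female pole
print(gender_association("uncle"))  # positive: falls towards the male pole
```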
To compute distances between the vectors of social categories represented by bigrams (such as ‘professional dancer’), we used the Phrases class in the Gensim Python package, which provides built-in functionality for identifying bigrams and constructing their embeddings. This method works by identifying the n-dimensional vector of middle positions between the vectors corresponding separately to each word in the bigram (for example, ‘professional’ and ‘dancer’). This middle vector is then treated as the single vector corresponding to the bigram ‘professional dancer’ and is used to calculate distances from other category vectors. The same method was applied to construct embeddings for all bigram categories in all models.
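As an illustration, the midpoint construction for a bigram category can be sketched directly from the constituent word vectors; this is a simplified reimplementation with toy three-dimensional vectors rather than the exact Gensim call used in the pipeline.

```python
import numpy as np

def bigram_vector(model, bigram):
    """Return the midpoint vector for a two-word category such as 'professional dancer'."""
    first, second = bigram.split()
    return (model[first] + model[second]) / 2.0

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy demonstration with made-up 3-dimensional vectors (a real model has 300 dimensions).
toy_model = {
    "professional": np.array([0.2, 0.9, 0.1]),
    "dancer": np.array([0.8, 0.1, 0.3]),
    "ballerina": np.array([0.7, 0.2, 0.3]),
}
vec = bigram_vector(toy_model, "professional dancer")
print(cosine_similarity(vec, toy_model["ballerina"]))
```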
To maximize the similarity between our text-based and image-based measures of gender association, we adopted the following three techniques. First, we normalized our textual measure of gender associations using minimum–maximum normalization, which ensured that a compatible range of values was covered by both our text-based and image-based measures of gender association. This is helpful because the distribution of gender associations for the image-based measure stretched to both ends of the −1 to 1 continuum as a result of certain categories being associated with 100% female faces or 100% male faces. By contrast, although the textual measure described above contains a −1 (female) to 1 (male) scale, the most female category in our WordNet sample has a gender association of −0.42 (‘chairwoman’), and the most male category has a gender association of 0.33 (‘guy’). Normalization ensures that the distribution of gender associations in the image- and text-based measures both equally cover the −1 to 1 continuum, so that paired comparisons between these scales (matched at the category level) can directly examine the relative ranking of a category’s gender association in each measure. Minimum–maximum normalization is given by the following equation:
$$\widetilde{{x}_{i}}=\frac{{x}_{i}-{x}_{\min }}{{x}_{\max }-{x}_{\min }}\qquad(1)$$
where \({x}_{i}\) represents the gender association of category \(i\) (ranging from −1 to 1), \({x}_{\min }\) represents the category with the lowest gender score, \({x}_{\max }\) represents the category with the highest gender score, and \(\widetilde{{x}_{i}}\) represents the normalized gender association of category \(i\). To preserve the −1 to 1 scale when applying minimum–maximum normalization, we applied this procedure separately to the male-skewed categories (that is, all categories with a gender association above 0), such that \({x}_{\min }\) represents the least male of the male categories and \({x}_{\max }\) represents the most male. We applied the same procedure to the female-skewed categories, whose scale runs from −1 to 0, rescaling their scores to a 0–1 range and then reflecting this range onto −1 to 0, so that −1 represented the most female of the female categories and 0 represented the least. We then appended the female-normalized (−1 to 0) and male-normalized (0 to 1) scales. Both the male and female scales before normalization contained categories with values within four decimal places of zero (|x| < 0.0001), so this normalization did not arbitrarily push any categories towards 0. Instead, it has the advantage of stretching out the text-based measure of gender association so that a substantial fraction of categories reach all the way to the −1 (female) and 1 (male) ends of the continuum, similar to the distribution of values for the image-based measure.
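A simplified sketch of this piecewise normalization is given below, applied separately to male-skewed and female-skewed categories; the handling of the female side (mapping the most female category to −1) reflects the outcome described above, and the function and variable names are ours.

```python
def normalize_gender_scores(scores):
    """Min-max normalize gender scores separately for male-skewed (> 0) and
    female-skewed (<= 0) categories, so the result spans -1 to 1.

    scores: dict mapping category -> raw gender association in [-1, 1].
    """
    male = {c: s for c, s in scores.items() if s > 0}
    female = {c: s for c, s in scores.items() if s <= 0}

    normalized = {}
    if male:
        lo, hi = min(male.values()), max(male.values())
        for c, s in male.items():
            normalized[c] = (s - lo) / (hi - lo)        # 0 = least male, 1 = most male
    if female:
        mags = {c: abs(s) for c, s in female.items()}   # work with magnitudes on the female side
        lo, hi = min(mags.values()), max(mags.values())
        for c, m in mags.items():
            normalized[c] = -(m - lo) / (hi - lo)       # -1 = most female, 0 = least female
    return normalized

# Toy example using the extreme categories mentioned in the text plus two made-up values.
print(normalize_gender_scores({"chairwoman": -0.42, "nurse": -0.30, "engineer": 0.10, "guy": 0.33}))
```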
Experimental methods
Participant pool
For this experiment, a nationally representative sample of participants (n = 600) was recruited from the popular crowdsourcing platform Prolific, which provides a vetted panel of high-quality human participants for online research. No statistical methods were used to determine this sample size. A total of 575 participants completed the task, an attrition rate of 4.2%; we only examine data from participants who completed the experiment. Our main results report the outcomes associated with the Image, Text and Control conditions (n = 423); in the Supplementary Information, we report the results of an additional version of the Text condition involving the generic Google search bar (n = 150; Supplementary Fig. 26). To recruit a nationally representative sample, we used Prolific’s prescreening functionality, which is designed to provide a sample that is nationally representative of the USA along the dimensions of sex, age and ethnicity. Participants were invited to take part in the study only if they were based in the USA, fluent English speakers and aged over 18 years. A total of 50.8% of participants were female (no participants identified as non-binary). All participants provided informed consent before participating. This experiment was run on 5 March 2022.
Participant experience
Extended Data Fig. 2 presents a schematic of the full experimental design. This experiment was approved by the Institutional Review Board at the University of California, Berkeley. In this experiment, participants were randomized to one of four conditions: (1) the Image condition (in which they used the Google Image search engine to retrieve images of occupations), (2) the Google News Text condition (in which they used the Google News search engine, that is, news.google.com, to retrieve textual descriptions of occupations), (3) the Google Neutral Text condition (in which they used the generic Google search engine, that is, google.com, to retrieve textual descriptions of occupations) and (4) the Control condition (in which they were asked at random to use either Google Images or the neutral (standard) Google search engine to retrieve descriptions of random, non-gendered categories, such as ‘apple’). Note that, in the main text, we report the experimental results comparing the Image, Control and Google News Text conditions; we present the results concerning the Google Neutral Text condition as a robustness test in the Supplementary Information (Supplementary Fig. 26).
After uploading a description for a given occupation, participants used a −1 (female) to 1 (male) scale to indicate which gender they most associate with this occupation. In this way, the scale participants used to indicate their gender associations was identical to the scale we used to measure gender associations in our observational analyses of online images and text. In the control condition, participants were asked to indicate which gender they associate with a given randomly selected occupation after uploading a description for an unrelated category. Participants in all conditions completed this sequence for 22 unique occupations (randomly sampled from a broader set of 54 occupations). These occupations were selected to include occupations from science, technology, engineering and mathematics, and the liberal arts. Each occupation that was used as a stimulus could also be associated with our observational data concerning the gender associations measured in images from Google Images and the texts of Google News. Here is the full preregistered list of occupations used as stimuli: immunologist, mathematician, harpist, painter, piano player, aeronautical engineer, applied scientist, geneticist, astrophysicist, professional dancer, fashion model, graphic designer, hygienist, educator, intelligence analyst, logician, intelligence agent, financial analyst, chief executive officer, clarinetist, chiropractor, computer expert, intellectual, climatologist, systems analyst, programmer, poet, astronaut, professor, automotive engineer, cardiologist, neurobiologist, English professor, number theorist, marine engineer, bookkeeper, dietician, model, trained nurse, cosmetic surgeon, fashion designer, nurse practitioner, art teacher, singer, interior decorator, media consultant, art student, dressmaker, English teacher, literary agent, social worker, screen actor, editor-in-chief, schoolteacher. The set of occupations that participants evaluated was identical across conditions.
Once each participant completed this task for 22 occupations, they were asked to complete an implicit association test (IAT) designed to measure the implicit bias towards associating men with science and women with liberal arts33,34,35,38. The IAT was identical across conditions (‘Measuring implicit bias using the IAT’). In total, the experiment took participants approximately 35 minutes to complete. Participants were compensated at a rate of US$15 per hour for their participation.
Measuring implicit bias using the IAT
The IAT in our experiment was designed using the iatgen tool33 (https://iatgen.wordpress.com/). The IAT is a psychological research tool for measuring mental associations between target pairs (for example, different races or genders) and a category dimension (for example, positive–negative, science–liberal arts). Rather than measuring what people explicitly believe through self-report, the IAT measures what people mentally associate and how quickly they make these associations. The IAT has the following design (description borrowed from iatgen)33: “The IAT consists of seven ‘blocks’ (sets of trials). In each trial, participants see a stimulus word on the screen. Stimuli represent ‘targets’ (for example, insects and flowers) or the category (for example, pleasant–unpleasant). When stimuli appear, the participant ‘sorts’ the stimulus as rapidly as possible by pressing with either their left or right hand on the keyboard (in iatgen, the ‘E’ and ‘I’ keys). The sides with which one should press are indicated in the upper left and right corners of the screen. The response speed is measured in milliseconds.” For example, in some sections of our study, a participant might press with the left hand for all male + science stimuli and with their right hand for all female + liberal arts stimuli.
The theory behind the IAT is that the participant will be fast at sorting in a manner that is consistent with one’s latent associations, which is expected to lead to greater cognitive fluency in one’s intuitive reactions. For example, the expectation is that someone will be faster when sorting flowers + pleasant stimuli with one hand and insects + unpleasant with the other, as this is (most likely) consistent with people’s implicit mental associations (example borrowed from iatgen). Yet, when the category pairings are flipped, people should have to engage in cognitive work to override their mental associations, and the task should be slower. The degree to which one is faster in one section or the other is a measure of one’s implicit bias.
In our study, the target pairs we used were ‘male’ and ‘female’ (corresponding to gender), and the category dimension referred to science–liberal arts. To construct the IAT, we followed the design used by Rezaei38. For the male words in the pairs, we used the following terms: man, boy, father, male, grandpa, husband, son, uncle. For the female words in the pairs, we used the following terms: woman, girl, mother, female, grandma, wife, daughter, aunt. For the science category, we used the following words: biology, physics, chemistry, math, geology, astronomy, engineering, medicine, computing, artificial intelligence, statistics. For the liberal arts category, we used the following words: philosophy, humanities, arts, literature, English, music, history, poetry, fashion, film. Extended Data Figs. 3–6 illustrate the four main IAT blocks that participants completed (as per standard IAT design, participants were also shown blocks 2, 3 and 4, with the left–right arrangement of targets reversed). Participants completed seven blocks in total, sequentially. The IAT instructions for Extended Data Fig. 3 state, “Place your left and right index fingers on the E and I keys. At the top of the screen are 2 categories. In the task, words and/or images appear in the middle of the screen. When the word/image belongs to the category on the left, press the E key as fast as you can. When it belongs to the category on the right, press the I key as fast as you can. If you make an error, a red X will appear. Correct errors by hitting the other key. Please try to go as fast as you can while making as few errors as possible. When you are ready, please press the [Space] bar to begin.” These instructions are repeated throughout all blocks in the task.
To measure implicit bias on the basis of participants’ reaction times during the IAT, we adopted the standard approach used by iatgen. We combined the scores across all four critical blocks (blocks 3, 4, 6 and 7 in iatgen) and computed each participant’s within-person difference in response latency between the two pairings. Because some participants are faster than others overall, raw latencies add statistical ‘noise’ owing to variance in overall reaction times. Thus, instead of comparing within-person differences in raw latencies, this difference is standardized at the participant level, dividing the within-person difference by a ‘pooled’ standard deviation computed from the practice and critical blocks combined. This yields a D score. In iatgen, a positive D value indicates an association in the form of target A + positive and target B + negative (in our case, male + science and female + liberal arts), whereas a negative D value indicates the opposite bias (target A + negative and target B + positive; in our case, male + liberal arts and female + science), and a zero score indicates no bias.
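A simplified sketch of the D-score logic follows; the full iatgen scoring procedure additionally trims extreme latencies and handles error trials, so the block structure, values and function name here are illustrative only.

```python
import statistics

def d_score(compatible_rts, incompatible_rts):
    """Simplified IAT D score.

    compatible_rts: reaction times (ms) from blocks pairing male + science
                    and female + liberal arts.
    incompatible_rts: reaction times from the reversed pairings.
    Positive D = faster on the compatible pairing, i.e. a male + science,
    female + liberal arts association; negative D = the opposite; 0 = no bias.
    """
    pooled_sd = statistics.stdev(compatible_rts + incompatible_rts)
    return (statistics.mean(incompatible_rts) - statistics.mean(compatible_rts)) / pooled_sd

# Toy example: responses are slower when the pairing conflicts with the association.
print(d_score([650, 700, 620, 680], [820, 790, 850, 800]))  # > 0
```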
Our main experimental results evaluate the relationship between the participants’ explicit and implicit gender associations and the strength of gender associations in the Google images and textual descriptions they encountered during the search task. The strength of participants’ explicit gender associations is calculated as the absolute value of the number they input using the −1 (female) to 1 (male) scale after each occupation they classified (Extended Data Fig. 2). Participants’ implicit bias is measured by the D score of their results on the IAT designed to detect associations between men and science and women and liberal arts. To measure the strength of gender associations in the Google images that participants encountered, we calculated the gender parity of the faces uploaded across all participants who classified a given occupation. For example, we identified the responses of all participants who provided image search results for the occupation ‘geneticist’, and we constructed the same gender dimension as described in the main text, such that −1 represents 100% female faces, 0 represents an even split of female and male faces, and 1 represents 100% male faces. To identify the gender of the faces in the images that participants uploaded, we recruited a separate panel of MTurk workers (n = 500) who classified each face (there were 3,300 images in total). Each face was classified by two unique MTurkers; if they disagreed in their gender assignment, a third MTurk worker was hired to provide a response, and the gender identified by the majority was selected. We adopted an analogous approach to annotating the gender of the textual descriptions that participants uploaded in the text condition. These annotators identified whether each textual or visual description uploaded by participants was female (−1), neutral (0) or male (1). Each textual description was coded as male, female or neutral on the basis of whether it used male or female pronouns or names to describe the occupation (for example, referred to a ‘doctor’ as ‘he’); textual descriptions were identified as neutral if they did not ascribe a particular gender to the occupation described. We were then able to calculate the same measure of gender balance in the textual descriptions uploaded for each occupation as we applied in our image analysis.
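The per-occupation gender-balance measure for uploaded search results can be sketched as a simple average of the coded labels; the function name and toy data below are illustrative.

```python
def occupation_gender_score(labels):
    """Average the coded gender of all uploads for one occupation.

    labels: list of per-upload codes, -1 (female), 0 (neutral), 1 (male).
    Returns a value from -1 (all female) to 1 (all male); 0 indicates an even split.
    """
    return sum(labels) / len(labels)

# Hypothetical example: uploads for 'geneticist' as coded by the annotator panel.
print(occupation_gender_score([1, 1, -1, 0, 1]))  # 0.4, male-skewed
```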
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.