Although the automation tools used in trust and safety continue to evolve, the field relies on a set of basic models and methods that are commonly used for policy enforcement, prevalence measurement, and routing. This section provides a brief overview of the most commonly deployed types of automation tools, along with examples of how they may be used in a trust and safety context.
Blocklisting and Allowlisting
Note: Blocklisting is occasionally referred to as deny listing, disallow listing, and blacklisting. “Blacklisting” is a legacy term and is no longer widely used due to racist connotations. Allowlisting is also sometimes called passlisting or whitelisting. Similar to “blacklisting,” “whitelisting” is a legacy term and is no longer widely used.
Blocklisting is a technique used to prevent, restrict, or limit access to content or the creation of content based on a predefined list. Blocklisting is among the most common and usually one of the first automated methods many services use. Blocklisting can be used to block:
- Specific pieces of content;
- Content that fits a predefined condition (e.g., any comment containing a particular word);
- Specific users who were previously banned, using identifying features such as email address or unique device identifiers;
- Sources with specific attributes such as IP addresses, sites, or links associated with spam, phishing, fraud, or malware.
Blocklisting can be an inexpensive and easily scalable method of enforcement and is best for policies and rules that are simple and clear. Despite the name, not all blocklisting actions fully block access or remove content. Temporary blocks, feature restrictions, demotion in rankings, and other weaker actions can also be employed. Such actions can also be taken proactively when dealing with sources that are unknown or suspicious in some way. This is sometimes referred to as “greylisting.”
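The list-based checks described above can be sketched in a few lines. This is a minimal illustration rather than a production design: the `ListPolicy` structure, the `evaluate_comment` function, the action names, and all example terms and domains are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ListPolicy:
    # All three lists are hypothetical examples of predefined lists.
    blocked_terms: set = field(default_factory=set)
    blocked_domains: set = field(default_factory=set)
    greylisted_domains: set = field(default_factory=set)


def evaluate_comment(text, link_domains, policy):
    """Return "block", "restrict", or "allow" for a comment."""
    # Normalize words so "FreeCoinz!" matches the listed term "freecoinz".
    words = {w.strip(".,!?;:\"'()").lower() for w in text.split()}
    if words & policy.blocked_terms:
        return "block"
    domains = {d.lower() for d in link_domains}
    if domains & policy.blocked_domains:
        return "block"
    # Greylisted sources get a weaker action instead of a full block.
    if domains & policy.greylisted_domains:
        return "restrict"
    return "allow"


policy = ListPolicy(
    blocked_terms={"freecoinz"},
    blocked_domains={"phish.example"},
    greylisted_domains={"new-site.example"},
)
```

Because each check is a set lookup, the cost stays low as the lists grow, which is one reason this approach scales inexpensively.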
Allowlisting is the opposite of blocklisting. While blocklisting prevents content from being added, allowlisting prevents content from being removed or actioned erroneously. Allowlisting is most often used in situations where the chances of a policy violation are extremely low and/or the harm caused by a penalty would be extremely high. Allowlisting is often applied to entities or accounts, such as:
- Accounts belonging to the platform itself;
- News outlets;
- Public figures (e.g., celebrities or politicians);
- Government institutions;
- Major businesses and brands.
Accounts can be selected for an allowlist automatically, based on signals such as an account’s number of “followers” or its verified or partner status, or through manually curated lists. Allowlisting can also protect an account from automatic action by requiring a higher standard of review before a penalty is imposed when the account’s content appears to violate policy.
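One way to picture this routing is the sketch below. It assumes a hypothetical upstream `violation_score` between 0 and 1, and the thresholds, field names, and action labels are illustrative assumptions rather than any platform's actual rules.

```python
def is_allowlisted(account, manual_allowlist, follower_threshold=1_000_000):
    """Allowlist by manual curation or by automatic signals."""
    if account["id"] in manual_allowlist:
        return True
    return account["verified"] and account["followers"] >= follower_threshold


def route_flag(account, violation_score, manual_allowlist):
    """Decide what happens when content from an account is flagged."""
    if violation_score < 0.5:
        return "no_action"
    # Allowlisted accounts are never actioned automatically; they are
    # sent to a higher standard of review instead.
    if is_allowlisted(account, manual_allowlist):
        return "escalated_review"
    if violation_score >= 0.9:
        return "auto_action"
    return "standard_review"


news_outlet = {"id": "news-1", "verified": True, "followers": 5_000_000}
new_user = {"id": "user-9", "verified": False, "followers": 12}
curated = {"platform-official"}
```

Note that the allowlisted path still ends in review: the sketch delays or redirects enforcement rather than exempting the account from policy.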
In-depth Look: Allowlisting and Nonconsensual Imagery Posts by Celebrities
A well-publicized example where allowlisting was raised as an issue took place in June 2019 in Brazil. A professional footballer accused of rape posted a video to Instagram denying the accusation and claiming extortion. As part of that denial, he also included screenshots of private messages and semi-nude images posted without consent. The video was deleted approximately a day later.
The incident was later linked to Meta policies involving allowlisting, most notably the XCheck system, which Meta described as giving a second layer of review to make sure policies are applied correctly. Media outlets raised the issue of whether this system applied a different standard or quality of review, possibly resulting in unequal treatment for celebrities. Another concern was that the system may have delayed the takedown of violating content, or that some violating content may not be prioritized for review at all.
Following a review in 2021, Meta’s Oversight Board criticized the company for “failing to provide relevant information” about the XCheck program and how it has been used.
Similarity Learning
Similarity learning compares a piece of content to other similar content that has already been removed or actioned in some way. If the former content is the same as or extremely similar to the latter, it will be automatically removed or actioned (or sent to a human reviewer for confirmation). This type of AI is often used to handle banned viral images and videos, as well as copyright enforcement and spam. This approach allows detection of near-but-not-exact duplicates, which otherwise would require additional review time to evaluate. Similarity learning is particularly helpful for removing reposts, reshares, or reuploads of the same, but slightly altered, material.
Because similarity learning is not designed to identify context, it is an unreliable tool for removing content when context is required. For example, news accounts or journalists may include copyrighted videos or images in their news reports, but their use may fall under various fair use laws. Similarity learning may flag fair use content as a copyright infringement. Another example is when a piece of content is used for different purposes, such as a human rights group documenting war crimes versus a terrorist organization using the same content for propaganda. The former use may be acceptable, whereas the latter use may violate the platform’s policies.
Bad actors often attempt to evade similarity learning systems by making alterations to content that make it appear different to those systems while still being recognizable to humans. Common examples of this include adding border frames or random noise to images and video or marginally changing the pitch or speed for audio. The further the content gets from the original, the more difficult it is to automatically match it with high levels of confidence.
Hash Matching
Comparing large files to each other can be slow and computationally expensive, particularly with large databases of banned content. One method to streamline this process is to give a piece of content a unique digital fingerprint, known as a hash. This is done by running the content through a hash function, which converts large files into a fixed-length string of characters. These strings are much easier to compare quickly and at scale than the original files they represent.
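As a minimal sketch of this idea, the example below uses SHA-256 from Python's standard `hashlib` module as the hash function. The `fingerprint` and `is_known_banned` helpers and the sample banned content are hypothetical, and a real deployment would choose its hash function based on the considerations discussed later in this section.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Convert a file's bytes into a fixed-length hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical database of hashes of previously banned files.
banned_hashes = {fingerprint(b"example banned file contents")}


def is_known_banned(data: bytes) -> bool:
    # Comparing short strings is much cheaper than comparing whole files.
    return fingerprint(data) in banned_hashes
```

Whatever the size of the input, the SHA-256 fingerprint is always 64 hexadecimal characters, so lookups stay fast even against a large database.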
Hash functions broadly fall into two categories. (1) Locality sensitive functions can be used to find matches to a file, as well as files that are very similar through a process called fuzzy hashing. (2) Locality insensitive functions are only useful for exact matches, but provide stronger user privacy protections and security against adversarial attacks.
Platforms will often maintain internal databases of known hashes that correspond to violating content. External groups such as the National Center for Missing and Exploited Children (NCMEC) and the Global Internet Forum to Counter Terrorism (GIFCT) also maintain their own hash databases filled with known illegal or harmful content from a variety of sources, so that this content can be identified and removed across many different platforms.
Hash matching is best known for its application to child sexual abuse material (CSAM). Because there is effectively no context that would allow known CSAM on a service, hash matching can be used to tackle violations with speed and accuracy, while also reducing reviewer exposure. Hashes are also commonly used to flag viral content violations and copyrighted material.
In-depth Look: Different Types of Hash Matching
There is an effectively unlimited variety of possible functions for converting a file into a hash, each with different characteristics, strengths, and weaknesses. However, the most important characteristic for applying hash functions to most practical problems in trust and safety is whether a function is locality sensitive or insensitive.
Locality insensitive hash (LIH) functions are designed so that a small change in the input file will create a completely different hash. The LIH functions used in practice are typically cryptographic hash functions. Examples include MD5, SHA-1, and SHA-256.
Conversely, locality sensitive hash (LSH) functions produce hashes in such a way that similar content will tend to produce similar fingerprints, and thus are a form of similarity learning. A locality sensitive hash function of this kind is often called a perceptual hash function. Examples include TMK+PDQF, PhotoDNA, and pHash.
LIH functions are limited to detecting exact, perfect matches of files. For example, an LIH function cannot match two copies of the same image at different resolutions or aspect ratios, or even with a single pixel changed. The two copies would produce completely different hashes, and there would be no way to identify from the hash alone that the two images are related.
LSH functions, on the other hand, are much more flexible. In addition to exact matches, an LSH function can also detect when two files are similar to each other. Different functions will evaluate similarity differently, and specialized functions exist for many common file types such as images and videos. Some LSH functions can also be used to identify specific features of files that might potentially be useful, such as age, file size, or resolution.
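The contrast between the two families can be illustrated with a toy example. The sketch below is not PhotoDNA, pHash, or any other production algorithm: it is a simplified "average hash" over a tiny grayscale grid (one bit per pixel, set when the pixel is brighter than the image mean), compared against SHA-256 on the same pixel data. The pixel values and function names are made up for illustration.

```python
import hashlib


def average_hash(pixels):
    """Toy perceptual hash: 1 bit per pixel, set if above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def hamming_distance(a, b):
    """Number of bit positions where two hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def exact_hash(pixels):
    """Locality insensitive hash of the raw pixel bytes (values 0-255)."""
    return hashlib.sha256(bytes(p for row in pixels for p in row)).hexdigest()


original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 202, 212],
    [18, 28, 208, 218],
]
# The same image with a single pixel slightly brightened.
altered = [row[:] for row in original]
altered[0][0] = 11
# A genuinely different image (brightness inverted).
different = [[255 - p for p in row] for row in original]

near = hamming_distance(average_hash(original), average_hash(altered))
far = hamming_distance(average_hash(original), average_hash(different))
```

In this example the one-pixel change leaves the toy perceptual hash unchanged (distance 0) while the inverted image differs in all 16 bits, whereas the SHA-256 hashes of the original and the altered image do not match at all. A matching system built on an LSH function would typically treat any distance below a chosen threshold as a likely match.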
However, LSH functions also have their disadvantages. One weakness of LSH functions is reduced user privacy because the hash itself can reveal information about the file that is identifying. Hashes from a secure LIH function do not reveal any information about the files that have been hashed unless an exact match can be found, increasing the level of privacy around user files.
Another concern with LSH functions is that they can be vulnerable to manipulation. It is sometimes possible to carefully craft a file which will produce the same hash as a specific target file. When two files produce the same hash like this, it is known as a hash collision. Bad actors can use this technique both to conceal violating content and to trick users into posting benign content that triggers penalties for serious policy violations. Hash collisions are generally much easier to create with LSH functions, although some older LIH functions have been cryptographically broken and are now vulnerable to this same type of attack.
Finally, because LSH functions are much more subjective and varied in how they evaluate similarity, they are more vulnerable to many of the challenges of automated systems such as bias, explainability, and systemic risk. LIH functions are limited to exact matches but are similarly limited in their exposure to these challenges. For more information, see Challenges in Automated Policy Enforcement.
Supervised and Unsupervised Models
One of the significant benefits of using AI is the ability to sift through massive amounts of content to look for patterns. This approach enables companies to quickly identify prohibited content, conduct, and accounts, or identify new trends of abuse in a way human reviewers are unable to do. Currently, AI is being used in a variety of ways to support trust and safety efforts. However, for AI to work effectively, it needs a large and diverse pool of data from which to learn.
The use of AI can be broadly divided into two categories. (1) Supervised learning uses labeled datasets to teach the system how to categorize data or predict outcomes in accordance with the data labeling. (2) Unsupervised learning does not use labeled datasets, but instead analyzes a dataset to identify common attributes, clusters, and patterns within that data.
Most AI efforts in content moderation use supervised models because they are generally more powerful when the types of violating content and behavior are already known. This is usually the case in content moderation, as these behaviors are defined by policy, Terms of Service, and/or community guidelines or standards. Labeled data is also usually available because most platforms have at least some level of human review or feedback which can be used for training models. In trust and safety, a supervised model might be used to detect abusive language, for example. To train the model, samples of content posted on the platform would be collected, and reviewers would assign a label to each piece of content: “abusive” or “not abusive.” The model would then learn to approximate this decision-making process. In cases where limited labeled training data is available or the labels are noisy, one might take an approach such as semi-supervised or weakly supervised learning, which combines elements of supervised and unsupervised models.
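The labeling-and-training loop described above can be sketched with a very small naive Bayes text classifier. Naive Bayes is chosen here only because it fits in a few lines; it is an assumption for illustration, not the method any particular platform uses, and the six labeled examples are invented and far too few for real use.

```python
import math
from collections import Counter


def train(examples):
    """Learn per-label word counts from (text, label) pairs."""
    word_counts, label_counts, vocab = {}, Counter(), set()
    for text, label in examples:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.lower().split():
            counts[word] += 1
            vocab.add(word)
    return {"word_counts": word_counts, "label_counts": label_counts, "vocab": vocab}


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    total = sum(model["label_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n in model["label_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + vocab_size  # add-one smoothing
        score = math.log(n / total)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical reviewer-labeled samples.
labeled = [
    ("you are an idiot", "abusive"),
    ("shut up idiot", "abusive"),
    ("nobody likes you loser", "abusive"),
    ("thanks for the help", "not abusive"),
    ("great photo love it", "not abusive"),
    ("see you at the game", "not abusive"),
]
model = train(labeled)
```

With this toy data, `predict(model, "what an idiot")` returns "abusive" and `predict(model, "thanks for the photo")` returns "not abusive". The model only approximates the reviewers' labels, so any bias or noise in the labeling carries through to its predictions.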
When unsupervised models are used, it is usually in the context of providing broader information—for example, identifying new patterns or populations—rather than taking direct action on specific cases. An example of an unsupervised model in trust and safety might be a tool that clusters similar accounts together. There are no labels, so the model is not predicting that a given cluster is good or bad, but reviewers could investigate particularly large clusters, as a large number of very similar accounts might indicate a coordinated network on the platform.
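The account-clustering example can be sketched as follows. This is one simple approach among many, assuming each account is described by a set of attribute strings: accounts are linked when their Jaccard similarity meets a threshold, and linked accounts are merged into clusters. The attribute names, the threshold, and the sample accounts are all hypothetical.

```python
def jaccard(a, b):
    """Share of attributes two accounts have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_accounts(features, threshold=0.5):
    """Group accounts whose attribute sets are similar; no labels needed."""
    ids = list(features)
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if jaccard(features[a], features[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    # Largest clusters first, since those are the ones reviewers inspect.
    return sorted(clusters.values(), key=len, reverse=True)


accounts = {
    "acct1": {"ip:203.0.113.7", "bio:win a prize", "created:2024-01-05"},
    "acct2": {"ip:203.0.113.7", "bio:win a prize", "created:2024-01-05"},
    "acct3": {"ip:203.0.113.7", "bio:win a prize", "created:2024-01-06"},
    "acct4": {"ip:198.51.100.2", "bio:cat photos", "created:2019-03-11"},
}
clusters = cluster_accounts(accounts)
```

The output says nothing about whether a cluster is good or bad. It only surfaces that three accounts look alike, which a reviewer can then investigate.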
Object Detection
Object detection identifies and categorizes objects in images and videos to determine whether that content violates the service’s policies or to trigger further review. Unlike similarity learning, object detection attempts to recognize general characteristics of a target object, such as shape and color, and then use these to detect the same type of object in a variety of different images.
Common uses of object detection in content moderation include detecting:
- Nudity and genitalia;
- Illegal drugs;
- Banned logos and symbols from terrorist organizations and hate groups.
Object detection can be useful in proactively identifying prohibited content but can result in errors if it misidentifies an object. As an example, a case study by the Trust and Safety Foundation examined how Facebook’s AI misidentified pictures of onions as nude body parts.
Character Recognition and Speech to Text
While a picture is worth a thousand words, words are still important when identifying whether a picture or video violates a policy. For image-based or image-heavy services (such as photo or video sharing apps), the words on an image can often be more relevant than the image itself when determining whether that content is prohibited. This can include text overlays, subtitles, and labels, as well as text within the image itself, such as on posters or banners. Character recognition is a process that extracts text from images for analysis. Automated analysis of text is still generally more accurate and less expensive than automated analysis of images.
Speech to text is another tool that allows audio or video to be analyzed quickly. As the name suggests, speech to text is the process of converting spoken words in an audio or video file into text. Once audio is converted into a written format, it can be processed more easily by the algorithms and keyword searches used for automated content review.
Speech to text is also useful for reviewing very long audio or video files. For example, if an hour-long video is posted on a platform, it is prohibitively expensive to ask a person to manually watch and listen to it in its entirety. By converting the audio to text, either human or algorithmic systems can more quickly read or search through the information for potentially violating content.
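The search step can be sketched as below. The example assumes that a speech-to-text system (not shown) has already produced a transcript as a list of timestamped segments; the segment format, the `find_flagged_segments` function, and the sample keywords are hypothetical.

```python
def find_flagged_segments(segments, keywords):
    """Return (start_seconds, matched_keywords) for segments with a hit.

    segments: list of (start_seconds, text) pairs from a transcript.
    """
    lowered = [k.lower() for k in keywords]
    hits = []
    for start, text in segments:
        matched = [k for k in lowered if k in text.lower()]
        if matched:
            hits.append((start, matched))
    return hits


# Hypothetical transcript of an hour-long video.
transcript = [
    (0, "Welcome back to the channel"),
    (1830, "Click the link to claim your free prize"),
    (3400, "Thanks for watching"),
]
hits = find_flagged_segments(transcript, ["free prize", "wire transfer"])
```

Here the only hit is at 1,830 seconds, so a reviewer could jump directly to that point instead of watching the full hour.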
Natural Language Processing
Natural language processing (NLP) is an area of AI concerned with analyzing large amounts of natural human language, usually written text. The aim of NLP is to identify meaningful and understandable information from human language while also accounting for context and ambiguity. Text posts, comments, messages, and subtitles from videos are all potential targets for NLP systems.
General NLP systems can be used to enhance trust and safety operations. For example:
- Translation systems can be used to allow reviewers to evaluate text in other languages.
- Grammar checkers can be used to tell if a website or piece of text makes coherent sense or contains only spam-like lists or gibberish.
- Topic classifiers can be used to identify what topics a piece of text references; for example, text that is about illegal drugs is more likely to contain policy violations related to illegal drugs.
Customized NLP systems can also be used for specific tasks relevant to policy enforcement. For example, detection of hate speech using NLP has been studied extensively both within the industry and in academia. Identifying hate speech requires understanding slurs and coded language as well as contextual clues to avoid penalizing reporting or criticism, making it a problem well suited to NLP.
Another well-studied area in NLP is sentiment analysis, the use of AI to identify the opinions and emotions expressed in a piece of content. Sentiment models can determine whether these opinions and emotions are positive or negative, and how strongly they are being expressed. This is useful because certain types of prohibited content tend to be associated with very strong emotions. For example, bullying and threats of violence are likely to include strong negative sentiments.
Sentiment analysis is usually combined with other factors to accurately identify specific types of content and behavior. For example, a piece of content that has very strong negative sentiment and also includes mention of a group of people or religion could indicate that it is hate speech. Similarly, very positive sentiments in combination with the mention of a terrorist group could indicate that the content is praising or supporting terrorist activities.
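The combination of signals described above can be sketched with a toy lexicon-based scorer. Real sentiment models are far more sophisticated; the word scores, thresholds, placeholder terms ("groupx", "bannedorg"), and routing labels below are all hypothetical, and the output is a reason to send content for review, not a final decision.

```python
# Hypothetical sentiment lexicon: negative values are negative sentiment.
SENTIMENT = {"hate": -3, "disgusting": -3, "terrible": -2,
             "love": 3, "great": 2, "support": 2}


def tokenize(text):
    return [w.strip(".,!?").lower() for w in text.split()]


def sentiment_score(text):
    """Sum the lexicon scores of the words in the text."""
    return sum(SENTIMENT.get(w, 0) for w in tokenize(text))


def triage(text, group_terms, org_terms):
    """Combine sentiment with mentions to pick a review queue."""
    words = set(tokenize(text))
    score = sentiment_score(text)
    # Strong negative sentiment plus a mention of a group of people.
    if score <= -3 and words & group_terms:
        return "review_possible_hate_speech"
    # Strong positive sentiment plus a mention of a banned organization.
    if score >= 3 and words & org_terms:
        return "review_possible_praise_of_banned_org"
    return "no_flag"


groups = {"groupx"}
orgs = {"bannedorg"}
```

Neither signal is sufficient alone: strong negative sentiment about an everyday topic is not flagged, and a neutral mention of a group is not flagged either.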
Models in NLP range from extremely simple systems that count usage of different keywords to dense neural networks and transformer models pre-trained for general language understanding on the vast data stores of major technology companies. NLP is currently one of the fastest moving areas of AI because understanding human language is such a broadly useful skill for modern computing tasks.