Challenges in Automated Policy Enforcement

While there are many benefits to automation and AI, including the ability to process massive amounts of content more quickly than any human reviewer, there are a variety of challenges when deploying such techniques. Automation and AI are not always the solution for critical trust and safety tasks like policy enforcement and, given the nascency of these tools and techniques, extra caution is often warranted. This section outlines several costs, challenges, considerations, and risks involved in using automated and AI-driven systems in trust and safety policy enforcement.

Cost of Implementation and Maintenance

Implementing ML for policy enforcement can incur considerable costs, including costs to hire and retain machine learning developers and to label, store, and process data. Once a model has been trained and deployed, its performance needs to be monitored over time. Moreover, because some types of content (e.g., spam, hate speech, and misinformation) frequently change and evolve, model performance will necessarily degrade over time and will require retraining at regular intervals.
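The monitoring step described above can be sketched as a periodic check of precision on a freshly reviewed sample. This is a minimal sketch under stated assumptions: the function names, the baseline value, and the 0.05 tolerance are illustrative, not recommended settings.

```python
def precision(decisions):
    """decisions: list of (model_flagged, reviewer_confirmed) boolean pairs."""
    confirmed = [ok for flagged, ok in decisions if flagged]
    if not confirmed:
        return None  # the model flagged nothing in this window
    return sum(confirmed) / len(confirmed)


def needs_retraining(baseline, window, tolerance=0.05):
    """Flag the model when precision on a recent window drops below baseline - tolerance."""
    current = precision(window)
    return current is not None and current < baseline - tolerance


# Hypothetical audit sample: 10 flagged items, 7 confirmed by reviewers.
recent = [(True, True)] * 7 + [(True, False)] * 3 + [(False, False)] * 5
print(precision(recent))               # 0.7
print(needs_retraining(0.90, recent))  # True
```

In practice the same check would likely track recall and per-policy metrics as well, since precision alone does not reveal violations the model has stopped catching.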

Data Availability, Quality, and Accuracy

As previously discussed, while some AI models may be unsupervised, the more robust and effective systems used in policy enforcement are supervised models, which require labeled data. In other words, the system requires known examples from which to learn. In some instances, data may not be available for labeling or the labeled data may be of poor quality. Data may not be available for a variety of reasons: the service has not launched yet so no data or comparable data set yet exists; data may not be retained or is otherwise inaccessible due to user privacy considerations; or the data may not be structured in a way that makes it accessible to labeling.

Poor quality or inaccurate data presents another challenge when trying to develop an effective model. To address this, some organizations use their internal reviewers or moderators to label data and others outsource to third-party contractors or through crowdsourcing platforms such as Amazon Mechanical Turk and Appen.

Trained moderators have the highest degree of knowledge about specific content or behavior policies, and can therefore generate the labels that reflect those policies most reliably. User reports can also serve as a possible source of labels, but they are less reliable and have been manipulated to implicate benign users as a means of harassment. Even in the absence of such harmful manipulation, users often lack a complete understanding of what constitutes a policy violation as defined by the service’s Community Guidelines or Terms of Service, and therefore their reports may not be accurate.

Furthermore, even the most knowledgeable labelers will run into edge cases where it is unclear how a particular policy should be applied, and they must then make a judgment call on how the instance is labeled. The experiences and perspective of the labeler will affect their judgment in these cases, introducing bias (see Bias in AI Systems below). In situations with less reliable labels, the resulting labels may be post-processed with various strategies such as voting systems to mitigate the effects of erroneous labels.
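One simple form of the voting strategy mentioned above is majority voting across several labelers. The sketch below is illustrative only; the function name and the 0.6 agreement cutoff are assumptions, and real pipelines may weight labelers by their track record instead.

```python
from collections import Counter


def majority_label(votes, min_agreement=0.6):
    """Return the most common label, or None when agreement is too low.

    votes: labels from independent labelers for a single item.
    min_agreement: hypothetical cutoff; items below it go to expert review.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None


print(majority_label(["spam", "spam", "ok"]))        # spam (2 of 3 agree)
print(majority_label(["spam", "ok", "ok", "spam"]))  # None (tie, so escalate)
```

Returning None rather than forcing a label keeps ambiguous edge cases out of the training set until someone with policy expertise has looked at them.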

Defining the Appropriate Action Threshold

Often, models used for trust and safety will output a continuous score or prediction, representing the model’s confidence that the given input is a member of the positive class. For example, consider a binary classifier trained to detect hate speech. In this case, the inputs are the pieces of content being evaluated, and for each one, the classifier will answer the question “Is this hate speech?” with a prediction that ranges from 0 to 1, representing the model’s confidence that the content is indeed hate speech.

Next, it is necessary to determine a threshold for selecting which content will be reviewed, downranked, automatically removed, or actioned in another way. If this threshold is set very high, fewer pieces of content will score above it, but each one is a post that the model is highly confident is hate speech. On the other hand, a high threshold could miss many true examples of hate speech about which the model was less confident.

If a lower threshold is chosen, more true hate speech examples may be identified, but there may also be more false positives—content the model thought was likely to be hate speech, but was not.
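The tradeoff between high and low thresholds can be made concrete with precision (how much flagged content is truly hate speech) and recall (how much true hate speech is flagged). The scores and labels below are invented for illustration, and treating an empty set as precision 1.0 is a convention chosen for this sketch.

```python
def precision_recall(scored, threshold):
    """scored: list of (model_score, is_hate_speech) pairs."""
    tp = sum(1 for s, y in scored if s >= threshold and y)
    fp = sum(1 for s, y in scored if s >= threshold and not y)
    fn = sum(1 for s, y in scored if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention when nothing is flagged
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Made-up scores and reviewer labels for eight posts.
sample = [(0.98, True), (0.95, True), (0.85, False), (0.80, True),
          (0.60, True), (0.55, False), (0.30, False), (0.10, False)]

for t in (0.9, 0.5):
    p, r = precision_recall(sample, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.9: precision=1.00, recall=0.50
# threshold=0.5: precision=0.67, recall=1.00
```

On this toy sample the high threshold makes no false accusations but misses half of the true hate speech, while the low threshold catches all of it at the cost of two false positives.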

These thresholds might be set based on a chosen constraint on the rate of false positives, agent review capacity, or the deployment context. Multiple thresholds may be used for the same model and task. For example, a system might automatically remove content that the model scores above 0.99 and queue content scored between 0.8 and 0.99 for human review.
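A two-threshold setup like the one described here could be sketched as a small routing function. The cutoffs 0.99 and 0.8 come from the text; the function name and action labels are hypothetical.

```python
def route(score, remove_at=0.99, review_at=0.8):
    """Map a model score in [0, 1] to an enforcement action."""
    if score > remove_at:
        return "auto_remove"   # model is extremely confident
    if score >= review_at:
        return "human_review"  # likely violation, but a person decides
    return "no_action"


for s in (0.995, 0.90, 0.40):
    print(s, route(s))
# 0.995 auto_remove
# 0.9 human_review
# 0.4 no_action
```

Keeping the cutoffs as parameters makes it straightforward to tighten or loosen them as review capacity or the acceptable false positive rate changes.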

Systemic Risk

Another consideration when using any type of automation for policy enforcement is systemic risk. Because automation is applied at scale, if the model is too broad or makes a mistake, large portions of the system may be affected, resulting in overly broad or extended actions. As an example, in March of 2021, Twitter users discovered a bug that resulted in users being suspended for up to 12 hours for tweets that included the word “Memphis.” Although the exact automation that caused this error is unknown, because of its ability to act in real time and at scale, the errors an automation system makes can far surpass errors caused by manual content moderation methods.


Explainability

A useful property for models tasked with making predictions such as “Is this violating policy?” is some kind of justification for the prediction. Models with this property can not only identify a violation but provide information on how they came to that conclusion. For example, this may be done by stating how much of a piece of spam appears to be scraped from other sources and how much this contributed to the level of confidence in the model decision. This property is referred to as explainability.
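For a simple linear model, this kind of justification falls out of the arithmetic: each signal's contribution is its weight times its value. The sketch below assumes a small logistic-regression spam model; the feature names, weights, and bias are all invented for illustration.

```python
import math

# Hypothetical weights of a small logistic-regression spam model.
WEIGHTS = {"scraped_fraction": 3.0, "link_count": 0.5, "account_age_years": -0.4}
BIAS = -2.0


def explain(features):
    """Return the model score and each feature's additive contribution to the logit."""
    contributions = {name: WEIGHTS[name] * value for name, value in features.items()}
    logit = BIAS + sum(contributions.values())
    score = 1 / (1 + math.exp(-logit))
    # Largest absolute contribution first, so the main driver is listed at the top.
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return score, ranked


score, ranked = explain({"scraped_fraction": 0.9, "link_count": 2, "account_age_years": 1})
print(f"score={score:.2f}")
for name, contribution in ranked:
    print(f"{name}: {contribution:+.2f}")
```

Here the scraped fraction is the largest driver of the decision, which is the kind of statement a reviewer or policy designer can act on. More complex models generally need dedicated techniques to produce comparable breakdowns.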

Explainability can be helpful to trust and safety teams in a number of ways. A greater understanding of the specific signals driving model decisions can provide insight to model builders, reviewers, and policy designers. This can improve general performance, and is particularly useful for determining if models are skewed or biased, or have exploitable vulnerabilities. Explainability can also help with general maintenance and with identifying shifts and trends in the populations to which the models are being applied.

Better explainability can also be useful for communications with users. Exact details of models and signals are usually withheld from users to prevent violators from exploiting loopholes or weaknesses. However, being able to highlight granular behavior more clearly to the user can help them resolve violations and avoid repeating them. Examples of this include pointing to a specific timestamp in a video where the violation is clear, or letting a website owner know that keyword stuffing behavior was detected on their site. These details can be communicated to the user, either through automatic messaging or through more manual or interactive processes like contact form responses or appeals.

There are also some situations where more detailed explanations of a decision are not just useful but also necessary when resolving the underlying issue. Consider a death threat that a model has identified as being imminent and credible and that needs to be reported to law enforcement. Simply stating that a user scored above a certain threshold on a classifier is not nearly as useful as a list of specific criminal or dangerous behaviors that would justify police action, particularly when communicating with external groups like law enforcement.

There are two major sources of explainability. Firstly, information about how the model makes decisions is produced during the model creation process. Many models use measures of how much a particular variable improves the performance of a model as part of the process of building the model, to decide which variables to include, where to set thresholds, and how much weight to give them. Secondly, explainable signals can sometimes be found through inspection of the model after creation. For example, if the model links a set of features together, such as a particular combination of shape and color in an image, then this might provide insights into specific objects the model is searching for which correspond with violations.

Explainability can influence which models and signals are used for automation. For models with complex structures or large numbers of variables, it can be challenging to identify clear reasons for decisions. However, more complex models can sometimes capture more nuanced relationships between different signals. Sometimes there may be a tradeoff between explainability and performance.

As of 2022, explainability is an active and evolving area of machine learning research. Examples of modern techniques designed to help explain the inner workings of models include: attention maps, which highlight the pixels in a particular image that contribute most to the eventual prediction; and more generically feature importances, which can similarly show which features influence the model the most. Another technique involves using outputs from a complex model to train a simpler model that approximates the same result but is more easily explainable. Explainability functions are now regularly included in both commercial and open source AI tools and packages.
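One common way to estimate feature importance without opening up the model is permutation importance: scramble one feature across the test rows and measure how much accuracy drops. The sketch below uses a toy model and invented data, and it replaces the usual random shuffle with a one-position rotation purely so that the result is reproducible.

```python
def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_features):
    """Accuracy drop when one feature column is scrambled across rows.

    A one-position rotation stands in for the usual random shuffle
    so that the output is deterministic.
    """
    base = accuracy(model, rows, labels)
    drops = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rotated = column[1:] + column[:1]
        scrambled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, rotated)]
        drops.append(base - accuracy(model, scrambled, labels))
    return drops


# Toy "opaque" model: flags a post with more than 3 links and ignores length.
def model(row):
    link_count, length = row
    return link_count > 3


rows = [(5, 120), (0, 80), (6, 300), (1, 40)]  # (link_count, length)
labels = [True, False, True, False]
print(permutation_importance(model, rows, labels, n_features=2))  # [1.0, 0.0]
```

Scrambling the link count destroys the model's accuracy while scrambling the length changes nothing, which reveals that the model relies entirely on the first feature.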

AI Alignment

An issue related to explainability is that of AI alignment. A model is well aligned when the decisions it makes and the reasons for those decisions are consistently the same as the model creators and owners would want them to be. This is different from more general model performance because the model can be classifying or prioritizing outputs well and still be misaligned if it is making the right decisions for the wrong reasons.

For example, imagine a model trained to catch a particular type of email spam that notices that the biggest producer of that spam tends to use the same basic format for email addresses. If the model uses that format as a strong signal, it is not actually targeting behavior that is inherent to the violations. This leaves the model vulnerable to sudden dips in performance.

Alignment problems often occur when there is a high degree of subjectivity and nuance in the decisions a system is making. Such problems are common in trust and safety due to the complex ethical questions, cultural differences, and context involved in this type of work. These issues can often be challenging to quantify in ways that can be interpreted by a computer.

AI alignment is another active area in modern AI research. It covers both specific problems and solutions for measuring and improving alignment, and the more generalized issue of how to make sure that AI is aligned with the interests and goals of society and humanity. As a result, current research contains a mix of technical evaluations and solutions, and discussions of moral and ethical frameworks for using AI.

Bias in AI Systems

Just as humans bring their own biases to decision-making, algorithms trained to make trust and safety predictions can also display bias. Bias can crop up in AI systems during any stage of development and can be introduced through the data the model is trained and tested on, the structure of the model, or through where and how the model is deployed. It is impossible to have a perfectly unbiased model, but cautious system design can help to mitigate harms.

Underlying Data Challenges

Issues with the underlying data that AI models are trained on are a common cause of bias in AI (see Data Availability, Quality, and Accuracy above). Training data that contains errors or does not represent the true population data will skew models created with that data. Similarly, poor test data may give a false sense of confidence in a model. Models are often particularly vulnerable to spurious data correlations that result in biases, since they are almost always designed to look for correlations, not causation.

As training data is usually provided by humans, it is also subject to human bias. Individual labelers and users will have their own personal biases that may affect reviews, but bias can also be rooted in more systemic issues. Neither staff hired to perform reviews nor users who volunteer or submit ratings are likely to be exactly the same as the general user base of a platform, and this can result in bias towards or away from different viewpoints.

One example of this is the reviewer location. Often reviewers are based in a few large hubs in places where either the platform is located or where staff with appropriate knowledge such as languages and subject matter expertise are available and affordable. This can limit the diversity of opinion and bias reviewers towards the specific cultural perspectives common in those locations.

AI systems can also magnify existing bias by a process called reinforcement when model outputs are reused as labels or otherwise create a feedback loop. For example, if an AI system concludes, from bias, that certain content is more likely to be violative, that content is in turn more likely to be prioritized for review. This can result in additional reviews affected by the bias, the extra data from which can cause the system to become more confident in the biased decisions it’s making.

Language and Population Challenges

One example of bias in trust and safety AI is language bias. Many of the world’s biggest platform companies were founded and remain headquartered in the United States, and English is used around the world in business and academic contexts. Thus, many of the NLP tools and models perform best on English text. There is often more data in English available, more pretrained models that support English, and more English speakers who can understand and evaluate model results as compared to other languages.

In August 2021, Twitter conducted an algorithmic bug bounty contest for detecting bias in its image cropping algorithms. The platform used a machine learning model to determine the most relevant and interesting sections of the photos users uploaded and cropped images automatically to center those sections. Independent researchers identified several issues, noting the algorithm was more likely to center characters from the modern Latin alphabet (A-Z) and to crop out characters from other alphabets such as Arabic or Cyrillic script.

No model perfectly captures the standards of every user it affects. Models are often vulnerable to bias towards the most active users who generate more activity data, while making more mistakes on rarer behavior. This can sometimes create a tradeoff between building generalized models with a lot of data to improve accuracy but possible bias towards the majority, and specialized models for particular areas which have less data with which to train.

Some words and phrases may violate policy in some languages or contexts, while being totally innocuous in others. One example of this is the French town of Bitche—in French, this name is not a profanity, but it may be mistaken for profanity by models trained mostly with English users. A pejorative slur may also be reclaimed by the affected community, in which case its usage may not be penalized.

Bias can also be caused by increased difficulty in the underlying task being performed by the AI system affecting some populations more than others. For example, a piece of text that discusses hate speech is likely to contain many similar words and textual features to a piece of text that is itself hate speech, making it harder to evaluate than a piece of text about an unrelated topic. This means that protected groups who are frequently affected by such speech can be disproportionately affected by errors because they are more likely to post text that is difficult to classify.

Mitigation Measures

Testing systems before deployment with a range of curated and labeled examples can provide insights into both general trends and specific known cases of bias. This set can also include manufactured examples when there is not a large enough sample of historical data. A strong process for examining trends in user appeals and false positives is important too as it will provide continuous feedback when populations and behaviors change over time. A variety of open source tools and libraries related to errors and bias in AI are also currently accessible. (See AI Fairness Tools below for a list of various fairness tools.)
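A pre-deployment test of this kind can be as simple as comparing false positive rates across groups on the curated set. In the sketch below the groups are language codes and every example is invented; the function name and tuple layout are assumptions.

```python
from collections import defaultdict


def false_positive_rate_by_group(examples):
    """examples: list of (group, model_flagged, truly_violating) tuples."""
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for group, flagged, violating in examples:
        if not violating:
            negatives[group] += 1
            if flagged:
                false_positives[group] += 1
    return {g: false_positives[g] / n for g, n in negatives.items()}


# Hypothetical curated test set with reviewer-verified labels.
curated = [
    ("en", False, False), ("en", False, False), ("en", True, False), ("en", True, True),
    ("fr", True, False), ("fr", True, False), ("fr", False, False), ("fr", True, True),
]
for group, rate in false_positive_rate_by_group(curated).items():
    print(f"{group}: {rate:.2f}")
# en: 0.33
# fr: 0.67
```

A gap like the one between the two groups here would be a prompt to investigate before launch, although samples this small are far too noisy to support conclusions on their own.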

Strategies for reducing bias often involve improving the quality, detail, and scope of the underlying data used to train the system. Ensuring that groups vulnerable to bias are well represented, through increased sampling or injection of curated examples, can mitigate bias for these groups too. Conversely, sometimes removing or reducing the power of signals vulnerable to bias can improve the fairness of a system.
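Increased sampling for an underrepresented group can be sketched as simple oversampling: repeating that group's examples until it matches the largest group. This is only one of several possible approaches, the function and data below are hypothetical, and the sketch assumes a non-empty training set.

```python
from collections import defaultdict
from itertools import cycle, islice


def oversample(examples, group_of):
    """Repeat examples from smaller groups until every group matches the largest."""
    by_group = defaultdict(list)
    for example in examples:
        by_group[group_of(example)].append(example)
    target = max(len(items) for items in by_group.values())
    balanced = []
    for items in by_group.values():
        # cycle() repeats the group's examples; islice() stops at the target size.
        balanced.extend(islice(cycle(items), target))
    return balanced


# Invented training rows of (language, text).
data = [("en", "a"), ("en", "b"), ("en", "c"), ("fr", "d")]
balanced = oversample(data, group_of=lambda row: row[0])
print(len(balanced))                                 # 6
print(sum(1 for row in balanced if row[0] == "fr"))  # 3
```

Exact duplicates add no new information, so collecting or curating additional real examples for the smaller group is usually preferable when it is feasible.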

In particular, it is unwise to include signals that represent or closely align with groups vulnerable to discrimination in systems that make decisions. A common example of this is location data such as zip code which often correlates with race and has been identified as a source of bias in financial, policing, and health care systems. Such signals should generally be used for monitoring trends only.

Adversarial Actions

Users can avoid and attack automated systems for policy enforcement in a myriad of creative ways. First, they might use adversarial actions to evade known automated systems, such as making small changes to images and videos that aren’t allowed. As noted earlier in this chapter, users may speed up or slow down audio, or add visual artifacts like squares and glitter in attempts to avoid detection.

In addition, models are themselves susceptible to adversarial examples. To use a simple example, consider a model trained to detect spam in which one of the correlations that the model has learned is that a word w is highly associated with spam. An adversary wants to spam other users with content including w, but the inclusion of w causes the spam to be detected. The adversary might then post a lot of relevant, informative, non-spam content using w, weakening the correlation the model has learned. If the attack is successful, the model will no longer mark the original post as spam. Similarly, attackers might try to poison spam classifiers by crafting a high volume of high-quality, non-spam-like content that includes spam components, as one Arabic-language spam campaign did successfully on Twitter to promote the sale of Viagra, weight loss, and hair treatment drugs.
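The poisoning attack on the word w can be illustrated with a deliberately naive statistic: the fraction of posts containing w that are labeled spam. Real classifiers are more sophisticated, and every post below is invented, but the way the learned association weakens is the same in spirit.

```python
def spam_rate_for_word(posts, word):
    """Fraction of posts containing `word` that are labeled spam."""
    labels = [is_spam for text, is_spam in posts if word in text.split()]
    return sum(labels) / len(labels) if labels else 0.0


# "w" stands in for the word from the example above.
history = [("buy w now", True), ("cheap w here", True), ("w review", False)]
print(spam_rate_for_word(history, "w"))   # 2 of 3 posts containing w are spam

# The adversary floods the platform with benign posts that contain w.
poisoned = history + [("helpful guide to w", False)] * 7
print(spam_rate_for_word(poisoned, "w"))  # 0.2, the association is much weaker
```

A model retrained on the poisoned data would see w as a weak spam signal, so defenses often include limiting how much any single account or campaign can contribute to the training set.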

Finally, underlying processes may also be attacked by adversarial users, such as coordinated campaigns to report another user who has not violated any policy, or flooding the appeals process to gum up the system. On Instagram, users have reported their accounts being suspended due to mass false reports from scammers who hope to gain control over the account or username so that they can then sell it.

Implementation of Appeals Process

AI models are rarely perfectly precise. Even if they are, their precision is generally limited to a narrow selection of cases. Most of the AI and automated systems used for policy enforcement will have some known, non-zero rate of error. Thus, it is crucial that users are able to appeal an automated decision that impacts their account or content if they believe the decision was made in error.

Limits of Automation

Automation and AI systems can provide exceptional improvements to the overall performance and efficiency of trust and safety operations. However, there are many areas within trust and safety where these systems are less useful or introduce additional risks. These risks, just like the benefits, can scale rapidly as automated systems can take huge numbers of actions at a time.

Some of the problems described in this chapter are inherently more likely to be difficult to automate than others. For example, it is usually harder to automatically classify video than text. Similarly, policies that often rely on context and nuance such as hate speech are harder to automate than those that rely on near or exact matches to specific words or content such as profanity filters and copyright violations.

Automation is usually less relied upon in systems designed to provide scrutiny to decisions—examples of this include user appeals and quality checks. These procedures effectively provide a safety net or sense check and automating them increases the chances that significant problems go unidentified.

One feature of certain AI systems, most notably large neural networks, is that they become “black boxes” where the reasoning behind a decision is not always easy to extract from the model. If providing such reasoning is important or if a clear explanation of nuanced and specific details is required, manual review and processing may be a necessary step. See Explainability and AI Alignment for more details.

Automation may also pose challenges in areas with strong legal requirements or restrictions. For example, the NetzDG law targeting hate speech that passed in 2017 in Germany requires deletion of content that is “obviously illegal” within 24 hours of it being reported. This creates a risk of significant fines if an automated system approves content to remain after it has been reported under NetzDG rules.

It is important to note that using these systems is not a binary choice. By customizing sensitivity, it is possible to automate only decisions where there is extremely high confidence and leave the remainder to be checked by human teams. It is also often possible to use such systems to assist manual reviews by highlighting points of concern, grouping related content together, and prioritizing reviews by the likelihood and severity of violations.
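Prioritizing reviews by likelihood and severity could be sketched as sorting the queue by the product of the two. The item fields and the 1-to-5 severity scale are assumptions made for this example; other ways of combining the two factors are equally plausible.

```python
def prioritize(queue):
    """Order items so the highest expected harm (score * severity) is reviewed first."""
    return sorted(queue, key=lambda item: item["score"] * item["severity"], reverse=True)


# Hypothetical review queue; severity is an assumed 1-to-5 policy weight.
queue = [
    {"id": "a", "score": 0.95, "severity": 1},  # very likely, low-severity violation
    {"id": "b", "score": 0.60, "severity": 5},  # less certain, high-severity violation
    {"id": "c", "score": 0.85, "severity": 2},
]
print([item["id"] for item in prioritize(queue)])  # ['b', 'c', 'a']
```

The less certain but more severe item is reviewed first, which reflects the idea that human attention is best spent where a missed violation would do the most harm.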


Automation and AI are among the most powerful and effective strategies for ensuring that a platform is kept safe. Carefully designed and routinely maintained systems can provide reliable decisions to protect users, and do it at the scale, speed, and reduced cost necessary to handle global populations and malicious actors who have their own scalable methods.

However, these systems are not the perfect solution to every problem. Automation still requires dealing with complex and subjective decisions, ethical dilemmas, and a huge range of potential context and nuance, as well as formidable technical challenges. AI is dependent on support and information from human beings, and can reflect and magnify human errors and biases.

Though AI and automation is a field that has made and continues to make significant advancements, there are still many important areas where it has not yet reached the high levels of consistency and accuracy needed for certain tasks. Automation is a powerful tool and, because of this power, trust and safety professionals must carefully consider the risks and consequences to deploy it effectively.

Relevant Case Studies

A list of case studies relevant to this chapter:

AI Fairness Tools

A list of tools currently available for evaluating and handling bias and balance problems in ML and other automated systems:


Authors│Maggie Engler, Jeff Lazarus, James Gresham
Contributors│Amanda Menking
Special Thanks│Jorge Garcia Rey, Amy Zhang