This section discusses the “how” of Safety by Design and is organized into three parts. The first part relates to process: safety-oriented frameworks and processes to identify and assess risk of harm. The second part relates to technology and offers a menu of safety interventions. The third part enables evidence-informed action and continuous improvement via measurement and evaluation strategies.
Safety by Design Processes & Protocols
Some common tools and processes used to assess and mitigate risks to user safety are described below.
- Product risk assessments are common tools for identifying and evaluating risk in the early stages of product or feature development. These assessments collect cross-functional input from across the company (e.g., product, engineering, research, T&S, legal and public policy teams) on how the design of products can trigger hazards or create behavioral incentives that drive harm. Risk assessments are usually conducted prior to the release or launch of a new product or during significant external events (e.g., elections or armed conflict). They begin with an assessment of the risk of harm and conclude with the development of a mitigation plan based on risk parameters and tailored to business context.
- Privacy and security reviews: Safety risk assessments are often strengthened by active collaboration with privacy and legal reviews. These reviews are generally mandated by the data protection laws of most countries. Wider ‘safety’ reviews are also becoming necessary as online safety laws increasingly seek to regulate content moderation systems and processes.
- Safety oriented red-teaming exercises bring together external and internal experts to systematically identify vulnerabilities and abuse vectors in a specific product or design feature. Red teams can help provide a holistic perspective on threat exposure and also build broad-based buy-in for safety mitigations.
Assessing and Mitigating Risk of Harm
T&S practitioners generally assess risk of harm on parameters of severity and likelihood. Severity refers to the degree of impact of the harm, while likelihood refers to the probability of its occurrence. In practice, the most prevalent abuse types are not the most severe (for instance, spam), while the most severe harms are generally of lower prevalence. Scale and frequency are also useful parameters which consider the extent of impact on individuals and groups, and how often a given harm may recur. These risk parameters—as well as risk mitigation planning—are generally shaped by the factors described below.
- Design of a product/platform or business model: For example, products that host user-generated content (UGC) can enable the spread of illegal, harmful and undesirable content or conduct, with direct consequences for user safety. Features that enable private messaging entail safety risks; features to which children have access, or which facilitate real-world contact, are inherently riskier; and platforms with ad-driven or subscription-based business models may incentivize dissemination of and engagement with harmful content. Certain harm types (e.g., hate speech) may be at higher risk of occurrence on specific types of product surfaces (e.g., recommender/ranked content feeds or real-time messaging platforms).
- User base characteristics: The stakeholders of a platform include its user groups (end users, merchants, drivers, gig workers, etc.) and collectives such as societies and demographic groups, which experience second-order effects of harms such as misinformation or civic manipulation. It is important to note that the risks and impact of harms are not uniformly experienced; they are conditioned by user characteristics such as identity (race, sexual orientation, gender, religious identity), age, location, and circumstances (mental illness, physical disability or material deprivation). The Office of Communications (Ofcom), the UK’s online safety regulator, has commissioned useful research on the ways in which users experience online harms. Qualitative assessments of how user cohorts experience harm are important inputs for assessing its impact, and mitigations can then be tailored to the risks and impacts uniquely experienced by specific user groups.
- Wider external environment: When assessing harm, platforms should also consider the wider business, socio-political, and regulatory environment. Rapid or large-scale social or political change can impact the magnitude of risk factors. Similarly, business transformation projects such as acquisitions, mergers or commercial partnerships can change overall platform risk. Tailored packages of mitigations may be implemented during specific crisis events. For example, elections might entail virality control measures, UX nudges to discourage anti-social behavior or adding authoritative information in core product experiences. The Integrity Institute’s Election Integrity Best Practices Guide is an excellent repository of civic integrity recommendations.
- Geographic and cultural considerations: Technologies and products developed in one market and deployed in another can trigger novel risks or exacerbate existing ones. The political and legal environment, cultural factors, and patterns of use and abuse within specific regions deeply influence the risk of harm. For instance, a product or feature that allows users to generate content, deployed in a country rated “not free” or “partly free” or where there are documented threats to civil liberties and internet freedom, could create risks to the rights and safety of users. Platforms should actively incorporate market-specific risk factors into the assessment of harm, the design of products, and risk control planning.
| In-depth look: Safety by design for specific user demographics. Children and youth require unique and targeted design considerations that account for their developmental needs. This includes content and behavioral policies, default high-privacy settings, age-gating and age verification mechanisms, parental controls, age-appropriate classification indicators, and restrictions on targeted advertising. The UK Information Commissioner’s Age Appropriate Design Code and legislative proposals like the California Age-Appropriate Design Code Act are good places to start examining safety-first design choices when building products intended for use by younger users. The Tech Coalition recently released helpful guidance on how to assess child sexual exploitation and abuse harms in product development. LGBTQ+ users face disproportionate rates of online and offline harm compared to other identity groups. They report higher rates of online harassment and real-world violence, which necessitates a thoughtful approach to product design and moderation. GLAAD, a leading LGBTQ advocacy organization, recommends practices to ensure the safety of LGBTQ groups. Examples include: specific and targeted protection of sexual and gender identities in community guidelines, gender-inclusive identity options, tools to allow users to report, flag or block abusers, lewd image detectors, elements of design “friction” to slow the spread of hate and disinformation targeting LGBTQ users (such as content interstitials), restriction of anti-LGBTQ hashtags, and toxic reply nudges. |
Risk assessments are generally followed by mitigation planning, which is informed by the assessed propensity for harm and tailored to the type of feature, the context of its deployment, and wider business considerations, including resourcing, effort, and impact. The Integrity Institute’s Focus on Features is an excellent resource that pairs platform design recommendations with harm types.
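As a simple illustration of how the severity and likelihood parameters discussed earlier can be combined into a prioritization signal for mitigation planning, here is a minimal sketch of a qualitative risk matrix; the scales, score thresholds, and tier labels are hypothetical and would need calibration to each platform’s context.

```python
from enum import IntEnum

class Severity(IntEnum):
    LOW = 1        # e.g., mild spam
    MODERATE = 2   # e.g., harassment
    HIGH = 3       # e.g., credible threats
    CRITICAL = 4   # e.g., child safety, imminent real-world harm

class Likelihood(IntEnum):
    RARE = 1
    POSSIBLE = 2
    LIKELY = 3
    FREQUENT = 4

def risk_score(severity: Severity, likelihood: Likelihood) -> int:
    """Simple severity x likelihood matrix; higher scores get prioritized first."""
    return int(severity) * int(likelihood)

def mitigation_tier(score: int) -> str:
    """Hypothetical thresholds mapping a score to a mitigation planning tier."""
    if score >= 12:
        return "launch blocker: mitigate before release"
    if score >= 6:
        return "mitigation plan required at launch"
    return "monitor and revisit post-launch"

# Example: a private messaging feature accessible to minors
score = risk_score(Severity.CRITICAL, Likelihood.POSSIBLE)
print(score, mitigation_tier(score))  # 8 -> "mitigation plan required at launch"
```

In practice the output of such a matrix is only a starting point; the business-context factors described above (user base, external environment, geography) shape how the tiers are acted on.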
Safety by Design Interventions
This section addresses the technology elements of implementing Safety by Design. Safety-oriented functionalities can be classified according to the remediate, reduce, and prevent harm/promote safety rubric discussed in the introduction to this chapter.
Promote Safety and Prevent Harm:
- Design elements such as user controls, default settings, intuitive and user-friendly interfaces, and other evidence-based design features that optimize for healthy interactions.
- Safety-enhancing features that protect users’ information, enable real-time safety alerts, and/or give users the ability to control their own exposure to harm.
- Proactive technologies to prevent harm, such as abuse detection and sanctioning systems that seek to detect and remove harmful content before a user encounters it. These include proactive text classifiers and image-hashing technology.
Reduce Harm:
- Crisis response protocols for preparing for and responding to crisis events such as geopolitical or civic unrest and real-world safety developments.
- Trust and transparency features, including in-product transparency elements (ad and recommender transparency), plain-language rules and enforcement standards, user-facing notices when rules are enforced, and transparency reports that help users understand how rules are applied.
Remediate Harm:
- Content moderation of user-generated content (UGC) according to content and product policies, including proactive policy enforcement via automated detection and sanctioning.
- User reporting and recourse mechanisms that enable users to report impending safety risks or safety incidents to platforms and to dispute decisions already made.
Design Elements
Intuitive, user-friendly design, clear and accessible language, and default user settings can play an important role in a user’s experience with a feature or product and mitigate risk of harm. Reflecting a core principle of SbD, these elements empower users with control over their own experiences.
Default settings that prioritize safety and privacy: Default settings can be foundational in ensuring user safety, security, and privacy from the outset, as they provide users with a baseline configuration that minimizes risk before any customization. By prioritizing safety and privacy in default options, product owners can reduce the likelihood of user exposure to certain harms, particularly when users lack the awareness and/or technical expertise to navigate complex or difficult-to-find settings. Similar to the principle of secure by design, this ensures that the safest options are the standard starting point for every user, and that both the provider and the user play a role in safety outcomes. This includes creating settings that are easily discoverable and that incorporate clear and accessible language.
Opt-in/opt-out settings: The more private option should be the default for certain higher-risk settings, like sharing of location, contact lists, and access to media files. Consent should not be assumed with such features; consent given in the past does not necessarily mean consent in the present moment. A best practice is to ask for consent from users at the beginning of certain interactions and at regular intervals. When a user engages a feature that requires information sharing with the company, other users, or third parties, the interface should communicate, in an easy-to-understand way, what information is being shared, how it will be used, who can see it and for how long (and ideally, the relevant risks of sharing), along with a note that the setting is adjustable and located where users would expect to find it. This includes clear communication on access settings (camera, audio, etc.).
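To make the safe-by-default and re-consent ideas above concrete, here is a minimal sketch; the setting names, default values, and 90-day re-consent interval are hypothetical choices, not a prescribed standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class PrivacyDefaults:
    # Safest options are the starting point; users opt in to broader sharing.
    share_location: bool = False
    share_contact_list: bool = False
    allow_media_access: bool = False
    profile_visibility: str = "friends_only"    # rather than "public"
    direct_messages_from: str = "contacts_only"

@dataclass
class ConsentRecord:
    feature: str
    granted_at: datetime
    # Hypothetical policy: re-ask for consent after 90 days.
    max_age: timedelta = timedelta(days=90)

    def needs_reconfirmation(self, now: datetime) -> bool:
        """Past consent is not treated as indefinite consent."""
        return now - self.granted_at > self.max_age

defaults = PrivacyDefaults()
consent = ConsentRecord("share_location", granted_at=datetime(2024, 1, 1, tzinfo=timezone.utc))
print(defaults.share_location)                                   # False by default
print(consent.needs_reconfirmation(datetime.now(timezone.utc)))  # True -> prompt again
```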
Effective onboarding and statement of norms: There is substantial research on the importance of setting and reinforcing norms in communities, which helps to minimize abuse and promote thriving and quality engagement. In addition to setting normative expectations, the style in which onboarding is presented can also increase the likelihood of absorption. The gaming industry is known for its innovative and gamified methods in onboarding, which can be leveraged to communicate expectations and community guidelines in a way that is less punitive-sounding than many existing onboarding and terms documentation.
Intuitive, transparent design: Deceptive patterns (sometimes called “dark patterns”) should be avoided. These design patterns take a number of forms that manipulate or heavily influence users, via misleading or poorly designed user interface elements, into making unintended choices that may be detrimental to them. Examples include forcing a customer to disclose more information than necessary to access a service, or auto-renewing subscriptions without explicit notice and consent. Transparency, user consent, and content clarity should be at the forefront of the user interface and experience to foster a digital environment where trust and safety are already embedded.
Safety-Enhancing Features
Companies may choose to deploy proactive efforts to reduce the risk of a user encountering harmful actors, behaviors, and/or content via abuse detection technology and other product features:
Screening, Verification, and In-App Communications
In addition to age assurance and/or verification solutions, screening and identity verification play a preventative role in marketplace, dating, and gig economy apps (e.g., Uber Driver Safety, Tinder photo verification). Uber deploys technology to help keep users’ phone numbers private, so neither drivers nor riders will see each other’s numbers when communicating through the Uber app. Once a trip is completed, the app also protects rider information by concealing specific pickup and dropoff addresses in a driver’s trip history. Bumble allows its users to call and video chat with others in the app so that they don’t have to share their personal contact information.
In-Product Tools
Some companies have invested in product-based interventions and friction such as real-time safety alerts, rider journey map design and associated suites of safety tools (e.g., in-app 911 dialing, live integrity support chat, sharing location with trusted friends/family), and Bumble’s Private Detector, which automatically blurs lewd images so users can decide whether they want to view them. Discord released its Teen Safety Assist, a new initiative to protect teens through a series of proactive filters and alerts, which will be enabled by default for teens. This feature includes safety alerts that encourage the recipient to double-check whether they want to reply and provides links to block the user or view more safety tips if needed. The feature also includes sensitive content filters that blur media that may be sensitive in direct messages, group direct messages with friends, and servers. Other companies have instituted event-driven safety guidance; during Halloween, Nextdoor provided safety tips for trick-or-treaters alongside its “Treat Map,” with context on the feature and what information users can expect to share.
Tailored Product Experiences
Some companies have created tailored product experiences; for example, Pinterest has a feature that displays support material for users who appear to be in distress based on their searches, and it is common for companies to hide edge-case content behind warning screens. Some companies may choose to disable adult-to-minor messaging or minors’ ability to stream live video by default, or to block certain terms from triggering results, in addition to providing user education upon certain keyword searches.
Harm can also be reduced by introducing features that encourage healthier content production and interactions among well-intentioned users. Such features include interstitials that create friction by asking users to confirm they want to post toxic content that falls short of violating platform policy, or product nudges to read news stories before sharing them.
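As a simplified illustration of the interventions described in this subsection (blocked search terms, support resources surfaced on sensitive searches, and a pre-post toxicity nudge), the sketch below uses placeholder term lists, a hypothetical resource URL, and an assumed external toxicity score.

```python
from typing import Optional

# Minimal sketch of keyword- and score-triggered safety interventions.
# The term lists, resource URL, and toxicity threshold are placeholders.
BLOCKED_SEARCH_TERMS = {"buy illegal weapon"}                  # return no results at all
SUPPORT_TRIGGER_TERMS = {"self harm", "i want to disappear"}   # surface support material
SUPPORT_RESOURCE = "https://example.org/crisis-support"        # hypothetical URL

def handle_search(query: str) -> dict:
    q = query.lower().strip()
    if q in BLOCKED_SEARCH_TERMS:
        return {"results": [], "notice": "This search is unavailable."}
    if any(term in q for term in SUPPORT_TRIGGER_TERMS):
        # Tailored experience: show support material instead of (or before) results.
        return {"results": [], "notice": f"You are not alone. Help is available: {SUPPORT_RESOURCE}"}
    return {"results": ["...ordinary search results..."], "notice": None}

def pre_post_nudge(toxicity_score: float, threshold: float = 0.8) -> Optional[str]:
    """Friction nudge for borderline content that falls short of violating policy."""
    if toxicity_score >= threshold:
        return "Your reply may be hurtful. Do you want to review it before posting?"
    return None

print(handle_search("I want to disappear")["notice"])
print(pre_post_nudge(toxicity_score=0.86))
```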
Transparency on Product Functionality
Companies can provide more information to users on how their features (particularly algorithms, such as recommender systems) operate, which helps users make more informed decisions. The Tech Policy Press has a robust overview of different options for recommender transparency that encompasses categories such as documentation (user-specific and system level), data, code, and research (see A Menu of Recommender Transparency Options). By incorporating elements such as clear data usage policies, explanations of how algorithms work, or the reasons behind content recommendations, in-product transparency helps demystify the product for the user. This level of openness not only fosters trust but also empowers users to make informed decisions about how they interact with the product (5, 6).
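One hedged illustration of the “reasons behind content recommendations” element of in-product transparency: attaching human-readable reason codes, plus the data categories used, to each recommended item so a “why am I seeing this?” panel can render them. The structure, field names, and reason strings below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RecommendationExplanation:
    item_id: str
    reasons: list[str]       # human-readable reasons shown to the user
    signals_used: list[str]  # data categories used, for a transparency panel

def explain(item_id: str, followed_topic: bool, similar_engagement: bool) -> RecommendationExplanation:
    reasons, signals = [], []
    if followed_topic:
        reasons.append("You follow this topic.")
        signals.append("topics_followed")
    if similar_engagement:
        reasons.append("People with similar interests engaged with this.")
        signals.append("aggregated_engagement")
    if not reasons:
        reasons.append("Popular in your region.")
        signals.append("regional_popularity")
    return RecommendationExplanation(item_id, reasons, signals)

print(explain("post_123", followed_topic=True, similar_engagement=False))
```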
Proactive Abuse Detection and Auto-Sanctioning
Technologies such as text classifiers and image hashing enable platforms to proactively detect harmful content before a user encounters and reports it. Examples include the National Center for Missing and Exploited Children’s (NCMEC) and the Global Internet Forum to Counter Terrorism’s (GIFCT) hash-sharing platforms for child sexual abuse material and terrorist content, respectively. Hashing involves creating a digital fingerprint of a harmful image, identifying matches through comparison, and automatically removing those matches before a user encounters them.
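A minimal sketch of the hash-matching flow described above, using an exact cryptographic hash (SHA-256) as a stand-in for the perceptual hashes typically used by industry hash-sharing programs (which also match slightly altered copies); the hash list and sample bytes are placeholders.

```python
import hashlib

# Hypothetical hash list. Real lists come from programs such as NCMEC's or GIFCT's
# hash-sharing databases, typically as perceptual hashes; exact SHA-256 matching is
# used here only to keep the sketch simple. The sample entry is derived below so the
# demo has something to match against.
SAMPLE_KNOWN_IMAGE = b"placeholder bytes standing in for a known harmful image"
KNOWN_HARMFUL_HASHES = {hashlib.sha256(SAMPLE_KNOWN_IMAGE).hexdigest()}

def fingerprint(image_bytes: bytes) -> str:
    """Create a digital fingerprint of an uploaded file."""
    return hashlib.sha256(image_bytes).hexdigest()

def should_block(image_bytes: bytes) -> bool:
    """Block the upload before any user encounters it if its fingerprint matches."""
    return fingerprint(image_bytes) in KNOWN_HARMFUL_HASHES

print(should_block(SAMPLE_KNOWN_IMAGE))    # True: matched and removed proactively
print(should_block(b"some benign photo"))  # False: allowed through
```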
Content Moderation
If a product includes user-generated content (UGC), T&S teams will need to consider how SbD informs the workstreams around policy development (content, behavior, and product policy). Development of a moderation strategy can be informed by red team exercises or risk assessments that recommend features tailored to prevent specific abuse types, enabling policy writers to craft policy and enforcement designs customized to the nature of the product/feature. Mentioning safety explicitly in policy, and crafting enforcement guidelines that account for harder-to-verify but still relevant applications of safety (like how to handle offline harms resulting from products whose features facilitate online and offline meetups), can reinforce a company’s commitment to safety. Bumble, Uber, and Twitch all explicitly mention safety as a critical principle driving their design efforts and policy and enforcement philosophies. Niantic’s Safety Center references safety as a principle and includes a live event code of conduct in addition to other in-game policies that prioritize user safety.
For proactive policy enforcement, deploying automated systems and artificial intelligence is a choice many companies make, but doing so can also introduce bias, inconsistency, and/or loss of nuance in market-specific contexts. Conducting red team analyses of models, testing for certain biases or other limitations, can help illuminate gaps in detection coverage or other potential weaknesses that could result in contextual harms. Examples of services that aid in preventative measures include image hashing for CSAM detection and StopNCII.
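As a simplified illustration of the kind of red-team analysis described above, the sketch below compares a classifier’s recall across languages on a labeled test set to surface coverage gaps; the records, labels, and languages are hypothetical.

```python
from collections import defaultdict

# Each record: (language, true_label, predicted_label); placeholder red-team test set.
test_set = [
    ("en", "violating", "violating"), ("en", "benign", "benign"),
    ("hi", "violating", "benign"),    ("hi", "benign", "benign"),
    ("sw", "violating", "benign"),    ("sw", "violating", "benign"),
]

def recall_by_language(records):
    """Share of truly violating items the model catches, per language."""
    hits, totals = defaultdict(int), defaultdict(int)
    for lang, truth, pred in records:
        if truth == "violating":
            totals[lang] += 1
            hits[lang] += int(pred == "violating")
    return {lang: hits[lang] / totals[lang] for lang in totals}

print(recall_by_language(test_set))  # e.g., {'en': 1.0, 'hi': 0.0, 'sw': 0.0} -> coverage gap
```

Large disparities across languages, content types, or identity groups in a table like this are the kind of gap a red-team exercise would flag for mitigation before (or after) launch.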
User Reporting and Recourse
Academic research on procedural justice applied to online services suggests that the process through which decisions are made and communicated can have a positive effect on user behavior. If users understand the rules and perceive enforcement decisions as fair and trustworthy, they are less likely to commit future offenses. A practical example of this comes from Discord’s Warning Center. User reporting mechanisms should also be designed with the unique safety concerns of a given product/feature in mind. A June 2023 report from PEN America and Meedan (Shouting into the Void: Why Reporting Abuse to Social Media Platforms Is So Hard and How to Fix It – PEN America) outlines a number of actionable recommendations to improve the user reporting experience based on research and industry learnings.
While a regulatory requirement for some services, the ability for users to appeal decisions made by human reviewers, automated systems, or a combination of the two is key to minimizing bias, and it is also an opportunity for companies to integrate feedback into their moderation infrastructure.
Regulatory and normative pressures will continue to mandate transparency reporting on how companies hold violators of their community guidelines accountable and how companies approach trust and safety more broadly. In addition to typical transparency reporting requirements, some companies have published safety-specific materials documenting what has occurred on their products and how they have handled it, along with instructions on how users can empower themselves and their communities when using features that facilitate offline engagement.
Crisis Response Protocols
Some companies have protocols for preparing for and responding to crisis events such as elections, geopolitical events, civic unrest, or other real-world safety developments that may necessitate bespoke action to minimize harm to users. These options can range from pre-crisis intelligence analysis, forecasting, and incident monitoring to “break glass” measures in product and surges in resource allocation to ensure rapid and coordinated responses. By having well-defined crisis response protocols, online services can maintain their integrity, trustworthiness, and user loyalty even in the face of adversity, showing that they are well-prepared to handle unexpected challenges. Such protocols are now mandated for large online platforms by online safety regulation like the Digital Services Act.
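“Break glass” measures of the kind mentioned above are often implemented as pre-approved configuration switches that can be flipped quickly during a crisis. Below is a minimal sketch under that assumption; the specific levers and their values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class BreakGlassConfig:
    # Hypothetical crisis levers, pre-built so they can be enabled quickly.
    limit_forwarding: bool = False            # cap re-shares/forwards to slow virality
    disable_trending_surfaces: bool = False
    require_click_through_interstitials: bool = False
    surge_reviewer_capacity: bool = False

NORMAL = BreakGlassConfig()
ELECTION_CRISIS = BreakGlassConfig(
    limit_forwarding=True,
    disable_trending_surfaces=True,
    require_click_through_interstitials=True,
    surge_reviewer_capacity=True,
)

def active_config(crisis_declared: bool) -> BreakGlassConfig:
    """A crisis declaration swaps in the pre-approved break-glass configuration."""
    return ELECTION_CRISIS if crisis_declared else NORMAL

print(active_config(crisis_declared=True))
```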
Measuring Safety by Design
Measuring the effectiveness of a preventive approach to safety is challenging given the difficulty of estimating the counterfactual: What might have happened without the preventive intervention? (For more general monitoring of content moderation metrics, see TSPA’s Metrics for Content Moderation.) There are no standardized metrics, and “by design” approaches to safety are shaped heavily by the type of platform and its user base. For example, online marketplaces typically focus on illegal goods, fraud, and scams, while dating platforms focus on keeping users safe and reducing unsolicited imagery.
One simple way for a platform to track the maturity of its preventive approach to safety is to assess the maturity of its SbD processes: the extent to which SbD workflows and methods are integrated into business and product development processes. For example, what percentage of product launches have been through a safety risk assessment, and what percentage of the safety features/recommendations were built into the product at launch or rolled out globally? Are safety trainings a required part of new employee onboarding? Are there mandated accessibility or authenticity (such as 2FA) requirements for new feature development? In addition, it is worth looking at the adoption rate of opt-in safety features and the uptake of in-product safety tools to understand whether more safety education or discoverability is needed. Platforms should also consider the availability and coverage of safety interventions for products used by vulnerable groups, such as children and minority communities.
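A minimal sketch of the process-maturity questions above, expressed as simple coverage ratios over launch records; the record fields and example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class LaunchRecord:
    name: str
    had_risk_assessment: bool
    safety_recs_proposed: int
    safety_recs_shipped_at_launch: int

# Placeholder launch history
launches = [
    LaunchRecord("live_video", True, 5, 4),
    LaunchRecord("group_dm", True, 3, 3),
    LaunchRecord("marketplace_chat", False, 0, 0),
]

assessed = sum(rec.had_risk_assessment for rec in launches) / len(launches)
proposed = sum(rec.safety_recs_proposed for rec in launches)
shipped = sum(rec.safety_recs_shipped_at_launch for rec in launches)

print(f"Launches with a safety risk assessment: {assessed:.0%}")           # 67%
print(f"Safety recommendations built in at launch: {shipped}/{proposed}")  # 7/8
```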
A balance of quantitative (drawing from platform datasets or third party threat monitoring services) and qualitative (user research, customer feedback and surveys) methods can be used to form a holistic picture of a platform’s safety. In addition, a platform must be able to assess actual and perceived harm across its different product surfaces (e.g., messaging, listings, user signups, content uploaded) and across different regions, languages, and user-groups.
In general, assessing the state of safety of a platform involves assessing:
- Harm reduction efforts, which require estimating the baseline prevalence of different types of harm: actual harm occurring on the platform and user perceptions of safety (i.e., user sentiment scoring).
- Maturity of the platform’s harm redressal capabilities.
Assessing Harm Reduction Efforts
Assessing harm reduction efforts begins with prevalence estimation: the volume of violative content or interactions on the platform. Prevalence is typically measured by sampling random content from a platform, labeling that content, and extrapolating the prevalence (policy-violating samples / total sample size) to the total population. Prevalence measurement can be fickle and expensive, but reduction in this measure over time is a strong indicator that a platform’s safety strategy is effective (a minimal estimation sketch follows the list below). Other metrics include:
- Violative impressions: One post can be seen by one user, or millions of users. Prevalence metrics can be weighted by their exposure to be more representative of the actual reach of the content (see for example, Meta’s approach here).
- Law enforcement escalations and data requests: Most platforms report data externally in some form; for example, illegal content such as child sexual abuse material (CSAM) is often reported to relevant authorities (e.g., NCMEC, INHOPE), and cases of imminent threat to life such as suicide and self-harm are often voluntarily escalated to law enforcement. These reports (which may include data requests from law enforcement or government agencies) can be a good signal for the most egregious types of harm on a platform.
- User sentiment score: It is also important to understand whether users feel safe, whether they feel educated and informed, whether they know how to find information, report content, and help their community. A robust safety by design program will typically translate into strong user sentiment scores and strong adoption/uptake of user safety controls. A survey-based metric can be designed here (similar to NPS scoring) based on different questions, including:
- “Did this help you feel safer on the platform?” after a user reporting flow or appeal flow could give a signal on whether users found the right resolution for their problem;
- “Did you see any upsetting or harmful content?” could help platforms understand how users feel about the content they see. Split by user group or region, this can offer insights on the effectiveness of policies and their enforcement workflows, and shed light on gaps for specific user groups.
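As referenced above, here is a minimal sketch of sampling-based prevalence estimation, including an impression-weighted variant; the sample, impression counts, and normal-approximation confidence interval are illustrative only.

```python
import math

# Hypothetical labeled random sample: (is_violating, impressions_for_that_item)
sample = [(False, 120), (True, 15), (False, 3000), (False, 40), (True, 900)]

n = len(sample)
violating = sum(1 for v, _ in sample if v)

# Simple prevalence: policy-violating samples / total sample size
prevalence = violating / n

# 95% confidence interval (normal approximation; real programs use much larger samples)
se = math.sqrt(prevalence * (1 - prevalence) / n)
ci = (max(0.0, prevalence - 1.96 * se), min(1.0, prevalence + 1.96 * se))

# Impression-weighted prevalence (closer to the "violative impressions" idea above)
total_views = sum(i for _, i in sample)
violating_views = sum(i for v, i in sample if v)
weighted_prevalence = violating_views / total_views

print(f"Prevalence: {prevalence:.1%} (95% CI {ci[0]:.1%}-{ci[1]:.1%})")
print(f"Impression-weighted prevalence: {weighted_prevalence:.1%}")
```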
Assessing the Maturity of Harm Redressal Mechanisms
The effectiveness of harm redressal mechanisms is captured by the following categories of metrics, the publication of which is increasingly required as part of regulatory transparency obligations.
- Enforcement metrics cover sanctions applied or actions taken by platform systems against user-generated content/interactions (e.g., account disables, removals, downranking). Proactively deployed automated sanctions serve the preventive goal of addressing harm before a user encounters it. Automation must be balanced with accuracy to reduce the risk of over-enforcement, the costs of which vary depending on who is impacted, what rights are impacted, and the severity of the impact. An uptick in SbD adoption will likely result in a reduction of aggregate enforcement metrics, reflecting the prevention rather than remediation of harm. In addition, a growing share of proactive enforcement and shrinking turnaround times indicate enforcement scalability and reduced user exposure to harm (a minimal computation sketch follows this list). These metrics can include:
- Percentage of proactive enforcement (the share of content actioned without human intervention, or before users reported it);
- Total Turnaround Time (TAT), from detection to actioning (How long does violative content remain on the platform before it is actioned?).
- Detection and coverage gaps: Models often perform better for some languages and content types than others, and looking at the incident rate of the same detection mechanism (e.g., AI detection) across different product surfaces (e.g., feed, messaging), content types (e.g., video, text), and languages can reveal gaps in coverage and performance. Comparing enforcement rates for a given policy over time and across different products, content types, or languages can indicate classifier performance gaps or disparities in cultural patterns. In addition, comparing enforcement rates across identity groups can reveal patterns of bias and exclusion.
- Virality alerts: Alerts for spikes in incident volume or rates for the same or similar content can indicate that content is going viral. These can also offer signals about evolving harm trends or emerging crises. Similarly, user reporting rates can peak at such times, indicating potential problem areas. Platforms should monitor these headline metrics carefully, paying attention to unusual and sustained patterns.
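A minimal sketch of the enforcement metrics referenced in this list (proactive enforcement share and turnaround time), plus a naive volume-spike check for virality alerts; the records and spike threshold are hypothetical.

```python
from datetime import datetime
from statistics import median

# Hypothetical enforcement records: (detected_at, actioned_at, detected_proactively)
actions = [
    (datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 20), True),
    (datetime(2024, 5, 1, 11, 0), datetime(2024, 5, 1, 15, 0), False),
    (datetime(2024, 5, 1, 12, 0), datetime(2024, 5, 1, 12, 5), True),
]

# Share of content actioned proactively (before a user report)
proactive_rate = sum(p for _, _, p in actions) / len(actions)

# Turnaround time (TAT): how long violative content stayed up after detection
tats_minutes = [(done - found).total_seconds() / 60 for found, done, _ in actions]

def spike_alert(hourly_counts, multiplier=3.0):
    """Naive virality/incident alert: latest hour vs. median of earlier hours."""
    *history, latest = hourly_counts
    return latest > multiplier * median(history)

print(f"Proactive enforcement share: {proactive_rate:.0%}")       # 67%
print(f"Median turnaround time: {median(tats_minutes):.0f} min")  # 20 min
print(spike_alert([12, 9, 15, 11, 80]))                           # True -> investigate
```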