Leveraging Data in Trust & Safety

Data Principles and Goals

Principles are important for effective governance and analysis of data. They tell trust and safety teams what is and isn’t allowed in their companies and organizations when it comes to using data. They also guide sound decisions, build trust, and enable greater innovation. Data principles may vary across organizations because of differences in company culture, in values around trust and safety, and in approaches to data governance.

Some of the most common data principles define expectations for how trust and safety teams should use data. These include:

  • Identify patterns and events;
  • Collect, interpret, and explain data;
  • Create impact;
  • Reduce risk.

The most important thing to remember is that data principles should reflect the goals and priorities of the overall organization, and comply with relevant regulatory, privacy, and contractual obligations. Team members should be able to trust that by following these principles, they’re operating within approved guardrails set by local governing laws, industry or organization best practices, and ethics.

Exploring and Using Data

Working with Raw Data

One of the most foundational forms of data is fact tables, which often hold event logs—tables with a row of data every time a particular event occurs. These tables usually contain the freshest and fastest data for responding quickly to urgent situations, as well as the full history of a situation.

Fact tables are also easy to aggregate in various ways, for example to get counts of user-reported violations or the latest decision applied to a review. These tables can get large very quickly in technology companies that deal with trust and safety problems, so they tend to contain only the most necessary variables and are joined to other tables when other information is needed.
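The two aggregations mentioned above can be sketched in a few lines of Python. This is a minimal illustration rather than a description of any particular platform's schema: the `EVENTS` rows, the field names (`event_type`, `review_id`, `violation_type`, `decision`, `event_time`), and both helper functions are assumptions made for the example.

```python
from collections import Counter

# Hypothetical fact table: one row per event, in the order events were logged.
# All field names and values here are illustrative assumptions.
EVENTS = [
    {"event_type": "user_report", "review_id": "r1", "violation_type": "spam",
     "decision": None, "event_time": "2024-03-01T09:00:00"},
    {"event_type": "user_report", "review_id": "r1", "violation_type": "spam",
     "decision": None, "event_time": "2024-03-01T09:10:00"},
    {"event_type": "user_report", "review_id": "r2", "violation_type": "harassment",
     "decision": None, "event_time": "2024-03-01T09:15:00"},
    {"event_type": "decision", "review_id": "r1", "violation_type": "spam",
     "decision": "warning", "event_time": "2024-03-01T10:00:00"},
    {"event_type": "decision", "review_id": "r2", "violation_type": "harassment",
     "decision": "no_action", "event_time": "2024-03-01T10:05:00"},
    {"event_type": "decision", "review_id": "r2", "violation_type": "harassment",
     "decision": "removed", "event_time": "2024-03-01T11:00:00"},
]


def count_user_reports(events):
    """Count user-report events per violation type."""
    return Counter(
        e["violation_type"] for e in events if e["event_type"] == "user_report"
    )


def latest_decisions(events):
    """Return the most recent decision for each review.

    ISO 8601 timestamps in one consistent format sort correctly as strings.
    """
    latest = {}
    for e in events:
        if e["event_type"] != "decision":
            continue
        current = latest.get(e["review_id"])
        if current is None or e["event_time"] > current["event_time"]:
            latest[e["review_id"]] = e
    return {review_id: e["decision"] for review_id, e in latest.items()}


print(count_user_reports(EVENTS))
print(latest_decisions(EVENTS))
```

Because the log keeps every event, the earlier "no_action" decision on review `r2` stays available as history even though the aggregate reports only the latest outcome.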

The other common form of data is dimension (or dim) tables, which usually hold details of a specific object or entity. An example of a dim table is a user table containing details like username, account creation date, and last login date. These tables tend to have more variables and can be helpful for getting more general views, and for spotting common patterns among policy violators.
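As a rough sketch of how a fact table joins onto a dim table, the example below enriches violation events with user details and then looks for a simple pattern: whether violations come mostly from new accounts. The `USERS` and `VIOLATIONS` data, the field names, and the 30-day threshold for a "new" account are all assumptions chosen for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical dim table: one entry per user, keyed by user_id.
USERS = {
    "u1": {"username": "alice", "account_created": date(2021, 5, 1)},
    "u2": {"username": "bob", "account_created": date(2024, 2, 27)},
    "u3": {"username": "carol", "account_created": date(2024, 2, 28)},
}

# Hypothetical fact table: one row per violation event.
VIOLATIONS = [
    {"user_id": "u2", "violation_type": "spam", "event_date": date(2024, 3, 1)},
    {"user_id": "u3", "violation_type": "spam", "event_date": date(2024, 3, 1)},
    {"user_id": "u1", "violation_type": "harassment", "event_date": date(2024, 3, 2)},
    {"user_id": "u9", "violation_type": "spam", "event_date": date(2024, 3, 2)},
]


def join_events_to_users(events, users):
    """Inner join: keep only events whose user_id exists in the dim table."""
    joined = []
    for event in events:
        user = users.get(event["user_id"])
        if user is not None:
            joined.append({**event, **user})
    return joined


def violations_by_account_age(joined_rows, new_account_days=30):
    """Bucket violations by whether the account was new at the time of the event."""
    counts = Counter()
    for row in joined_rows:
        age_days = (row["event_date"] - row["account_created"]).days
        bucket = "new" if age_days < new_account_days else "established"
        counts[bucket] += 1
    return counts


joined = join_events_to_users(VIOLATIONS, USERS)
print(violations_by_account_age(joined))
```

Note that the inner join silently drops the event for `u9`, which has no matching user row. In practice it is worth counting unmatched rows, since they often point to a data quality problem.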

Data Cleaning and Validation

Organizations typically spend a substantial amount of time and effort cleaning data, which includes fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. Particularly when combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. Skipping or rushing data cleaning and validation can lead to significant risks, such as incorrect policy decisions or erroneous data being provided to legal or regulatory authorities or the wider public.

To check the validity of a dataset or combination of datasets, consider the following characteristics: 

  • Completeness refers to how comprehensive or whole the data is. Data can be incomplete because known events or facts are missing from a dataset, or because information about recorded events is missing. Sometimes incomplete data is unusable. Using data with missing information can lead to mistakes and false conclusions.
  • Consistency refers to whether the same or similar data kept in different places matches. This can also include contradictory data, for example a record that someone logged in to a platform before their account was created.
  • Accuracy is a measure of whether the data is correct and can be used as a reliable source of information; accuracy is generally established based on comparisons to known examples of labeled data that can be independently verified or otherwise have a high level of trust.
  • Validity refers to whether a particular data value is possible—for example, a negative age or a phone number with letters in it would usually not be valid data points. This is distinct from accuracy because it covers values that should never occur and can be identified through rules and definitions.
  • Uniqueness / Duplication looks at whether each data point is a unique instance or if there are multiple copies of the data. This can include situations with two or more inconsistent records where there should be only one, or complete copies of a record leading to double counting. Another form of duplication is when multiple values represent the same thing—for example, separate values for “New York” and “new york”—which can cause data to be aggregated incorrectly.
  • Timeliness refers to how up to date the data is. Timely data is important to be able to react and respond to new insights quickly, but can also cause confusion and errors if old and new data are mixed, or a lag in data isn’t apparent.
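Several of the characteristics above can be checked with simple rules. The sketch below is one possible approach under stated assumptions: the record fields (`user_id`, `age`, `city`, `account_created`, `last_login`, `updated_at`), the one-day staleness threshold, and the function names are all hypothetical. Accuracy is left out because it requires comparison against independently verified labels rather than rules.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical required fields for a user record.
REQUIRED_FIELDS = ("user_id", "age", "city", "account_created", "last_login", "updated_at")


def check_records(records, now, max_lag=timedelta(days=1)):
    """Return the indexes of records that fail each rule-based check."""
    issues = {"incomplete": [], "inconsistent": [], "invalid": [], "duplicate": [], "stale": []}
    seen_user_ids = set()
    for i, rec in enumerate(records):
        # Completeness: every required field must be present and non-null.
        if any(rec.get(field) is None for field in REQUIRED_FIELDS):
            issues["incomplete"].append(i)
            continue  # the remaining checks need all fields
        # Consistency: a login cannot precede account creation.
        if rec["last_login"] < rec["account_created"]:
            issues["inconsistent"].append(i)
        # Validity: an age can never be negative.
        if rec["age"] < 0:
            issues["invalid"].append(i)
        # Uniqueness: assume there should be exactly one record per user.
        if rec["user_id"] in seen_user_ids:
            issues["duplicate"].append(i)
        seen_user_ids.add(rec["user_id"])
        # Timeliness: flag records not refreshed within the allowed lag.
        if now - rec["updated_at"] > max_lag:
            issues["stale"].append(i)
    return issues


def city_counts(records):
    """Aggregate by city after normalizing case and surrounding whitespace."""
    return Counter(rec["city"].strip().lower() for rec in records if rec.get("city"))


NOW = datetime(2024, 3, 10, 18, 0)
FRESH = datetime(2024, 3, 10, 12, 0)
RECORDS = [
    {"user_id": "u1", "age": 30, "city": "New York", "account_created": datetime(2024, 1, 1),
     "last_login": datetime(2024, 3, 1), "updated_at": FRESH},
    {"user_id": "u2", "age": -5, "city": "new york", "account_created": datetime(2024, 1, 5),
     "last_login": datetime(2024, 1, 4), "updated_at": FRESH},
    {"user_id": "u1", "age": 30, "city": "New York", "account_created": datetime(2024, 1, 1),
     "last_login": datetime(2024, 3, 1), "updated_at": FRESH},
    {"user_id": "u3", "age": None, "city": "Chicago", "account_created": datetime(2024, 2, 1),
     "last_login": datetime(2024, 3, 9), "updated_at": FRESH},
    {"user_id": "u4", "age": 41, "city": "Boston", "account_created": datetime(2023, 6, 1),
     "last_login": datetime(2024, 2, 28), "updated_at": datetime(2024, 3, 1)},
]

print(check_records(RECORDS, NOW))
print(city_counts(RECORDS))
```

Returning record indexes rather than dropping rows keeps the decision about how to handle each issue (fix, exclude, or escalate) separate from detection.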

Responsible and Ethical Considerations

The following sections serve as a high-level overview of responsible and ethical considerations. These materials are intended for educational purposes only and do not constitute legal advice. Readers should work cross-functionally with Legal, Privacy, HR, and other departments to determine specific applications.

Privacy

Because of the sensitive nature of the kind of data described above, it is crucial to consider where this data will be stored, and who will have access. Common practices include:

  • Storing these types of data in a separate database or behind access gates so that only a limited number of people have access;
  • Granting access only to essential research and data staff, who serve as points of contact (POCs) and provide summary-level data as needed;
  • Requiring staff to sign an NDA and privacy statement before being granted access;
  • Monitoring systems to identify any attempt to access private data inappropriately, either externally or by staff.

In Trust and Safety it is common for there to be a wide range of different privacy levels for data—public comments are generally much less restricted than private messages, for example. Similarly, the level of access granted can depend on how likely or serious a potential policy violation is.

Of particular note is Personally Identifiable Information, or PII, which can easily be traced to or associated with a specific person. Names, phone numbers, and email addresses are clear examples of PII. However, PII in trust and safety can also be much broader: most forms of user-generated content may contain something uniquely identifiable, and so may also need to be treated as PII. Using or sharing PII data is often heavily restricted and may require legal or policy review.

Occasionally trust and safety professionals are required to share data with external parties, such as reporting imminent threats to law enforcement or providing summaries to government regulators. Any data shared should always be explicitly documented and reviewed in detail to ensure it is accurate and appropriate to share. More details on data requests can be found in the Law Enforcement Chapter.

Protections of personnel data also need to be in place for the security and wellbeing of staff members. This may be especially important given that content moderators may face potential retaliation over their review decisions.

Although debate continues about when and how informed consent should be collected, many T&S data professionals treat informed consent as an important consideration.

In addition to meeting the legal requirements regarding informed consent, data storage, and data retention, it is generally recommended to also reference best practices on research and data from academia. For instance, the three principles listed in the Belmont Report (respect for persons, beneficence, and justice) often serve as the gold standard for ethical research and data use in the United States.

Whenever collecting sensitive data (e.g., wellness metrics, biometrics, PII), it is important to have participants agree to a consent form before providing their data. The consent form should inform participants how their data will be used, who will have access to it, and how it will be stored. The Collaborative Institutional Training Initiative (CITI) offers a variety of training courses on ethics (e.g., informed consent, conflict of interest) and serves as an excellent resource. Given that what requires informed consent and what constitutes informed consent may vary by geographic location, trust and safety teams may want to seek help from Legal and from local agencies to ensure all legal, local, and ethical guidelines are met.