The trust-and-safety-models component within the X Recommendation Algorithm is a critical suite of machine learning models designed to uphold platform integrity by identifying and mitigating harmful content. These models play a vital role in filtering out Not Safe For Work (NSFW) or abusive content across X's product surfaces, contributing to a safer user experience. This component is part of the broader Core Architecture that powers X's recommendations.
The repository includes implementations for three primary Trust and Safety categories: Abusive Content, Toxicity, and NSFW content, each with specialized models and training pipelines. All models are built on TensorFlow, often integrating BERT-based text encoders, and are evaluated with metrics such as AUC and Precision-Recall.
The abusive_model.py file outlines a TensorFlow-based model for detecting various forms of abusive content.
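One notable detail is its handling of class imbalance: a custom multilabel weighted loss driven by a per-label pos_weight_tensor derived from training data. The weighting idea can be sketched in NumPy (a minimal sketch; the actual TensorFlow implementation and weight values are not reproduced here):

```python
import numpy as np

def multilabel_weighted_loss(y_true, y_pred, pos_weight, eps=1e-7):
    """Weighted binary cross-entropy averaged over labels and examples.

    pos_weight up-weights the positive term per label, so rarer abuse
    labels are not drowned out by the majority class. (Sketch only:
    real weights come from training-data label frequencies.)
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    per_label = -(pos_weight * y_true * np.log(y_pred)
                  + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(per_label.mean())
```

With pos_weight above 1.0, a missed positive costs more than a false alarm, which shifts the trained model toward higher recall on rare labels.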
- Text encoding: a TextEncoder (specifically a TextEncoderPooledOutput subclass derived from twitter.cuad.representation.models.text_encoder) processes tweet text with media annotations, likely using BERT-based embeddings (twitter_multilingual_bert_base_cased_mlm).
- Additional features: precision_nsfw, has_media, and num_media are incorporated via a FeatureEncoder.
- Distributed training: MirroredStrategy is used for training on GPUs.
- Optimization: the create_optimizer function (from twitter.cuad.representation.models.optimization) supplies the optimizer (e.g., adamw).
- Loss: a custom multilabel_weighted_loss addresses class imbalance, using a pos_weight_tensor derived from training data.
- Metrics: tf.keras.metrics.AUC with curve="PR" and multi_label=True, indicating a focus on multi-label Precision-Recall AUC.
- Evaluation: performed across various subsets of data, such as test_only_media, test_only_nsfw, and test_no_media, to understand model performance under different conditions.

The toxicity sub-directory contains models aimed at detecting toxic content, with a modular design for loading, training, and rescoring.
Model loading (load_model.py):
- The load_encoder function handles loading TextEncoder models (e.g., twitter_bert_base_en_uncased_mlm or twitter_multilingual_bert_base_cased_mlm), supporting local model paths and dynamic shapes.
- TFAutoModelForSequenceClassification from HuggingFace is used for models like "bertweet-base".
- The get_loss function supports various loss functions, including BinaryCrossentropy, CategoricalCrossentropy, SparseCategoricalCrossentropy, BinaryFocalCrossentropy, and a custom MaskedBCE.
- _add_additional_embedding_layer and _get_bias fine-tune the final dense layers, often with glorot_uniform initialization and sigmoid or softmax activation depending on num_classes or num_raters.
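MaskedBCE is a custom loss whose exact definition is not reproduced in this summary. A plausible NumPy sketch, assuming a sentinel value marks missing labels (e.g., examples not scored by every rater), would ignore masked entries and average binary cross-entropy over the rest:

```python
import numpy as np

MASK_VALUE = -1.0  # hypothetical sentinel for "no label available"

def masked_bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy that skips masked label entries.

    Entries in y_true equal to MASK_VALUE contribute nothing to the
    loss; the result is the mean BCE over the remaining entries.
    (Assumed behavior, not the repository's actual MaskedBCE.)
    """
    y_true = np.asarray(y_true, dtype=float)
    mask = y_true != MASK_VALUE
    p = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    bce = -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
    return float((bce * mask).sum() / max(mask.sum(), 1))
```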
Training (train.py):
- The Trainer class orchestrates the training process, handling data loading, preprocessing, optimization, and callbacks.
- A BalancedMiniBatchLoader addresses class imbalance, generating mini-batches with stratified sampling.
- Learning-rate schedules, such as WarmUp and PolynomialDecay, are configured via get_lr_schedule. Optimizers like Adam or AdamW (from tensorflow_addons.optimizers) are supported.
- Custom callbacks (AdditionalResultLogger, ControlledStoppingCheckpointCallback, SyncingTensorBoard, GradientLoggingTensorBoard) handle logging, checkpointing, and early stopping.
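The stratified sampling performed by the BalancedMiniBatchLoader can be illustrated with a small pure-Python sketch (names and behavior here are assumptions; the real loader operates on TensorFlow datasets):

```python
import random
from collections import defaultdict

def balanced_minibatches(examples, labels, batch_size, seed=0):
    """Yield mini-batches with equal counts per class.

    Each batch draws batch_size // n_classes examples per class,
    sampling with replacement so minority classes are never exhausted.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append(x)
    classes = sorted(by_class)
    per_class = batch_size // len(classes)
    while True:
        batch = []
        for c in classes:
            batch.extend((x, c) for x in rng.choices(by_class[c], k=per_class))
        rng.shuffle(batch)  # avoid class-ordered batches
        yield batch
```

Even if, say, only 2% of examples are toxic, every batch the model sees is half positive and half negative, which is the imbalance correction the loader provides.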
Rescoring (rescoring.py):
- The score function enables inference on new dataframes using a trained model. It can load models directly or use a load_inference_func for saved models.

The nsfw sub-directory contains models specifically designed to detect NSFW content in both media and text.
Media model (nsfw_media.py):
- Input: tf.data.TFRecordDataset examples containing an embedding (fixed-length float32 features, e.g., 256-dimensional, likely from a CLIP-like model) and labels (int64). Preprocessing casts labels and optionally resamples to balance classes.
- Architecture: a Sequential Keras model with Dense layers, BatchNormalization, and activations such as tanh or gelu. The final layer is Dense(1, activation='sigmoid') for binary classification.
- Hyperparameter tuning: kerastuner.tuners.BayesianOptimization searches over activation, kernel_initializer, num_layers, and units.
- Training: tf.keras.optimizers.Adam with loss='binary_crossentropy'; tf.keras.callbacks.EarlyStopping monitors validation loss to prevent overfitting.
- Evaluation: tf.keras.metrics.PrecisionAtRecall and tf.keras.metrics.AUC(curve="PR"). Performance is visualized with Precision-Recall curves, often comparing different test sets (e.g., the "MU test set" and "sens prev test set").

Text model (nsfw_text.py):
- Input: tf.data.Dataset slices with one-hot encoded labels.
- Architecture: a TextEncoder (e.g., twitter_bert_base_en_uncased_augmented_mlm) generates embeddings from the processed tweet text; a Dense layer with softmax activation outputs 2-class predictions.
- Training: create_optimizer (e.g., adamw) with BinaryCrossentropy loss; tf.keras.metrics.AUC(curve='PR') is the primary metric. Callbacks include EarlyStopping, ModelCheckpoint, and TensorBoard.
- Evaluation: classification_report and precision_recall_curve demonstrate model effectiveness on processed text inputs.
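To make the evaluation metrics concrete, here is a NumPy sketch of what a PrecisionAtRecall-style metric reports (an approximation of tf.keras.metrics.PrecisionAtRecall's semantics, not its actual implementation):

```python
import numpy as np

def precision_at_recall(y_true, scores, target_recall):
    """Best precision achievable at recall >= target_recall.

    Sweeps every score threshold, keeps operating points whose recall
    meets the target, and returns the highest precision among them.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)          # true positives as the threshold lowers
    fp = np.cumsum(1 - y)      # false positives as the threshold lowers
    recall = tp / max(int(y.sum()), 1)
    precision = tp / (tp + fp)
    ok = recall >= target_recall
    return float(precision[ok].max()) if ok.any() else 0.0
```

This kind of metric suits NSFW filtering, where a minimum recall (catching most unsafe media) is fixed first and precision is then maximized at that operating point.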