The trust-and-safety-models component within the X Recommendation Algorithm is a critical suite of machine learning models designed to uphold platform integrity by identifying and mitigating harmful content. These models play a vital role in filtering out Not Safe For Work (NSFW) or abusive content across X's product surfaces, contributing to a safer user experience. This component is part of the broader Core Architecture that powers X's recommendations.
The repository includes implementations for three primary Trust and Safety categories: Abusive Content, Toxicity, and NSFW content, each with specialized models and training pipelines. All models are built on TensorFlow, often integrating BERT-based text encoders, and are evaluated with metrics such as AUC and Precision-Recall.
The abusive_model.py file outlines a TensorFlow-based model for detecting various forms of abusive content.
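One notable detail is its handling of class imbalance: a custom multilabel weighted loss driven by a per-label pos_weight_tensor derived from training data. The weighting idea can be sketched in NumPy (a minimal sketch; the actual TensorFlow implementation and weight values are not reproduced here):

```python
import numpy as np

def multilabel_weighted_loss(y_true, y_pred, pos_weight, eps=1e-7):
    """Weighted binary cross-entropy averaged over labels and examples.

    pos_weight up-weights the positive term per label, so rarer abuse
    labels are not drowned out by the majority class. (Sketch only:
    real weights come from training-data label frequencies.)
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    per_label = -(pos_weight * y_true * np.log(y_pred)
                  + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(per_label.mean())
```

With pos_weight above 1.0, a missed positive costs more than a false alarm, which shifts the trained model toward higher recall on rare labels.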
- Text encoding: a TextEncoder (specifically a TextEncoderPooledOutput subclass derived from twitter.cuad.representation.models.text_encoder) processes tweet text with media annotations, likely using BERT-based embeddings (twitter_multilingual_bert_base_cased_mlm).
- Additional features: precision_nsfw, has_media, and num_media are incorporated via a FeatureEncoder.
- Distributed training: MirroredStrategy is used for training on GPUs.
- Optimization: the create_optimizer function (from twitter.cuad.representation.models.optimization) supplies the optimizer (e.g., adamw).
- Loss: a custom multilabel_weighted_loss addresses class imbalance, using a pos_weight_tensor derived from training data.
- Metrics: tf.keras.metrics.AUC with curve="PR" and multi_label=True, indicating a focus on multi-label Precision-Recall AUC.
- Evaluation: performed across various subsets of data, such as test_only_media, test_only_nsfw, and test_no_media, to understand model performance under different conditions.

The toxicity sub-directory contains models aimed at detecting toxic content, with a modular design for loading, training, and rescoring.
Model loading (load_model.py):
- The load_encoder function handles loading TextEncoder models (e.g., twitter_bert_base_en_uncased_mlm or twitter_multilingual_bert_base_cased_mlm), supporting local model paths and dynamic shapes.
- TFAutoModelForSequenceClassification from HuggingFace is used for models like "bertweet-base".
- The get_loss function supports various loss functions, including BinaryCrossentropy, CategoricalCrossentropy, SparseCategoricalCrossentropy, BinaryFocalCrossentropy, and a custom MaskedBCE.
- _add_additional_embedding_layer and _get_bias fine-tune the final dense layers, often with glorot_uniform initialization and sigmoid or softmax activation depending on num_classes or num_raters.
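MaskedBCE is a custom loss whose exact definition is not reproduced in this summary. A plausible NumPy sketch, assuming a sentinel value marks missing labels (e.g., examples not scored by every rater), would ignore masked entries and average binary cross-entropy over the rest:

```python
import numpy as np

MASK_VALUE = -1.0  # hypothetical sentinel for "no label available"

def masked_bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy that skips masked label entries.

    Entries in y_true equal to MASK_VALUE contribute nothing to the
    loss; the result is the mean BCE over the remaining entries.
    (Assumed behavior, not the repository's actual MaskedBCE.)
    """
    y_true = np.asarray(y_true, dtype=float)
    mask = y_true != MASK_VALUE
    p = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    bce = -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
    return float((bce * mask).sum() / max(mask.sum(), 1))
```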
Training (train.py):
- The Trainer class orchestrates the training process, handling data loading, preprocessing, optimization, and callbacks.
- A BalancedMiniBatchLoader addresses class imbalance, generating mini-batches with stratified sampling.
- Learning-rate schedules, such as WarmUp and PolynomialDecay, are configured via get_lr_schedule. Optimizers like Adam or AdamW (from tensorflow_addons.optimizers) are supported.
- Custom callbacks (AdditionalResultLogger, ControlledStoppingCheckpointCallback, SyncingTensorBoard, GradientLoggingTensorBoard) handle logging, checkpointing, and early stopping.
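The stratified sampling performed by the BalancedMiniBatchLoader can be illustrated with a small pure-Python sketch (names and behavior here are assumptions; the real loader operates on TensorFlow datasets):

```python
import random
from collections import defaultdict

def balanced_minibatches(examples, labels, batch_size, seed=0):
    """Yield mini-batches with equal counts per class.

    Each batch draws batch_size // n_classes examples per class,
    sampling with replacement so minority classes are never exhausted.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append(x)
    classes = sorted(by_class)
    per_class = batch_size // len(classes)
    while True:
        batch = []
        for c in classes:
            batch.extend((x, c) for x in rng.choices(by_class[c], k=per_class))
        rng.shuffle(batch)  # avoid class-ordered batches
        yield batch
```

Even if, say, only 2% of examples are toxic, every batch the model sees is half positive and half negative, which is the imbalance correction the loader provides.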
Rescoring (rescoring.py):
- The score function enables inference on new dataframes using a trained model. It can load models directly or use a load_inference_func for saved models.

The nsfw sub-directory contains models specifically designed to detect NSFW content in both media and text.
Media model (nsfw_media.py):
- Input: tf.data.TFRecordDataset examples containing an embedding (fixed-length float32 features, e.g., 256-dimensional, likely from a CLIP-like model) and labels (int64). Preprocessing casts labels and optionally resamples to balance classes.
- Architecture: a Sequential Keras model with Dense layers, BatchNormalization, and activations such as tanh or gelu. The final layer is Dense(1, activation='sigmoid') for binary classification.
- Hyperparameter tuning: kerastuner.tuners.BayesianOptimization searches over activation, kernel_initializer, num_layers, and units.
- Training: tf.keras.optimizers.Adam with loss='binary_crossentropy'; tf.keras.callbacks.EarlyStopping monitors validation loss to prevent overfitting.
- Evaluation: tf.keras.metrics.PrecisionAtRecall and tf.keras.metrics.AUC(curve="PR"). Performance is visualized with Precision-Recall curves, often comparing different test sets (e.g., the "MU test set" and "sens prev test set").

Text model (nsfw_text.py):
- Input: tf.data.Dataset slices with one-hot encoded labels.
- Architecture: a TextEncoder (e.g., twitter_bert_base_en_uncased_augmented_mlm) generates embeddings from the processed tweet text; a Dense layer with softmax activation outputs 2-class predictions.
- Training: create_optimizer (e.g., adamw) with BinaryCrossentropy loss; tf.keras.metrics.AUC(curve='PR') is the primary metric. Callbacks include EarlyStopping, ModelCheckpoint, and TensorBoard.
- Evaluation: classification_report and precision_recall_curve demonstrate model effectiveness on processed text inputs.
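To make the evaluation metrics concrete, here is a NumPy sketch of what a PrecisionAtRecall-style metric reports (an approximation of tf.keras.metrics.PrecisionAtRecall's semantics, not its actual implementation):

```python
import numpy as np

def precision_at_recall(y_true, scores, target_recall):
    """Best precision achievable at recall >= target_recall.

    Sweeps every score threshold, keeps operating points whose recall
    meets the target, and returns the highest precision among them.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)          # true positives as the threshold lowers
    fp = np.cumsum(1 - y)      # false positives as the threshold lowers
    recall = tp / max(int(y.sum()), 1)
    precision = tp / (tp + fp)
    ok = recall >= target_recall
    return float(precision[ok].max()) if ok.any() else 0.0
```

This kind of metric suits NSFW filtering, where a minimum recall (catching most unsafe media) is fixed first and precision is then maximized at that operating point.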