---
language:
- en
license: cc-by-nc-sa-4.0
library_name: lightgbm
tags:
- biology
- chemistry
- drug-discovery
- pharmacology
- toxicology
- cytotoxicity
- promiscuity
- selectivity
- safety
- polypharmacology
- logistic-regression
- lightgbm
- gradient-boosting
- shap
- rdkit
- molecular-fingerprints
- morgan-fingerprints
- cubic-regression
- binary-classification
- tabular-classification
- sklearn
- statsmodels
- research
datasets:
- eve-bio/drug-target-activity
- pageman/discovery2-results
metrics:
- roc_auc
- accuracy
# base_model: null  # Not applicable - trained from scratch
model-index:
- name: discovery2-cytotoxicity-cubic-model
  results:
  - task:
      type: tabular-classification
      name: Cytotoxicity Prediction from Promiscuity
    dataset:
      type: pageman/discovery2-results
      name: Discovery 2 Promiscuity Scores
      split: promiscuity_scores
    metrics:
    - type: log_likelihood
      value: -313.25
      name: Log-Likelihood
    - type: aic
      value: 634.50
      name: AIC
    - type: other
      value: 77.4
      name: 50% Cytotoxicity Threshold (hits)
    source:
      name: Discovery 2 Study
      url: https://huggingface.co/datasets/pageman/discovery2-results
- name: discovery2-cytotoxicity-lightgbm
  results:
  - task:
      type: tabular-classification
      name: Cytotoxicity Prediction from Molecular Structure
    dataset:
      type: pageman/discovery2-results
      name: Discovery 2 Molecular Fingerprints
      split: train
    metrics:
    - type: roc_auc
      value: 0.781
      name: Cross-Validation ROC-AUC
      config: 5-fold
    - type: roc_auc
      value: 0.045
      name: ROC-AUC Std Dev
    source:
      name: Discovery 2 Study
      url: https://huggingface.co/datasets/pageman/discovery2-results
---

# Discovery 2: Cytotoxicity Prediction Models

Pre-trained models for predicting drug cytotoxicity from promiscuity and molecular structure.

## Models Overview

This repository contains trained models from the Discovery 2 study on selectivity-safety coupling. The models predict cytotoxicity risk from two kinds of input:

1. **Promiscuity** (the number of biological targets a compound hits)
2. **Molecular structure** (Morgan fingerprints)

## Model Files

### 1. Cubic Logistic Regression Models

**Main Model: `cubic_logistic_model.pkl`**

- Predicts cytotoxicity probability from the overall promiscuity score
- Uses cubic polynomial features (promiscuity, promiscuity², promiscuity³)
- **50% threshold**: 77 hits (77.4 before rounding)
- **Performance**: Significantly better than the linear model (p < 0.001)

**Class-Specific Models:**

- `kinase_cubic_model.pkl` - For kinase promiscuity (50% threshold: 25 hits)
- `nr_cubic_model.pkl` - For nuclear receptor promiscuity (50% threshold: 31 hits)
- `7tm_cubic_model.pkl` - For GPCR/7TM promiscuity (50% threshold: 63 hits)

### 2. LightGBM Structural Model

**File: `lightgbm_model.txt`**

- Predicts cytotoxicity from molecular fingerprints (2048-bit Morgan fingerprints, radius=2)
- **Cross-validation AUC**: 0.781 ± 0.045 (5-fold)
- Identifies structural alerts (toxicophores)
- Can be used for compounds without promiscuity data

### 3. Metadata Files

- `cubic_model_metadata.json` - Performance metrics for the main cubic model
- `class_models_metadata.json` - Thresholds for the class-specific models
- `lgb_model_metadata.json` - LightGBM model performance
- `feature_stats.json` - Feature statistics for normalization

## Usage

### Installation

```bash
pip install joblib lightgbm rdkit numpy pandas statsmodels
```

### Loading Models

```python
import joblib
import lightgbm as lgb
import json

# Load cubic logistic regression model
cubic_model = joblib.load('cubic_logistic_model.pkl')

# Load LightGBM model
lgb_model = lgb.Booster(model_file='lightgbm_model.txt')

# Load metadata
with open('cubic_model_metadata.json', 'r') as f:
    cubic_metadata = json.load(f)

with open('feature_stats.json', 'r') as f:
    feature_stats = json.load(f)
```

### Predicting from Promiscuity Score

```python
import numpy as np
from statsmodels.tools import add_constant

def predict_cytotoxicity_from_promiscuity(promiscuity_score, model):
    """
    Predict cytotoxicity probability from promiscuity score

    Args:
        promiscuity_score: Number of active assays (hits)
        model: Loaded cubic logistic regression model

    Returns:
        Probability of cytotoxicity (0-1)
    """
    # Create cubic features
    X = np.array([[promiscuity_score,
                   promiscuity_score**2,
                   promiscuity_score**3]], dtype=float)

    # has_constant='add' forces the intercept column. With the default
    # ('skip'), a single-row input looks constant in every column and
    # add_constant returns X unchanged, so the column count would not
    # match the fitted parameters.
    X_with_const = add_constant(X, has_constant='add')

    # Predict probability
    prob = model.predict(X_with_const)[0]
    return prob

# Example usage
promiscuity = 50
prob = predict_cytotoxicity_from_promiscuity(promiscuity, cubic_model)
print(f"Promiscuity: {promiscuity} hits")
print(f"Cytotoxicity probability: {prob:.2%}")
```

### Predicting from Molecular Structure

```python
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

def predict_cytotoxicity_from_smiles(smiles, model):
    """
    Predict cytotoxicity from SMILES string

    Args:
        smiles: SMILES representation of molecule
        model: Loaded LightGBM model

    Returns:
        Probability of cytotoxicity (0-1)
    """
    # Parse the SMILES string; RDKit returns None if it cannot be parsed
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError("Invalid SMILES")

    # Generate Morgan fingerprint (radius 2, 2048 bits, as in training)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
    fp_array = np.array(fp).reshape(1, -1)

    # Predict
    prob = model.predict(fp_array)[0]
    return prob

# Example usage
smiles = "CC(C)Cc1ccc(cc1)C(C)C(O)=O"  # Ibuprofen
prob = predict_cytotoxicity_from_smiles(smiles, lgb_model)
print(f"SMILES: {smiles}")
print(f"Cytotoxicity probability: {prob:.2%}")
```

### Class-Specific Predictions

```python
# Load class-specific model
kinase_model = joblib.load('kinase_cubic_model.pkl')

# Predict from kinase-specific promiscuity
# (reuses predict_cytotoxicity_from_promiscuity defined above)
kinase_hits = 20
prob = predict_cytotoxicity_from_promiscuity(kinase_hits, kinase_model)
print(f"Kinase promiscuity: {kinase_hits} hits")
print(f"Cytotoxicity probability: {prob:.2%}")
```

### Risk Interpretation

Based on the cubic model thresholds:

| Promiscuity Range | Risk Level | Cytotoxicity Probability |
|-------------------|------------|--------------------------|
| < 43 hits | **Low** | < 25% |
| 43-102 hits | **Moderate** | 25-75% |
| > 102 hits | **High** | > 75% |

**Class-Specific 50% Thresholds:**

- **Kinase**: 25 hits (most sensitive)
- **Nuclear Receptor**: 31 hits
- **7TM/GPCR**: 63 hits (least sensitive)

## Model Performance

### Cubic Logistic Regression

- **Log-likelihood**: -313.25
- **AIC**: 634.50
- **Likelihood ratio test**: p = 0.001 (significantly better than linear)
- **50% threshold**: 77.4 hits

### LightGBM Classifier

- **Cross-validation AUC**: 0.781 ± 0.045
- **Features**: 2048-bit Morgan fingerprints (radius=2)
- **Training set**: 1,382 compounds (13.2% cytotoxic)

## Key Findings

1. **Non-linear relationship**: Cytotoxicity risk accelerates rapidly above ~50 hits
2. **Strong predictive power**: Compounds with >50 hits are **29.4× more likely** to be cytotoxic
3. **Target class matters**: Kinase promiscuity is associated with cytotoxicity at lower hit counts than GPCR promiscuity (50% thresholds of 25 vs. 63 hits)
4. **Structural alerts**: Specific molecular substructures (toxicophores) are highly predictive

## Use Cases

- **Early drug discovery**: Screen compounds for cytotoxicity risk
- **Lead optimization**: Prioritize compounds with lower predicted risk
- **Polypharmacology assessment**: Evaluate safety implications of multi-target drugs
- **Structure-activity relationships**: Identify problematic structural features

## Limitations

- The models were trained on a specific compound library (1,397 FDA-approved small molecules)
- Cytotoxicity was measured by cell viability assays, which may not capture all toxicity mechanisms
- The promiscuity-based models require activity data across multiple targets
- The structure-based model is limited to compounds within the chemical space of the training data

## Citation

If you use these models in your research, please cite:

```
Discovery 2: Cytotoxicity Prediction Models
Models: https://huggingface.co/pageman/discovery2-cytotoxicity-models
Dataset: https://huggingface.co/datasets/pageman/discovery2-results
```

## Related Resources

- **Dataset Repository**: [pageman/discovery2-results](https://huggingface.co/datasets/pageman/discovery2-results) - Full analysis, code, and visualizations
- **Source Data**: [eve-bio/drug-target-activity](https://huggingface.co/datasets/eve-bio/drug-target-activity) - Raw drug-target activity data

## License

These models are provided for research purposes under the CC-BY-NC-SA-4.0 license. Please check with the original data sources for their licensing terms.

## Contact

For questions or issues, please open a discussion on this repository.
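
## Appendix: Recovering Thresholds from Cubic Coefficients

The 50% threshold and the 25%/75% boundaries in the risk table are presumably the promiscuity values at which the fitted cubic logit equals logit(p). The sketch below shows one way such crossings could be recovered numerically. It assumes the coefficients are ordered `[const, x, x², x³]`, which is the order `add_constant` produces by default. The function name `hits_at_probability` and the demo coefficients are hypothetical and chosen only for illustration; they are not the fitted values stored in `cubic_logistic_model.pkl`.

```python
import numpy as np

def hits_at_probability(params, p=0.5):
    """Return the non-negative real x where a cubic logistic curve reaches p.

    params is assumed to be ordered [const, x, x^2, x^3].
    """
    b0, b1, b2, b3 = (float(v) for v in params)
    target = np.log(p / (1.0 - p))  # logit(p); equals 0 when p = 0.5

    # Solve b3*x^3 + b2*x^2 + b1*x + (b0 - target) = 0.
    # np.roots expects the highest-degree coefficient first.
    roots = np.roots([b3, b2, b1, b0 - target])

    # Keep real, non-negative roots only (hit counts cannot be negative).
    real = roots[np.abs(roots.imag) < 1e-9].real
    return np.sort(real[real >= 0])

# Made-up coefficients for illustration, NOT the fitted model:
# logit(p) = 0.01 * (x - 10) * (x^2 + 1), which crosses 0 only at x = 10.
demo_params = [-0.1, 0.01, -0.1, 0.01]
print(hits_at_probability(demo_params, p=0.5))   # one crossing, near x = 10
print(hits_at_probability(demo_params, p=0.75))  # crossing for the 75% level
```

If the loaded model is a statsmodels results object, its `params` attribute could likely be passed in directly, for example `hits_at_probability(cubic_model.params)`. That is an assumption about how the pickle was saved. A cubic can cross a given probability up to three times, so the function returns every non-negative crossing rather than a single value.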