metadata
language:
- en
license: cc-by-nc-sa-4.0
library_name: lightgbm
tags:
- biology
- chemistry
- drug-discovery
- pharmacology
- toxicology
- cytotoxicity
- promiscuity
- selectivity
- safety
- polypharmacology
- logistic-regression
- lightgbm
- gradient-boosting
- shap
- rdkit
- molecular-fingerprints
- morgan-fingerprints
- cubic-regression
- binary-classification
- tabular-classification
- sklearn
- statsmodels
- research
datasets:
- eve-bio/drug-target-activity
- pageman/discovery2-results
metrics:
- roc_auc
- accuracy
model-index:
- name: discovery2-cytotoxicity-cubic-model
  results:
  - task:
      type: tabular-classification
      name: Cytotoxicity Prediction from Promiscuity
    dataset:
      type: pageman/discovery2-results
      name: Discovery 2 Promiscuity Scores
      split: promiscuity_scores
    metrics:
    - type: log_likelihood
      value: -313.25
      name: Log-Likelihood
    - type: aic
      value: 634.5
      name: AIC
    - type: other
      value: 77.4
      name: 50% Cytotoxicity Threshold (hits)
    source:
      name: Discovery 2 Study
      url: https://huggingface.co/datasets/pageman/discovery2-results
- name: discovery2-cytotoxicity-lightgbm
  results:
  - task:
      type: tabular-classification
      name: Cytotoxicity Prediction from Molecular Structure
    dataset:
      type: pageman/discovery2-results
      name: Discovery 2 Molecular Fingerprints
      split: train
    metrics:
    - type: roc_auc
      value: 0.781
      name: Cross-Validation ROC-AUC
      config: 5-fold
    - type: roc_auc
      value: 0.045
      name: ROC-AUC Std Dev
    source:
      name: Discovery 2 Study
      url: https://huggingface.co/datasets/pageman/discovery2-results
Discovery 2: Cytotoxicity Prediction Models
Pre-trained models for predicting drug cytotoxicity based on promiscuity and molecular structure.
Models Overview
This repository contains trained models from the Discovery 2 study on selectivity-safety coupling. The models predict cytotoxicity risk based on:
- Promiscuity (number of biological targets a compound hits)
- Molecular structure (Morgan fingerprints)
Model Files
1. Cubic Logistic Regression Models
Main Model: cubic_logistic_model.pkl
- Predicts cytotoxicity probability from overall promiscuity score
- Uses cubic polynomial features (promiscuity, promiscuity², promiscuity³)
- 50% threshold: 77 hits
- Performance: Significantly better fit than the linear model (likelihood ratio test, p ≤ 0.001)
Class-Specific Models:
- kinase_cubic_model.pkl - For kinase promiscuity (50% threshold: 25 hits)
- nr_cubic_model.pkl - For nuclear receptor promiscuity (50% threshold: 31 hits)
- 7tm_cubic_model.pkl - For GPCR/7TM promiscuity (50% threshold: 63 hits)
2. LightGBM Structural Model
File: lightgbm_model.txt
- Predicts cytotoxicity from molecular fingerprints (2048-bit Morgan fingerprints, radius=2)
- Cross-validation AUC: 0.781 ± 0.045
- Identifies structural alerts (toxicophores)
- Can be used for compounds without promiscuity data
3. Metadata Files
- cubic_model_metadata.json - Performance metrics for main cubic model
- class_models_metadata.json - Thresholds for class-specific models
- lgb_model_metadata.json - LightGBM model performance
- feature_stats.json - Feature statistics for normalization
Usage
Installation
pip install joblib lightgbm rdkit numpy pandas statsmodels
Loading Models
import joblib
import lightgbm as lgb
import json
# Load cubic logistic regression model
cubic_model = joblib.load('cubic_logistic_model.pkl')
# Load LightGBM model
lgb_model = lgb.Booster(model_file='lightgbm_model.txt')
# Load metadata
with open('cubic_model_metadata.json', 'r') as f:
    cubic_metadata = json.load(f)
with open('feature_stats.json', 'r') as f:
    feature_stats = json.load(f)
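If the files are not already on disk, they can be fetched from the Hub first. The sketch below assumes the `huggingface_hub` package is installed (it is not part of the install line above) and that the repository uses the file names listed under Model Files; `download_model_files` is a hypothetical helper name. Only unpickle `.pkl` files from sources you trust.

```python
# File names as listed under "Model Files"; adjust if the repository layout differs.
REPO_ID = "pageman/discovery2-cytotoxicity-models"
MODEL_FILES = [
    "cubic_logistic_model.pkl",
    "kinase_cubic_model.pkl",
    "nr_cubic_model.pkl",
    "7tm_cubic_model.pkl",
    "lightgbm_model.txt",
    "cubic_model_metadata.json",
    "class_models_metadata.json",
    "lgb_model_metadata.json",
    "feature_stats.json",
]


def download_model_files(repo_id=REPO_ID, filenames=MODEL_FILES):
    """Download each file from the Hub and return {filename: local_path}."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return {name: hf_hub_download(repo_id=repo_id, filename=name) for name in filenames}


# Example (requires network access):
# paths = download_model_files()
# cubic_model = joblib.load(paths["cubic_logistic_model.pkl"])
```

The returned paths point into the local Hub cache and can be passed directly to `joblib.load`, `lgb.Booster(model_file=...)`, or `open`.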
Predicting from Promiscuity Score
import numpy as np
from statsmodels.tools import add_constant
def predict_cytotoxicity_from_promiscuity(promiscuity_score, model):
    """
    Predict cytotoxicity probability from promiscuity score
    Args:
        promiscuity_score: Number of active assays (hits)
        model: Loaded cubic logistic regression model
    Returns:
        Probability of cytotoxicity (0-1)
    """
    # Create cubic features
    X = np.array([[promiscuity_score,
                   promiscuity_score**2,
                   promiscuity_score**3]], dtype=float)
    # has_constant='add' forces the intercept column; with a single row the
    # default ('skip') sees every column as constant and adds nothing
    X_with_const = add_constant(X, has_constant='add')
    # Predict probability
    prob = model.predict(X_with_const)[0]
    return prob
# Example usage
promiscuity = 50
prob = predict_cytotoxicity_from_promiscuity(promiscuity, cubic_model)
print(f"Promiscuity: {promiscuity} hits")
print(f"Cytotoxicity probability: {prob:.2%}")
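To score many compounds at once, the design matrix can be built directly with NumPy. This is a sketch that assumes the model was fit with the intercept as the first column, which is what `add_constant` produces by default; `cubic_design_matrix` is a hypothetical helper name.

```python
import numpy as np


def cubic_design_matrix(promiscuity_scores):
    """Return an (n, 4) matrix with columns [1, x, x^2, x^3]."""
    x = np.asarray(promiscuity_scores, dtype=float).reshape(-1)
    # An explicit column of ones avoids relying on add_constant's
    # constant-column detection.
    return np.column_stack([np.ones_like(x), x, x**2, x**3])


X = cubic_design_matrix([0, 10, 50, 100])
print(X.shape)  # (4, 4)

# With a loaded model, this would give one probability per compound:
# probs = cubic_model.predict(X)
```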
Predicting from Molecular Structure
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np
def predict_cytotoxicity_from_smiles(smiles, model):
    """
    Predict cytotoxicity from SMILES string
    Args:
        smiles: SMILES representation of molecule
        model: Loaded LightGBM model
    Returns:
        Probability of cytotoxicity (0-1)
    """
    # Generate Morgan fingerprint
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError("Invalid SMILES")
    # Note: recent RDKit releases deprecate this call in favor of
    # rdFingerprintGenerator.GetMorganGenerator
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
    fp_array = np.array(fp).reshape(1, -1)
    # Predict
    prob = model.predict(fp_array)[0]
    return prob
# Example usage
smiles = "CC(C)Cc1ccc(cc1)C(C)C(O)=O" # Ibuprofen
prob = predict_cytotoxicity_from_smiles(smiles, lgb_model)
print(f"SMILES: {smiles}")
print(f"Cytotoxicity probability: {prob:.2%}")
Class-Specific Predictions
# Load class-specific model
kinase_model = joblib.load('kinase_cubic_model.pkl')
# Predict from kinase-specific promiscuity
kinase_hits = 20
prob = predict_cytotoxicity_from_promiscuity(kinase_hits, kinase_model)
print(f"Kinase promiscuity: {kinase_hits} hits")
print(f"Cytotoxicity probability: {prob:.2%}")
Risk Interpretation
Based on the cubic model thresholds:
| Promiscuity Range | Risk Level | Cytotoxicity Probability |
|---|---|---|
| < 43 hits | Low | < 25% |
| 43-102 hits | Moderate | 25-75% |
| > 102 hits | High | > 75% |
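The table can also be applied programmatically. The helper below is an illustrative sketch (the name `risk_level` is hypothetical) that uses the overall-promiscuity cut-offs exactly as tabulated.

```python
def risk_level(hits):
    """Map an overall promiscuity score (hits) to the risk band in the table."""
    if hits < 0:
        raise ValueError("hits must be non-negative")
    if hits < 43:
        return "Low"       # predicted probability < 25%
    if hits <= 102:
        return "Moderate"  # predicted probability 25-75%
    return "High"          # predicted probability > 75%


for h in (10, 43, 102, 150):
    print(h, risk_level(h))
```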
Class-Specific 50% Thresholds:
- Kinase: 25 hits (most sensitive)
- Nuclear Receptor: 31 hits
- 7TM/GPCR: 63 hits (least sensitive)
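A 50% threshold is the promiscuity value at which the predicted probability crosses 0.5, that is, where the cubic log-odds equal zero. The sketch below finds that point by bisection. The coefficients are hypothetical placeholders chosen so that the crossing falls at 50 hits; they are not the fitted values. In practice the coefficients would come from the loaded model (for example its `params` attribute, if it is a statsmodels results object).

```python
import math

# Hypothetical coefficients [b0, b1, b2, b3], for illustration only.
COEFFS = [-1.0, 0.02, -5e-5, 1e-6]


def cytotoxicity_probability(hits, coeffs=COEFFS):
    """Logistic probability from a cubic polynomial in the hit count."""
    b0, b1, b2, b3 = coeffs
    log_odds = b0 + b1 * hits + b2 * hits**2 + b3 * hits**3
    return 1.0 / (1.0 + math.exp(-log_odds))


def find_threshold(target=0.5, lo=0.0, hi=200.0, tol=1e-9, coeffs=COEFFS):
    """Bisection for the hit count where the probability equals target.

    Assumes the probability increases monotonically on [lo, hi].
    """
    p_lo = cytotoxicity_probability(lo, coeffs)
    p_hi = cytotoxicity_probability(hi, coeffs)
    if not p_lo <= target <= p_hi:
        raise ValueError("target is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if cytotoxicity_probability(mid, coeffs) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


threshold = find_threshold()
print(f"50% threshold: {threshold:.1f} hits")
```

Passing `target=0.25` or `target=0.75` gives the band edges for a risk table in the same way, provided the target is bracketed by the search interval.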
Model Performance
Cubic Logistic Regression
- Log-likelihood: -313.25
- AIC: 634.50
- Likelihood ratio test: p = 0.001 (significantly better than linear)
- 50% threshold: 77.4 hits
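The reported AIC agrees with the log-likelihood under the standard definition AIC = 2k - 2 ln L when k = 4. That parameter count (intercept plus the three polynomial terms) is inferred here from the cubic feature set, not stated in the metadata. A quick check:

```python
log_likelihood = -313.25
n_params = 4  # assumed: intercept + promiscuity, promiscuity^2, promiscuity^3

aic = 2 * n_params - 2 * log_likelihood
print(aic)  # 634.5
```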
LightGBM Classifier
- Cross-validation AUC: 0.781 ± 0.045
- Features: 2048-bit Morgan fingerprints (radius=2)
- Training set: 1,382 compounds (13.2% cytotoxic)
Key Findings
- Non-linear relationship: Cytotoxicity risk accelerates rapidly above ~50 hits
- Strong predictive power: Compounds with >50 hits are 29.4× more likely to be cytotoxic
- Target class matters: Kinase promiscuity is more dangerous than GPCR promiscuity
- Structural alerts: Specific molecular substructures (toxicophores) are highly predictive
Use Cases
- Early drug discovery: Screen compounds for cytotoxicity risk
- Lead optimization: Prioritize compounds with lower predicted risk
- Polypharmacology assessment: Evaluate safety implications of multi-target drugs
- Structure-activity relationships: Identify problematic structural features
Limitations
- Models were trained on a specific compound library (1,397 FDA-approved small molecules)
- Cytotoxicity measured by cell viability assays (may not capture all toxicity mechanisms)
- Promiscuity-based models require activity data across multiple targets
- Structure-based model limited to compounds within chemical space of training data
Citation
If you use these models in your research, please cite:
Discovery 2: Cytotoxicity Prediction Models
Models: https://huggingface.co/pageman/discovery2-cytotoxicity-models
Dataset: https://huggingface.co/datasets/pageman/discovery2-results
Related Resources
- Dataset Repository: pageman/discovery2-results - Full analysis, code, and visualizations
- Source Data: eve-bio/drug-target-activity - Raw drug-target activity data
License
These models are provided for research purposes under the CC-BY-NC-SA-4.0 license. Check the original data sources for their licensing terms.
Contact
For questions or issues, please open a discussion on this repository.