guyyanai committed (verified)
Commit da7acdf · 1 Parent(s): 506bf80

Update README.md

Files changed (1)
  1. README.md +94 -6
README.md CHANGED
@@ -1,10 +1,98 @@
  ---
  license: apache-2.0
  base_model:
- - facebook/esm2_t12_35M_UR50D
- - EvolutionaryScale/esm3-sm-open-v1
  tags:
- - biology
- - contrastive-learning
- - proteins
- ---

  ---
  license: apache-2.0
+ pipeline_tag: feature-extraction
+ library_name: pytorch
  base_model:
+ - facebook/esm2_t12_35M_UR50D
+ - EvolutionaryScale/esm3-sm-open-v1
  tags:
+ - biology
+ - bioinformatics
+ - protein
+ - protein-embeddings
+ - contrastive-learning
+ - multimodal
+ - structure
+ - sequence
+ - sequence-segments
+ - pytorch
+ ---
+
+ # CLSS (Contrastive Learning Sequence–Structure)
+
+ CLSS is a **self-supervised, two-tower contrastive model** that **co-embeds protein sequences and protein structures into a shared latent space**, enabling unified analysis of protein space across modalities.
+
+ **Links**
+ - Hugging Face model repo: https://huggingface.co/guyyanai/CLSS
+ - Code + examples (`clss-model`): https://github.com/guyyanai/CLSS
+ - Paper (bioRxiv): https://doi.org/10.1101/2025.09.05.674454
+ - Interactive CLSS viewer: https://gabiaxel.github.io/clss-viewer/
+
+ ---
+
+ ## Model description
+
+ ### Architecture (high level)
+
+ CLSS follows a **two-tower architecture**:
+
+ - **Sequence tower:** a trainable ESM2-like sequence encoder
+ - **Structure tower:** a frozen ESM3 structure encoder
+ - Each tower is followed by a lightweight **linear projection head** mapping into a shared embedding space, with **L2-normalized outputs**
+
+ The result is a pair of embeddings (sequence and structure) that live in the **same latent space**, making cosine similarity directly comparable across modalities.
+
+ The paper’s primary configuration uses **32-dimensional embeddings**, but multiple embedding sizes are provided in this repository.
+
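+ To make the cross-modal comparison concrete, here is a minimal, illustrative PyTorch sketch of the projection-and-normalize step. The encoder outputs, feature dimensions, and variable names below are assumptions for illustration only and are not the `clss-model` API:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ EMBED_DIM = 32  # paper-default embedding size; 8/16/64/128 checkpoints also exist
+
+ # Stand-ins for pooled tower outputs (the real towers are an ESM2-like
+ # sequence encoder and a frozen ESM3 structure encoder).
+ seq_features = torch.randn(4, 480)      # assumed pooled sequence features
+ struct_features = torch.randn(4, 1536)  # assumed pooled structure features
+
+ seq_head = torch.nn.Linear(480, EMBED_DIM)      # sequence projection head
+ struct_head = torch.nn.Linear(1536, EMBED_DIM)  # structure projection head
+
+ # L2-normalize so the dot product equals cosine similarity
+ z_seq = F.normalize(seq_head(seq_features), dim=-1)
+ z_struct = F.normalize(struct_head(struct_features), dim=-1)
+
+ # Because both modalities live in the same space, cosine similarity between
+ # a sequence embedding and a structure embedding is directly meaningful.
+ cross_modal_similarity = z_seq @ z_struct.T  # shape (4, 4)
+ ```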
+ ### Training objective
+
+ CLSS is trained with a **CLIP-style contrastive objective**, aligning:
+ - **random sequence segments**
+ - with their corresponding **full-domain protein structures**
+
+ **No** hierarchical labels (e.g. ECOD or CATH) are used during training; structural and evolutionary organization emerges implicitly.
+
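+ For intuition, here is a minimal sketch of a symmetric, CLIP-style contrastive loss over a batch of matched segment/structure embeddings. It illustrates the objective family only; it is not the actual training code, and the temperature value is an assumption:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def clip_style_loss(z_seq, z_struct, temperature=0.07):
+     """Symmetric InfoNCE loss: row i of z_seq matches row i of z_struct.
+
+     z_seq, z_struct: (batch, dim) L2-normalized embeddings.
+     """
+     logits = (z_seq @ z_struct.T) / temperature          # (batch, batch) similarity matrix
+     targets = torch.arange(z_seq.size(0), device=z_seq.device)
+     seq_to_struct = F.cross_entropy(logits, targets)     # pick the right structure for each segment
+     struct_to_seq = F.cross_entropy(logits.T, targets)   # pick the right segment for each structure
+     return 0.5 * (seq_to_struct + struct_to_seq)
+ ```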
+ ---
+
+ ## Files in this repository
+
+ This Hugging Face repository contains multiple PyTorch Lightning checkpoints, differing only in **embedding dimensionality** (a download sketch follows the list):
+
+ - `h8_r10.lckpt` → 8-dimensional embeddings
+ - `h16_r10.lckpt` → 16-dimensional embeddings
+ - `h32_r10.lckpt` → 32-dimensional embeddings (paper default)
+ - `h64_r10.lckpt` → 64-dimensional embeddings
+ - `h128_r10.lckpt` → 128-dimensional embeddings
+
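+ As an illustration, any of these files can be fetched with `huggingface_hub` and opened as a plain PyTorch checkpoint. The repo id and filename below come from this model card; the keys printed at the end are typical of Lightning checkpoints but are not guaranteed here:
+
+ ```python
+ # pip install huggingface_hub torch
+ import torch
+ from huggingface_hub import hf_hub_download
+
+ # Download the paper-default 32-dimensional checkpoint from this repository
+ ckpt_path = hf_hub_download(repo_id="guyyanai/CLSS", filename="h32_r10.lckpt")
+
+ # Lightning checkpoints are torch-serialized dicts; peek at their contents
+ ckpt = torch.load(ckpt_path, map_location="cpu")  # recent torch may need weights_only=False
+ print(list(ckpt.keys()))  # typically includes 'state_dict', 'hyper_parameters', ...
+ ```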
+ ---
+
+ ## How to use CLSS
+
+ CLSS is intended to be used via the **`clss-model` Python library**, which provides:
+
+ - Model loading from Lightning checkpoints (sketched below)
+ - End-to-end inference examples
+ - Scripts used for generating interactive protein space maps
+
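+ The actual classes and entry points are defined in the `clss-model` repository (https://github.com/guyyanai/CLSS), not in this model card. Purely as a hypothetical sketch of the generic PyTorch Lightning loading pattern these checkpoints follow, with `CLSSModel`, `embed_sequence`, and `embed_structure` as placeholder names rather than the library's real API:
+
+ ```python
+ # Hypothetical sketch -- see https://github.com/guyyanai/CLSS for the real API.
+ from huggingface_hub import hf_hub_download
+
+ ckpt_path = hf_hub_download(repo_id="guyyanai/CLSS", filename="h32_r10.lckpt")
+
+ # Lightning modules are usually restored with `load_from_checkpoint`;
+ # the class and embedding methods below are placeholders, not confirmed names:
+ #
+ # from clss_model import CLSSModel
+ # model = CLSSModel.load_from_checkpoint(ckpt_path, map_location="cpu").eval()
+ # z_seq = model.embed_sequence("MKTAYIAKQR...")     # placeholder method
+ # z_struct = model.embed_structure("domain.pdb")    # placeholder method
+ ```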
+ ---
+
+ ## License
+
+ The CLSS codebase is released under the **Apache 2.0 License**.
+ Please consult the repository for details on third-party model dependencies.
+
+ ---
+
+ ## Citation
+
+ If you use CLSS, please cite:
+
+ ```bibtex
+ @article{Yanai2025CLSS,
+   title   = {Contrastive learning unites sequence and structure in a global representation of protein space},
+   author  = {Yanai, Guy and Axel, Gabriel and Longo, Liam M. and Ben-Tal, Nir and Kolodny, Rachel},
+   journal = {bioRxiv},
+   year    = {2025},
+   doi     = {10.1101/2025.09.05.674454},
+   url     = {https://doi.org/10.1101/2025.09.05.674454}
+ }
+ ```