JiayouZhangGenbio committed on
Commit
51a9333
·
verified ·
1 Parent(s): a529d7a

Update README.md

  1. README.md +106 -2
README.md CHANGED
@@ -10,10 +10,114 @@ license: other
10
 
11
  # AIDO.StructureTokenizer
12
 
13
- AIDO.StructureTokenizer is a VQ-VAE model for protein structure prediction and tokenization.
14
 
15
  ## How to Use
16
- Please see `experiments/AIDO.StructureTokenizer` in [Model Generator](https://github.com/genbio-ai/modelgenerator)
17
 
18
  # Citation
19
  Please cite AIDO.StructureTokenizer using the following BibTeX code:
 
10
 
11
  # AIDO.StructureTokenizer
12
 
13
+ AIDO.StructureTokenizer is a VQ-VAE-based tokenizer designed for protein structure prediction and tokenization. It encodes amino-acid-agnostic backbone structures into discrete tokens and reconstructs the full atomic-level structures, including side chains. This tokenizer facilitates the integration of 3D protein structure data with sequence-based language models, enabling efficient and accurate multimodal protein modeling.
14
+
15
+ ## Model Description
16
+
17
+ TODO model figure
18
+
19
+ **AIDO.StructureTokenizer** is built on a Vector Quantized Variational Autoencoder (VQ-VAE) architecture with the following components:
20
+ - Equivariant Encoder: Encodes backbone structures into a latent space that maintains rotational and translational symmetries using the Equiformer architecture.
21
+ - Discrete Codebook: Maps continuous latent vectors into 512 discrete structural tokens.
22
+ - Invariant Decoder: Reconstructs full 3D structures, including side chains, from the structural tokens using an architecture adapted from ESMFold.
23
+
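The codebook step above can be illustrated with a minimal sketch. It assumes the usual VQ-VAE rule of picking the codebook entry nearest to each latent vector in Euclidean distance; this README does not state which distance the model uses, and the 4-entry, 2-dimensional codebook below is a toy stand-in for the real 512-entry one.

```python
import math

def quantize(latent, codebook):
    """Return the index of the codebook entry closest to `latent` (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(latent, codebook[i]))

# Toy codebook (assumption for illustration): 4 entries in 2 dimensions.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# One continuous latent vector per residue, as an encoder might produce.
latents = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]

# Each latent is replaced by the index of its nearest codebook entry: the structure token.
tokens = [quantize(z, codebook) for z in latents]
print(tokens)  # [0, 3, 2]
```

Each integer in `tokens` plays the role of one structural token; the decoder would look up the corresponding codebook vectors to reconstruct coordinates.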
24
+ The model balances reconstruction fidelity against structural locality, which makes it suitable for downstream tasks such as structure prediction, homology detection, and multimodal protein modeling.
25
+
26
+ ### Key Features
27
+
28
+ - Encoding Structures into Tokens (See [genbio-ai/AIDO.StructureEncoder](https://huggingface.co/genbio-ai/AIDO.StructureEncoder))
29
+ - Decoding Tokens into Structures (See [genbio-ai/AIDO.StructureDecoder](https://huggingface.co/genbio-ai/AIDO.StructureDecoder))
30
+ - Reconstructing Structures (See [below](#reconstructing-structures))
31
+ - Structure Prediction (See [this section](https://huggingface.co/genbio-ai/AIDO.Protein2StructureToken-16B/blob/main/README.md#structure-prediction) in genbio-ai/AIDO.Protein2StructureToken-16B)
32
+
33
+ ### Hyperparameters
34
+
35
+ TODO
36
+
37
+ ### Training details
38
+
39
+ TODO
40
+
41
+ ## Results
42
+
43
+ TODO
44
 
45
  ## How to Use
46
+ Please see `experiments/AIDO.StructureTokenizer` in [Model Generator](https://github.com/genbio-ai/modelgenerator) for more details.
47
+
48
+ ### Reconstructing Structures
49
+
50
+ #### Setup
51
+ Install [Model Generator](https://github.com/genbio-ai/modelgenerator)
52
+
53
+ #### Data preparation
54
+
55
+ To reproduce the reconstruction results in the paper, we provide a preprocessed CASP15 dataset at [genbio-ai/sample-structure-dataset](https://huggingface.co/datasets/genbio-ai/sample-structure-dataset). It can be downloaded with:
56
+ ```bash
57
+ huggingface-cli download genbio-ai/sample-structure-dataset --repo-type dataset --local-dir ./data/protstruct_sample_data/
58
+ ```
59
+
60
+ This dataset is derived from CASP15. For the original data, see:
61
+ - [CASP15 Prediction Center](https://predictioncenter.org/casp15/)
62
+ - [Bhattacharya-Lab/CASP15](https://github.com/Bhattacharya-Lab/CASP15)
63
+
64
+ The downloaded directory includes:
65
+ - A `registries` folder containing a CSV file with metadata such as filenames and PDB IDs.
66
+ - A `CASP15_merged` folder containing PDB files, where domains are merged in the same way as described in [Bhattacharya-Lab/CASP15](https://github.com/Bhattacharya-Lab/CASP15).
67
+
68
+ To use customized data, you can prepare a dataset with the following structure:
69
+ - A folder containing PDB files (supported formats: `cif.gz`, `cif`, `ent.gz`, `pdb`).
70
+
71
+ Then generate a registry file in CSV format with the following command:
72
+ ```bash
73
+ python experiments/AIDO.StructureTokenizer/register_dataset.py \
74
+ --folder_path /path/to/folder_path \
75
+ --format cif.gz \
76
+ --output_file /path/to/output_file.csv
77
+ ```
78
+
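Since the registration command takes a single `--format` value, it can help to check which of the supported formats a folder actually contains before running it. The following is a small sketch for that purpose; it is not part of Model Generator, and the helper name `count_by_format` is hypothetical.

```python
from collections import Counter
from pathlib import Path

# File formats listed as supported in this README.
SUPPORTED_FORMATS = ("cif.gz", "ent.gz", "cif", "pdb")

def count_by_format(folder):
    """Count files directly inside `folder` for each supported structure format."""
    counts = Counter()
    for path in Path(folder).iterdir():
        if not path.is_file():
            continue
        for fmt in SUPPORTED_FORMATS:
            if path.name.endswith("." + fmt):
                counts[fmt] += 1
                break  # a file is counted under at most one format
    return dict(counts)
```

For example, `count_by_format("/path/to/folder_path")` returns a dictionary such as `{"cif.gz": 120}`, which suggests the value to pass to `--format`.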
79
+ In the following steps, replace `folder_path` and `registry_path` with the paths to your structure folder and your registry file, respectively.
80
+
81
+ #### Running Encoding and Decoding Task
82
+
83
+ If you use the provided CASP15 dataset, you can run the combined encoding and decoding task using the following command:
84
+ ```bash
85
+ CUDA_VISIBLE_DEVICES=0 mgen predict --config=experiments/AIDO.StructureTokenizer/encode_decode.yaml
86
+ ```
87
+
88
+ If you use your own dataset, you need to update the `folder_path` and the `registry_path` in the `encode_decode.yaml` configuration file or override them when running the command. Example:
89
+ ```bash
90
+ CUDA_VISIBLE_DEVICES=0 mgen predict --config experiments/AIDO.StructureTokenizer/encode_decode.yaml \
91
+ --data.init_args.config.proteins_datasets_configs.name="your_dataset_name" \
92
+ --data.init_args.config.proteins_datasets_configs.registry_path="your_dataset_registry_path" \
93
+ --data.init_args.config.proteins_datasets_configs.folder_path="your_dataset_folder_path" \
94
+ --trainer.callbacks.dict_kwargs.output_dir="your_output_dir"
95
+ ```
96
+
97
+ The input and the output can be summarized as follows:
98
+
99
+ **Input:**
100
+ - The PDB files in the dataset folder.
101
+ - The registry file in CSV format indicating the metadata of the dataset.
102
+
103
+ **Output:**
104
+ - The decoded structures and their corresponding original structures are saved in the output directory specified in the configuration file (by default, `logs/protstruct_model/`).
105
+ - The decoded structures end with `output.pdb`.
106
+ - The original structures end with `input.pdb`.
107
+
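To collect the reconstructed and original files for comparison, the naming convention above can be used to pair them. This sketch assumes that the two files of a pair share the same filename prefix before `input.pdb` / `output.pdb` and sit in the same directory; the exact prefix format is not specified here, and `pair_structures` is a hypothetical helper.

```python
from pathlib import Path

def pair_structures(output_dir):
    """Return (original, decoded) path pairs found under `output_dir`."""
    pairs = []
    # Search recursively for decoded structures, then look for the matching original.
    for decoded in sorted(Path(output_dir).rglob("*output.pdb")):
        prefix = decoded.name[: -len("output.pdb")]
        original = decoded.with_name(prefix + "input.pdb")
        if original.exists():
            pairs.append((original, decoded))
    return pairs
```

Calling `pair_structures("logs/protstruct_model/")` after a run returns one tuple per reconstructed structure; decoded files without a matching original are skipped.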
108
+ **Notes:**
109
+ - Decoding the structures can take a long time, even on a GPU.
110
+ - Currently, this function only supports single GPU inference due to the file saving mechanism. We plan to support multi-GPU inference in the future.
111
+ - The reconstructed structures are aligned to the original structures using the Kabsch algorithm. This makes it easier to visualize and compare the structures.
112
+
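Because the reconstructed structures are already superimposed on the originals, a rough reconstruction error can be computed directly from the two files without a separate alignment step. The sketch below computes a Cα RMSD from PDB-format text using the standard fixed-column `ATOM` layout. It assumes both files list the same residues in the same order and contain no alternate locations; it is an illustration, not the evaluation code used for the paper.

```python
import math

def ca_coords(pdb_text):
    """Extract C-alpha coordinates from PDB-format text (fixed-column ATOM records)."""
    coords = []
    for line in pdb_text.splitlines():
        # Columns 13-16 hold the atom name; columns 31-54 hold x, y, z.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return coords

def ca_rmsd(pdb_a, pdb_b):
    """RMSD over C-alpha atoms of two pre-aligned structures with matching residues."""
    a, b = ca_coords(pdb_a), ca_coords(pdb_b)
    if not a or len(a) != len(b):
        raise ValueError("structures must have the same, non-zero number of CA atoms")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(squared / len(a))
```

Reading an `input.pdb` / `output.pdb` pair with `Path(...).read_text()` and passing both strings to `ca_rmsd` gives the error in ångströms.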
113
+ #### Visualizing the Reconstructed Structures
114
+
115
+ We use VS Code with the [Protein Viewer Extension](https://marketplace.visualstudio.com/items?itemName=ArianJamasb.protein-viewer) to visualize the protein structures. It is a beginner-friendly tool for VS Code users. You can also use any other protein structure viewer (e.g., PyMOL or ChimeraX), but this guide focuses on the extension.
116
+
117
+ After completing [Running Encoding and Decoding Task](#running-encoding-and-decoding-task), you can find the decoded structures and their corresponding original structures in the output directory. Visualize them as follows:
118
+ - Find the desired `output.pdb` and `input.pdb` pair in the side panel. Select both files while holding the `Ctrl` key (on macOS, the `Cmd` key). ![Select Files](./assets/images/structure_tokenizer/select_files.png)
119
+ - Right-click on the selected files and choose "Launch Protein Viewer". ![Launch Protein Viewer from File(s)](./assets/images/structure_tokenizer/launch_protein_viewer.png)
120
+ - A new tab will open with the protein structures displayed. You can interact with the structures using the Protein Viewer extension. Because the reconstructed structures have been aligned to the original structures using the Kabsch algorithm, the two structures should appear superimposed as shown here, with each color corresponding to a different file. ![Visualize Reconstruction](./assets/images/structure_tokenizer/visualize_reconstruction.png)
121
 
122
  # Citation
123
  Please cite AIDO.StructureTokenizer using the following BibTeX code: