Update `README.md` with metadata
This PR updates the metadata in the `README.md` file to include the `license` (MIT, inherited from https://github.com/microsoft/dayhoff), `pipeline_tag`, `library_name`, `tags`, and `datasets` fields, all of which help with visibility, transparency, and discoverability on the Hugging Face Hub.
Note that a paper is linked automatically when its URL is included in the `README.md`, but at the moment only arXiv is supported. Since this paper was published on bioRxiv, it won't be linked yet; it would still be great to include a reference to the paper somewhere in the README.
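For anyone who wants to reproduce this kind of change locally, here is a minimal sketch of swapping a README's YAML front matter for the fields this PR adds. It uses only the Python standard library; the names `METADATA_LINES`, `FRONT_MATTER`, and `set_front_matter` are hypothetical helpers for illustration and are not part of this PR or of any Hugging Face library.

```python
import re

# Metadata fields as they appear in this PR's diff.
METADATA_LINES = [
    "license: mit",
    "pipeline_tag: text-generation",
    "library_name: transformers",
    "tags:",
    "- protein-generation",
    "- jamba",
    "datasets:",
    "- microsoft/Dayhoff",
]

# Matches a leading front matter block: a `---` line, any content,
# and the next line that consists of `---`.
FRONT_MATTER = re.compile(r"\A---\n.*?^---\n", re.DOTALL | re.MULTILINE)


def set_front_matter(readme_text, metadata_lines):
    """Return readme_text with its front matter replaced by metadata_lines.

    Hypothetical helper: if the README has no front matter, the new
    block is simply prepended.
    """
    # Drop the existing front matter block, if there is one.
    body = FRONT_MATTER.sub("", readme_text, count=1)
    header = "---\n" + "\n".join(metadata_lines) + "\n---\n"
    # Keep exactly one blank line between the front matter and the body.
    return header + "\n" + body.lstrip("\n")
```

Calling `set_front_matter(open("README.md").read(), METADATA_LINES)` returns the updated text without writing anything to disk, so the result can be reviewed before committing. Running it a second time on its own output leaves the text unchanged.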
README.md
CHANGED
@@ -1,6 +1,14 @@
 ---
-
+license: mit
+pipeline_tag: text-generation
+library_name: transformers
+tags:
+- protein-generation
+- jamba
+datasets:
+- microsoft/Dayhoff
 ---
+
 # Model Card for Dayhoff
 
 Dayhoff is an Atlas of both protein sequence data and generative language models — a centralized resource that brings together 3.34 billion protein sequences across 1.7 billion clusters of metagenomic and natural protein sequences (GigaRef), 46 million structure-derived synthetic sequences (BackboneRef), and 16 million multiple sequence alignments (OpenProteinSet). These models can natively predict zero-shot mutation effects on fitness, scaffold structural motifs by conditioning on evolutionary or structural context, and perform guided generation of novel proteins within specified families. Learning from metagenomic and structure-based synthetic data from the Dayhoff Atlas increased the cellular expression rates of generated proteins, highlighting the real-world value of expanding the scale, diversity, and novelty of protein sequence data.