---
license: apache-2.0
language:
- en
library_name: gliner
pipeline_tag: token-classification
tags:
- NER
- GLiNER
- information extraction
- encoder
- entity recognition
- modernbert
- bi-encoder
- scalable-ner
- zero-shot-ner
base_model:
- jhu-clsp/ettin-encoder-32m
- jhu-clsp/ettin-encoder-68m
- jhu-clsp/ettin-encoder-150m
- jhu-clsp/ettin-encoder-400m
- sentence-transformers/all-MiniLM-L6-v2
- sentence-transformers/all-MiniLM-L12-v2
- BAAI/bge-small-en-v1.5
- BAAI/bge-base-en-v1.5
---

# GLiNER-bi-Encoder: Scalable Zero-Shot Named Entity Recognition

![image](https://cdn-uploads.huggingface.co/production/uploads/6405f62ba577649430be5124/2PPPxCfKpt9eS9D1_anf8.png)

## About

GLiNER-bi-Encoder is a novel architecture for Named Entity Recognition (NER) that combines zero-shot flexibility with industrial-scale efficiency. Unlike the original GLiNER, which uses joint encoding, the bi-encoder design **decouples text and entity-type encoding**, enabling the recognition of thousands of entity types simultaneously with minimal computational overhead.

### Key Advantages

**Massive Scalability**: Handle 1000+ entity types with near-constant inference speed when using pre-computed label embeddings

**130× Faster**: Up to 130× throughput improvement compared to uni-encoder approaches at 1024 entity types

**State-of-the-Art Performance**: Achieves 61.5% Micro-F1 on the CrossNER benchmark in the zero-shot setting

**Efficient Caching**: Pre-compute and cache entity type embeddings for instant reuse across millions of documents

## Architecture

The bi-encoder architecture employs two specialized, independent transformers:
- **Text Encoder**: Processes input sequences using ModernBERT-based encoders (Ettin family)
- **Label Encoder**: Embeds entity type descriptions using specialized sentence transformers (BGE, MiniLM)

This separation removes the context-window bottleneck and enables:
- Pre-computation of entity type embeddings
- Constant memory usage for text encoding regardless of entity count
- Efficient nearest-neighbor search for entity matching

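The decoupling can be illustrated with a toy numeric sketch: span and label embeddings come from independent functions, and a span/label pair is scored with a sigmoid over their dot product. The two `encode_*` functions below are hypothetical stand-ins, not the model's actual encoders.

```python
import math

# Toy stand-ins for the two independent encoders (assumption: the real model
# uses ModernBERT / sentence-transformer encoders producing dense vectors).
def encode_span(span_tokens):          # hypothetical text-side encoder
    return [len(span_tokens), sum(len(t) for t in span_tokens) / 10.0]

def encode_label(label):               # hypothetical label-side encoder
    return [1.0, len(label) / 10.0]

def score(span_emb, label_emb):
    """Span/label match score: dot product squashed by a sigmoid."""
    dot = sum(s * l for s, l in zip(span_emb, label_emb))
    return 1.0 / (1.0 + math.exp(-dot))

# Because the encoders are independent, label embeddings are computed once
# and reused for every candidate span in every document.
label_embs = {lab: encode_label(lab) for lab in ["person", "award", "date"]}
span_emb = encode_span(["Cristiano", "Ronaldo"])
scores = {lab: score(span_emb, emb) for lab, emb in label_embs.items()}
```

The key property is that `encode_label` never sees the text, so growing the label set adds only cheap dot products, not extra transformer passes over the input.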
## Model Variants

GLiNER-bi-V2 Models:

| Model name | Params | Text Encoder | Label Encoder | Avg. CrossNER | Inference Speed (H100, ex/s) | Inference Speed (pre-computed, ex/s) |
|------------|--------|--------------|---------------|---------------|------------------------------|--------------------------------------|
| [gliner-bi-edge-v2.0](https://huggingface.co/knowledgator/gliner-bi-edge-v2.0) | 60M | [ettin-encoder-32m](https://huggingface.co/jhu-clsp/ettin-encoder-32m) | [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | 54.0% | 13.64 | 24.62 |
| [gliner-bi-small-v2.0](https://huggingface.co/knowledgator/gliner-bi-small-v2.0) | 108M | [ettin-encoder-68m](https://huggingface.co/jhu-clsp/ettin-encoder-68m) | [all-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) | 57.2% | 7.99 | 15.22 |
| [gliner-bi-base-v2.0](https://huggingface.co/knowledgator/gliner-bi-base-v2.0) | 194M | [ettin-encoder-150m](https://huggingface.co/jhu-clsp/ettin-encoder-150m) | [bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | 60.3% | 5.91 | 9.51 |
| [gliner-bi-large-v2.0](https://huggingface.co/knowledgator/gliner-bi-large-v2.0) | 530M | [ettin-encoder-400m](https://huggingface.co/jhu-clsp/ettin-encoder-400m) | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 61.5% | 2.68 | 3.60 |

**Recommendation**: The **base variant (194M)** achieves 98% of large-model performance while operating 2.6× faster, making it optimal for most production scenarios.

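Both figures follow directly from the tables above (CrossNER averages and pre-computed throughput); a quick check of the arithmetic:

```python
# Figures taken from the model-variant table above.
base_f1, large_f1 = 60.3, 61.5          # Avg. CrossNER, %
base_sp, large_sp = 9.51, 3.60          # pre-computed inference speed, ex/s

relative_quality = base_f1 / large_f1   # fraction of large-model F1 retained
speedup = base_sp / large_sp            # base vs. large throughput

print(f"{relative_quality:.1%} of large F1, {speedup:.1f}x faster")
# prints: 98.0% of large F1, 2.6x faster
```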
## Installation & Usage

### Installation
```bash
pip install gliner -U
pip install "transformers>=4.48.0"
```

For flash attention support:
```bash
pip install flash-attn triton
```

### Basic Usage
```python
from gliner import GLiNER

# Load model
model = GLiNER.from_pretrained("knowledgator/gliner-bi-base-v2.0")

text = """
Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for and captains both Saudi Pro League club Al Nassr and the Portugal national team. Widely regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or awards, a record three UEFA Men's Player of the Year Awards, and four European Golden Shoes, the most by a European player.
"""

labels = ["person", "award", "date", "competitions", "teams"]

entities = model.predict_entities(text, labels, threshold=0.3)

for entity in entities:
    print(entity["text"], "=>", entity["label"])
```

**Output:**
```
Cristiano Ronaldo dos Santos Aveiro => person
5 February 1985 => date
Al Nassr => teams
Portugal national team => teams
Ballon d'Or => award
UEFA Men's Player of the Year Awards => award
European Golden Shoes => award
```

### Advanced Usage: Pre-computing Entity Embeddings

For scenarios with large, static entity taxonomies (hundreds to millions of types):
```python
from gliner import GLiNER

model = GLiNER.from_pretrained("knowledgator/gliner-bi-base-v2.0")

# Pre-compute embeddings for thousands of entity types
entity_types = ["person", "organization", "location", ...]  # Can be thousands
texts = ["Your documents here", ...]

# Encode entity types once
entity_embeddings = model.encode_labels(entity_types, batch_size=8)

# Use pre-computed embeddings for fast inference
outputs = model.batch_predict_with_embeds(texts, entity_embeddings, entity_types)
```

This approach provides:
- **130× speedup** at 1024 entity types
- **Constant inference time** regardless of entity count
- **Efficient caching** for repeated use

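The caching pattern can be sketched as a small persistent store keyed by label text, so a large taxonomy is encoded once and reused across runs. Here `toy_encode` is a hypothetical stand-in for `model.encode_labels`; the cache helper itself is a sketch, not part of the library.

```python
import os
import pickle
import tempfile
from pathlib import Path

def cached_label_embeddings(labels, encode_fn, cache_path):
    """Return embeddings for `labels`, encoding only labels not yet cached."""
    path = Path(cache_path)
    cache = pickle.loads(path.read_bytes()) if path.exists() else {}
    missing = [lab for lab in labels if lab not in cache]
    if missing:
        # Encode unseen labels once, then persist the updated cache.
        for lab, emb in zip(missing, encode_fn(missing)):
            cache[lab] = emb
        path.write_bytes(pickle.dumps(cache))
    return [cache[lab] for lab in labels]

calls = []
def toy_encode(labels):  # hypothetical stand-in for model.encode_labels
    calls.append(len(labels))
    return [[float(len(lab))] for lab in labels]

cache_file = os.path.join(tempfile.mkdtemp(), "label_embs.pkl")
embs = cached_label_embeddings(["person", "location"], toy_encode, cache_file)
embs = cached_label_embeddings(["person", "location"], toy_encode, cache_file)  # cache hit, no re-encoding
```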
### Flash Attention & Extended Context
```python
model = GLiNER.from_pretrained(
    "knowledgator/gliner-bi-base-v2.0",
    _attn_implementation='flash_attention_2',
    max_len=2048
).to('cuda:0')
```

### Zero-Shot NER Performance

Comprehensive evaluation across 19 diverse NER datasets:

| Dataset | gliner-bi-edge-v2.0 | gliner-bi-small-v2.0 | gliner-bi-base-v2.0 | gliner-bi-large-v2.0 |
|---------|---------------------|----------------------|---------------------|----------------------|
| ACE 2004 | 26.4% | 27.5% | 28.9% | 31.9% |
| ACE 2005 | 26.2% | 28.1% | 30.0% | 31.4% |
| AnatEM | 39.1% | 43.6% | 35.4% | 39.5% |
| Broad Tweet Corpus | 70.0% | 71.7% | 72.1% | 70.9% |
| CoNLL 2003 | 61.6% | 64.2% | 65.6% | 66.5% |
| FabNER | 22.4% | 23.2% | 24.3% | 22.7% |
| FindVehicle | 35.6% | 40.3% | 40.6% | 39.1% |
| GENIA_NER | 50.1% | 53.8% | 56.8% | 60.1% |
| HarveyNER | 15.0% | 10.6% | 12.6% | 14.7% |
| MultiNERD | 64.6% | 66.0% | 68.0% | 64.0% |
| Ontonotes | 31.4% | 31.9% | 33.3% | 32.5% |
| PolyglotNER | 45.1% | 46.3% | 46.6% | 46.8% |
| TweetNER7 | 36.9% | 40.9% | 40.4% | 41.7% |
| WikiANN en | 52.3% | 54.0% | 54.9% | 56.3% |
| WikiNeural | 78.0% | 79.9% | 80.0% | 76.6% |
| bc2gm | 58.1% | 59.9% | 62.7% | 61.4% |
| bc4chemd | 45.8% | 49.1% | 53.6% | 50.5% |
| bc5cdr | 68.5% | 71.5% | 73.0% | 71.7% |
| ncbi | 65.9% | 65.4% | 65.2% | 65.9% |
| **Average** | **47.0%** | **48.8%** | **49.7%** | **49.7%** |

### CrossNER Zero-Shot Benchmark

| Dataset | gliner-bi-edge-v2.0 | gliner-bi-small-v2.0 | gliner-bi-base-v2.0 | gliner-bi-large-v2.0 |
|---------|---------------------|----------------------|---------------------|----------------------|
| CrossNER_AI | 53.8% | 54.7% | 58.3% | 57.4% |
| CrossNER_literature | 56.2% | 62.6% | 65.2% | 63.2% |
| CrossNER_music | 68.2% | 72.3% | 73.4% | 74.0% |
| CrossNER_politics | 68.7% | 70.0% | 70.8% | 73.0% |
| CrossNER_science | 63.2% | 66.1% | 68.0% | 67.6% |
| mit-movie | 30.5% | 35.2% | 46.2% | 51.0% |
| mit-restaurant | 37.1% | 39.5% | 40.3% | 44.3% |
| **Average (Zero-Shot Benchmark)** | **54.0%** | **57.2%** | **60.3%** | **61.5%** |

### Inference Speed Comparison

Throughput (examples/second) by number of entity types on H100 GPU (batch_size=1):

| Model | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | **Avg** |
|-------|---|---|---|---|----|----|----|-----|-----|-----|------|---------|
| gliner-bi-edge-v2.0 | 17.0 | 27.0 | 5.05 | 22.4 | 17.5 | 13.9 | 15.2 | 12.5 | 10.8 | 5.43 | 3.23 | **13.64** |
| gliner-bi-edge-v2.0 (pre-computed) | 19.3 | 25.0 | 28.2 | 32.6 | 31.0 | 32.6 | 22.2 | 22.7 | 22.2 | 16.9 | 18.3 | **24.62** |
| gliner-bi-small-v2.0 | 12.5 | 12.8 | 5.98 | 11.6 | 10.6 | 9.43 | 6.94 | 7.35 | 5.74 | 3.33 | 1.60 | **7.99** |
| gliner-bi-small-v2.0 (pre-computed) | 14.7 | 15.9 | 14.3 | 15.3 | 15.4 | 15.4 | 15.6 | 15.3 | 15.5 | 15.7 | 14.3 | **15.22** |
| gliner-bi-base-v2.0 | 8.13 | 8.62 | 4.85 | 8.00 | 7.52 | 6.76 | 5.71 | 5.21 | 4.64 | 3.21 | 2.30 | **5.91** |
| gliner-bi-base-v2.0 (pre-computed) | 9.52 | 10.2 | 9.80 | 9.95 | 10.0 | 9.93 | 8.93 | 6.71 | 9.35 | 9.71 | 10.5 | **9.51** |
| gliner-bi-large-v2.0 | 3.52 | 2.53 | 3.87 | 3.50 | 3.66 | 3.19 | 1.90 | 2.46 | 2.39 | 1.62 | 0.87 | **2.68** |
| gliner-bi-large-v2.0 (pre-computed) | 4.37 | 4.07 | 4.53 | 4.54 | 4.47 | 3.46 | 3.85 | 3.04 | 2.82 | 1.84 | 2.64 | **3.60** |
| | | | | | | | | | | | | |
| gliner_small-v2.5 (uni-encoder) | 10.7 | 14.6 | 14.1 | 13.2 | 11.9 | 10.3 | 7.91 | 4.26 | 1.29 | 0.43 | 0.14 | **8.08** |
| gliner_medium-v2.5 (uni-encoder) | 7.81 | 8.51 | 8.39 | 7.58 | 7.12 | 5.62 | 4.18 | 2.19 | 0.68 | 0.23 | 0.07 | **4.76** |
| gliner_large-v2.5 (uni-encoder) | 2.89 | 3.28 | 3.29 | 2.90 | 2.61 | 2.33 | 1.71 | 1.12 | 0.31 | 0.09 | 0.03 | **1.87** |

**Key Insight**: The bi-encoder with pre-computed embeddings maintains near-constant speed (5.2% degradation from 1 to 1024 labels), while the uni-encoder degrades by 98.7%.

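The degradation figures can be reproduced from the throughput table (comparing the edge bi-encoder with pre-computed embeddings against the small uni-encoder):

```python
# Throughput at 1 and 1024 entity types, from the table above (ex/s).
bi_precomputed = (19.3, 18.3)   # gliner-bi-edge-v2.0 (pre-computed)
uni_small      = (10.7, 0.14)   # gliner_small-v2.5 (uni-encoder)

def degradation(t_1, t_1024):
    """Relative throughput loss going from 1 to 1024 labels."""
    return (t_1 - t_1024) / t_1

d_bi = degradation(*bi_precomputed)
d_uni = degradation(*uni_small)

print(f"bi (pre-computed): {d_bi:.1%} degradation")   # prints: bi (pre-computed): 5.2% degradation
print(f"uni-encoder:       {d_uni:.1%} degradation")  # prints: uni-encoder:       98.7% degradation
```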
## Use Cases

### Biomedical Entity Linking
Process millions of documents against UMLS (4M+ concepts), SNOMED CT, or other large medical ontologies with pre-computed embeddings.

### Enterprise Knowledge Extraction
Deploy dynamic taxonomies that evolve without model retraining. Add new entity types instantly by computing their embeddings.

### Scientific Literature Mining
Extract entities across multiple specialized domains (chemistry, biology, physics) with domain-specific label encoders.

## Entity Linking with GLiNKER

GLiNER-bi-Encoder extends naturally to entity linking through the **GLiNKER** framework, a modular DAG-based pipeline for:
- Mention extraction with GLiNER
- Candidate retrieval from knowledge bases via pre-computed embeddings
- Entity disambiguation using bi-encoder scoring

**Learn more**: [GLiNKER Repository](https://github.com/Knowledgator/GLinker)

## Model Details

### Training Data
- **Pre-training**: 8M samples (Large/Base/Small), 10M samples (Edge) from FineFineWeb, annotated with GPT-4o
- **Post-training**: 40K high-quality samples with sequences up to 2048 tokens for long-context refinement

### Training Configuration
- **Focal Loss**: α=0.7 (pre-training), α=0.8 (post-training), γ=2.0
- **Optimizer**: AdamW with differential learning rates (encoder: 1e-5, other: 3e-5)
- **Context Length**: 1024 tokens (pre-training), 2048 tokens (post-training)
- **Maximum Span Width**: 12 tokens
- **Dropout**: 0.35

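For reference, the focal loss parameters above plug into the standard binary focal loss formulation (Lin et al.); a minimal sketch with the pre-training settings, assuming that standard form is what the training code uses:

```python
import math

def binary_focal_loss(p, y, alpha=0.7, gamma=2.0):
    """Focal loss for one predicted probability p with binary target y.

    alpha weights the positive class; (1 - p_t)**gamma down-weights easy
    examples. alpha=0.7, gamma=2.0 match the pre-training configuration.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy positive (p=0.95) contributes far less loss than a hard one (p=0.3),
# which keeps training focused on ambiguous span/label pairs.
easy = binary_focal_loss(0.95, 1)
hard = binary_focal_loss(0.3, 1)
```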
## Citation

If you use GLiNER-bi-Encoder in your research, please cite:
```bibtex
@misc{stepanov2024glinermultitask,
    title={GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks},
    author={Ihor Stepanov and Mykhailo Shtopko},
    year={2024},
    eprint={2406.12925},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

## Acknowledgments

We sincerely thank Urchade Zaratiana (creator of GLiNER) and Tom Aarsen (maintainer of Sentence Transformers) for their foundational work.

## Join Our Community

Connect with our community on Discord for news, support, and discussions: [Join Discord](https://discord.gg/HbW9aNJ9)

## Resources

- **Paper**: [arXiv preprint (coming soon)](https://arxiv.org)
- **GLiNKER Framework**: [GLiNKER](https://github.com/Knowledgator/GLinker)
- **Model Collection**: [HuggingFace Collection](https://hf.co/collections/knowledgator/gliner-bi-v2)

---

**Knowledgator Engineering © 2026**