Commit 937bbbb · 1 Parent(s): 6f04821
Committed by ngoc

update readme and license

Files changed (2)
  1. LICENSE.md +5 -5
  2. README.md +15 -12
LICENSE.md CHANGED
@@ -1,7 +1,7 @@
- # **ProtonX Text Correction Model LICENSE AGREEMENT (v1.2-NC)**
+ # **ProtonX Text Correction Model LICENSE AGREEMENT (v1.3-NC)**


- **Effective Date:** 21 November 2025
+ **Effective Date:** 27 November 2025

  **Copyright Holder:** PROTONX TECHNOLOGY COMPANY LIMITED

@@ -16,8 +16,8 @@ WHEREAS, Licensor has developed the ProtonX Text Correction model and intends to
  WHEREAS, traditional open-source licenses (e.g., MIT) do not fully address complexities specific to AI language models—including model weights, fine-tuning data, privacy of user text, and downstream liabilities;

  NOW, THEREFORE, the parties agree as follows.
- Where this Agreement conflicts with the MIT License, the MIT License prevails.
- Where MIT is silent, this Agreement supplements it.
+ The MIT License applies solely to code components.
+ Model weights, fine-tuned results, and Outputs are licensed under this Agreement.

  ---

@@ -41,7 +41,7 @@ Where MIT is silent, this Agreement supplements it.
  (d) documentation, metadata, and configuration files

  The authoritative version is hosted at:
- **[hhttps://github.com/protonx-engineering/protonx-text-correction](hhttps://github.com/protonx-engineering/protonx-text-correction)**
+ **[https://github.com/protonx-engineering/protonx-text-correction](hhttps://github.com/protonx-engineering/protonx-text-correction)**

  ### **1.4 Outputs**

README.md CHANGED
@@ -14,7 +14,7 @@ language:
  </p>

  <h1 align="center">
- High-Accuracy Vietnamese Legal Document Correction
+ High-Accuracy Vietnamese Text Correction v1.3
  </h1>

  [![GitHub](https://img.shields.io/badge/ProtonX-GitHub-black?logo=github)](https://github.com/protonx-engineering/protonx-text-correction)
@@ -27,11 +27,12 @@ High-Accuracy Vietnamese Legal Document Correction

  ## **Introduction**

- ### **ProtonX Legal Text Correction (v1.2-NC)**
+ <img src="https://storage.googleapis.com/mle-courses-prod/users/61b6fa1ba83a7e37c8309756/private-files/1795a9d0-cb4d-11f0-a59b-27096d42dd86-Screen_Shot_2025-11-27_at_11.53.12.png">

- A **specialized Vietnamese correction model** engineered for **high-accuracy OCR post-processing**, especially **to fix noisy PaddleOCR outputs** in enterprise and legal workflows.
+ ### **ProtonX Text Correction (v1.3-NC)**
+
+ A specialized Vietnamese text correction model engineered for high-accuracy normalization of legal and enterprise text. Optimized for OCR post-processing (including PaddleOCR outputs), but also capable of cleaning broader Vietnamese text with diacritic restoration, segmentation repair, and correction of domain-specific terminology.

- #### **Best Use Case (Primary Focus)**: **Fixing PaddleOCR text errors**

  <img src="https://protonx.co/assets/img/paddle-ocr-protonx.png">

@@ -61,17 +62,19 @@ Strict constraints ensure:

  ## **LICENSE**

- This model is released under the ProtonX Text Correction Model License (v1.2-NC).
+ This model is released under the ProtonX Text Correction Model License (v1.3-NC).

  See [LICENSE.md](./LICENSE.md) for full terms, conditions, and usage restrictions.

- ## **Highlights**
+ ## **Current Version**: v1.3


- 1. **ROUGE-L: 98.44**
- - Achieved on the ProtonX Legal Correction Validation Dataset. The evaluation dataset will be released in an upcoming public release.
+ ## **Highlights**


+ 1. **ROUGE-L: Coming soon**
+ - Achieved on the ProtonX Legal Correction Validation Dataset. The evaluation dataset will be released in an upcoming public release.
+ - Extended maximum sequence length from 32 tokens in v1.2 to 128 tokens in this release.


  ---
@@ -107,7 +110,7 @@ for text in examples:
      outputs = model.generate(
          **inputs,
          num_beams=10,
-         max_new_tokens=32,
+         max_new_tokens=128,
          length_penalty=1.0,
          early_stopping=True,
          repetition_penalty=1.2,
@@ -131,7 +134,7 @@ for text in examples:

  | Metric | Score |
  | ------------- | --------- |
- | **ROUGE-L** | **98.44** |
+ | **ROUGE-L** | **Coming soon** |

  ---

@@ -141,7 +144,7 @@ for text in examples:
  * Model: Seq2Seq Transformer
  * Legal-domain augmentation
  * Beam search decoding
- * Max sequence length: 64 tokens total (32 tokens for input and 32 tokens for output).
+ * Max sequence length: 256 tokens total (128 tokens for input and 128 tokens for output).
  * High-precision diacritic + punctuation restoration

  ### Domain Coverage:
@@ -194,7 +197,7 @@ Căn cứ Hiến pháp nước Cộng hòa xã hội chủ nghĩa Việt Nam;
  ## **Future Work**

  * Achieving even higher ROUGE-L performance on legal-domain datasets
- * Extending maximum sequence length from 64 to 256 tokens for long-clause legal documents
+ * Extending maximum sequence length from 128 to 1024 tokens for long-clause legal documents
  ---

  ## **Acknowledgments**
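
Note on the new 128-token input limit: the README changes above state a budget of 128 input tokens and 128 output tokens, so longer legal clauses would presumably need to be split before correction. The sketch below is not part of this commit. It shows one possible greedy chunking approach, and the names `chunk_by_token_budget` and `whitespace_token_count` are hypothetical. The whitespace counter is only a stand-in so the example runs without downloading anything; with the real model, a count based on its tokenizer (for example `len(tokenizer(text)["input_ids"])`) would likely be the better choice.

```python
from typing import Callable, List

# Input budget taken from the v1.3 README ("128 tokens for input").
MAX_INPUT_TOKENS = 128


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


def chunk_by_token_budget(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = MAX_INPUT_TOKENS,
) -> List[str]:
    """Greedily pack words into chunks whose token count stays within max_tokens.

    A single word that exceeds the budget on its own is emitted as its own
    chunk, because this sketch never splits inside a word.
    """
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = current + [word]
        # Close the current chunk when adding the next word would overflow it.
        if current and count_tokens(" ".join(candidate)) > max_tokens:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(300))
    for piece in chunk_by_token_budget(sample, whitespace_token_count):
        print(whitespace_token_count(piece))
```

Each chunk could then be passed through the `model.generate(...)` call shown in the README diff with `max_new_tokens=128`. Splitting on sentence or clause boundaries instead of raw word counts may preserve more context for the model, at the cost of a slightly more involved splitter.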