protonx-models/protonx-legal-tc

Tags: Safetensors, Vietnamese, protonx-text-correction, text-to-text

ngoc committed 14 days ago

Commit 937bbbb (1 parent: 6f04821)

update readme and license

Files changed (2)

LICENSE.md +5 -5
README.md +15 -12

LICENSE.md CHANGED

@@ -1,7 +1,7 @@
-# **ProtonX Text Correction Model LICENSE AGREEMENT (v1.2-NC)**
-**Effective Date:** 21 November 2025
 **Copyright Holder:** PROTONX TECHNOLOGY COMPANY LIMITED
@@ -16,8 +16,8 @@ WHEREAS, Licensor has developed the ProtonX Text Correction model and intends to
 WHEREAS, traditional open-source licenses (e.g., MIT) do not fully address complexities specific to AI language models—including model weights, fine-tuning data, privacy of user text, and downstream liabilities;
 NOW, THEREFORE, the parties agree as follows.
-Where this Agreement conflicts with the MIT License, the MIT License prevails.
-Where MIT is silent, this Agreement supplements it.
 ---
@@ -41,7 +41,7 @@ Where MIT is silent, this Agreement supplements it.
 (d) documentation, metadata, and configuration files
 The authoritative version is hosted at:
-**[hhttps://github.com/protonx-engineering/protonx-text-correction](hhttps://github.com/protonx-engineering/protonx-text-correction)**
 ### **1.4 Outputs**

+# **ProtonX Text Correction Model LICENSE AGREEMENT (v1.3-NC)**
+**Effective Date:** 27 November 2025
 **Copyright Holder:** PROTONX TECHNOLOGY COMPANY LIMITED
 WHEREAS, traditional open-source licenses (e.g., MIT) do not fully address complexities specific to AI language models—including model weights, fine-tuning data, privacy of user text, and downstream liabilities;
 NOW, THEREFORE, the parties agree as follows.
+The MIT License applies solely to code components.
+Model weights, fine-tuned results, and Outputs are licensed under this Agreement.
 ---
 (d) documentation, metadata, and configuration files
 The authoritative version is hosted at:
+**[https://github.com/protonx-engineering/protonx-text-correction](hhttps://github.com/protonx-engineering/protonx-text-correction)**
 ### **1.4 Outputs**

README.md CHANGED

@@ -14,7 +14,7 @@ language:
 </p>
 <h1 align="center">
-High-Accuracy Vietnamese Legal Document Correction
 </h1>
 [![GitHub](https://img.shields.io/badge/ProtonX-GitHub-black?logo=github)](https://github.com/protonx-engineering/protonx-text-correction)
@@ -27,11 +27,12 @@ High-Accuracy Vietnamese Legal Document Correction
 ## **Introduction**
-### **ProtonX Legal Text Correction (v1.2-NC)**
-A **specialized Vietnamese correction model** engineered for **high-accuracy OCR post-processing**, especially **to fix noisy PaddleOCR outputs** in enterprise and legal workflows.
-#### **Best Use Case (Primary Focus)**: **Fixing PaddleOCR text errors**
 <img src="https://protonx.co/assets/img/paddle-ocr-protonx.png">
@@ -61,17 +62,19 @@ Strict constraints ensure:
 ## **LICENSE**
-This model is released under the ProtonX Text Correction Model License (v1.2-NC).
 See [LICENSE.md](./LICENSE.md) for full terms, conditions, and usage restrictions.
-## **Highlights**
-1. **ROUGE-L: 98.44**
-- Achieved on the ProtonX Legal Correction Validation Dataset. The evaluation dataset will be released in an upcoming public release.
 ---
@@ -107,7 +110,7 @@ for text in examples:
         outputs = model.generate(
             **inputs,
             num_beams=10,
-            max_new_tokens=32,
             length_penalty=1.0,
             early_stopping=True,
             repetition_penalty=1.2,
@@ -131,7 +134,7 @@ for text in examples:
 | Metric        | Score     |
 | ------------- | --------- |
-| **ROUGE-L**   | **98.44** |
 ---
@@ -141,7 +144,7 @@ for text in examples:
 * Model: Seq2Seq Transformer
 * Legal-domain augmentation
 * Beam search decoding
-* Max sequence length: 64 tokens total (32 tokens for input and 32 tokens for output).
 * High-precision diacritic + punctuation restoration
 ### Domain Coverage:
@@ -194,7 +197,7 @@ Căn cứ Hiến pháp nước Cộng hòa xã hội chủ nghĩa Việt Nam;
 ## **Future Work**
 * Achieving even higher ROUGE-L performance on legal-domain datasets
-* Extending maximum sequence length from 64 to 256 tokens for long-clause legal documents
 ---
 ## **Acknowledgments**

 </p>
 <h1 align="center">
+High-Accuracy Vietnamese Text Correction v1.3
 </h1>
 [![GitHub](https://img.shields.io/badge/ProtonX-GitHub-black?logo=github)](https://github.com/protonx-engineering/protonx-text-correction)
 ## **Introduction**
+<img src="https://storage.googleapis.com/mle-courses-prod/users/61b6fa1ba83a7e37c8309756/private-files/1795a9d0-cb4d-11f0-a59b-27096d42dd86-Screen_Shot_2025-11-27_at_11.53.12.png">
+### **ProtonX Text Correction (v1.3-NC)**
+A specialized Vietnamese text correction model engineered for high-accuracy normalization of legal and enterprise text. Optimized for OCR post-processing (including PaddleOCR outputs), but also capable of cleaning broader Vietnamese text with diacritic restoration, segmentation repair, and correction of domain-specific terminology.
 <img src="https://protonx.co/assets/img/paddle-ocr-protonx.png">
 ## **LICENSE**
+This model is released under the ProtonX Text Correction Model License (v1.3-NC).
 See [LICENSE.md](./LICENSE.md) for full terms, conditions, and usage restrictions.
+## **Current Version**: v1.3
+## **Highlights**
+1. **ROUGE-L: Coming soon**
+- Achieved on the ProtonX Legal Correction Validation Dataset. The evaluation dataset will be released in an upcoming public release.
+- Extended maximum sequence length from 32 tokens in v1.2 to 128 tokens in this release.
 ---
         outputs = model.generate(
             **inputs,
             num_beams=10,
+            max_new_tokens=128,
             length_penalty=1.0,
             early_stopping=True,
             repetition_penalty=1.2,
 | Metric        | Score     |
 | ------------- | --------- |
+| **ROUGE-L**   | **Coming soon** |
 ---
 * Model: Seq2Seq Transformer
 * Legal-domain augmentation
 * Beam search decoding
+* Max sequence length: 256 tokens total (128 tokens for input and 128 tokens for output).
 * High-precision diacritic + punctuation restoration
 ### Domain Coverage:
 ## **Future Work**
 * Achieving even higher ROUGE-L performance on legal-domain datasets
+* Extending maximum sequence length from 128 to 1024 tokens for long-clause legal documents
 ---
 ## **Acknowledgments**
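
Note on the README hunks: the commit raises `max_new_tokens` from 32 to 128 and documents a 256-token budget (128 input, 128 output). A minimal sketch of how those settings could be kept consistent in one place follows. The helper names `build_generate_kwargs` and `build_tokenizer_kwargs` are hypothetical, the tokenizer arguments are an assumption (the diff does not show the tokenizer call), and only the decoding arguments visible in the hunk are reproduced.

```python
# Token budget as documented in the updated README
# ("256 tokens total (128 tokens for input and 128 tokens for output)").
MAX_INPUT_TOKENS = 128
MAX_OUTPUT_TOKENS = 128


def build_generate_kwargs(max_new_tokens: int = MAX_OUTPUT_TOKENS) -> dict:
    """Return the decoding arguments visible in the README hunk.

    Rejecting values above the documented output budget is an assumption
    made for this sketch; the README does not state how the model behaves
    beyond 128 new tokens.
    """
    if not 1 <= max_new_tokens <= MAX_OUTPUT_TOKENS:
        raise ValueError(
            f"max_new_tokens must be between 1 and {MAX_OUTPUT_TOKENS}, "
            f"got {max_new_tokens}"
        )
    return {
        "num_beams": 10,
        "max_new_tokens": max_new_tokens,
        "length_penalty": 1.0,
        "early_stopping": True,
        "repetition_penalty": 1.2,
    }


def build_tokenizer_kwargs() -> dict:
    """Hypothetical tokenizer arguments that truncate inputs to the input budget."""
    return {
        "return_tensors": "pt",
        "truncation": True,
        "max_length": MAX_INPUT_TOKENS,
    }
```

With a tokenizer and model loaded as in the README (not shown in this diff), the call would look like `inputs = tokenizer(text, **build_tokenizer_kwargs())` followed by `model.generate(**inputs, **build_generate_kwargs())`.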