Chinese AI-Generated Text Detector — BERT v11c (Boundary-Fix)
Chinese AI-generated text detector (final version of an undergraduate thesis project)
A fine-tuned BERT model that classifies Chinese text as either human-written (0) or AI-generated (1). The main released model is a document-level binary classifier; mixed-text boundary detection is an experimental extension provided by a separate span model.
📌 Overview
A binary classifier fine-tuned from bert-base-chinese for Chinese AI-generated text detection. It is the final production checkpoint (v11c boundary-fix) of the undergraduate thesis project "A Chinese AI-Generated Text Detection System Based on Fine-Tuned BERT". The main inference path outputs a Human / AI binary label. [SEP] boundary markers and the token-level span detector are companion experimental extensions for segment-level analysis of constructed human/AI mixed-text samples, not the default production inference path.
📊 Evaluation
| Dataset | Samples | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Validation set | 7,452 | 98.75 % | 98.30 % | 99.37 % | 98.83 % |
| `core_v1_test_clean` | 545 | 97.98 % | 97.87 % | 98.77 % | 98.32 % |
| Independent eval (910) | 910 | 98.57 % | 93.08 % | 98.67 % | 95.79 % |
| Three-set average | — | 98.56 % | — | — | — |
The metrics above evaluate the document-level binary classifier. The historical token-level boundary result belongs to the separate `chinese-ai-detector-span` experimental model and should not be mixed with the main classifier metrics.
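As a quick consistency check on the table, F1 can be recomputed from the reported precision and recall as their harmonic mean. The sketch below only uses figures from the table; the function and variable names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in, same units out)."""
    return 2 * precision * recall / (precision + recall)

# (precision %, recall %, reported F1 %) copied from the evaluation table
reported = {
    "Validation set": (98.30, 99.37, 98.83),
    "core_v1_test_clean": (97.87, 98.77, 98.32),
    "Independent eval (910)": (93.08, 98.67, 95.79),
}

recomputed = {name: f1_score(p, r) for name, (p, r, _) in reported.items()}

for name, (_, _, f1_reported) in reported.items():
    print(f"{name}: recomputed {recomputed[name]:.2f} %, reported {f1_reported:.2f} %")
```

The recomputed values agree with the reported F1 column to within rounding.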
Independent eval by source (selected)
| Source | Samples | Accuracy |
|---|---|---|
| Toutiao News (all) | 377 | 100.0 % |
| Wikipedia CN | 119 | 99.16 % |
| formal_collected | 200 | 96.5 % |
| real_ai_gemini-3-pro-preview | 24 | 100.0 % |
| real_ai_deepseek-v3.2 | 8 | 100.0 % |
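Per-source accuracies are easier to interpret next to the sample counts, since a single error moves a small source by several points. The sketch below converts each row into an implied number of misclassified samples; it assumes the percentages are rounded from whole-sample counts, and the helper name is illustrative.

```python
# (samples, accuracy %) copied from the per-source table
sources = {
    "Toutiao News (all)": (377, 100.0),
    "Wikipedia CN": (119, 99.16),
    "formal_collected": (200, 96.5),
    "real_ai_gemini-3-pro-preview": (24, 100.0),
    "real_ai_deepseek-v3.2": (8, 100.0),
}

def implied_errors(samples: int, accuracy_pct: float) -> int:
    """Nearest whole number of misclassified samples for a given accuracy."""
    return round(samples * (1 - accuracy_pct / 100))

errors = {name: implied_errors(n, acc) for name, (n, acc) in sources.items()}
selected_total = sum(n for n, _ in sources.values())

for name, (n, _) in sources.items():
    print(f"{name}: {errors[name]} implied errors out of {n}")
print(f"Selected sources cover {selected_total} of the 910 samples")
```

Note that the two `real_ai_*` rows rest on 24 and 8 samples, so their 100 % figures carry wide uncertainty.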
🏗️ Architecture
- Base model: `bert-base-chinese` (12 layers, hidden 768, 12 heads, vocab 21,128)
- Head: `BertForSequenceClassification` (2 labels: `0 = human`, `1 = AI`)
- Max sequence length: 256 tokens (train), 512 (supported)
- Framework: `transformers 4.57.3`, PyTorch 2.0+
- Parameters: ~102M
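The ~102M parameter figure can be reproduced from the dimensions listed above. The sketch below counts weights and biases for a BERT encoder with a pooler and a linear classification head. The feed-forward size (3072) and the token-type vocabulary (2) are assumptions taken from the standard BERT-base configuration, not from this card; the function name is illustrative.

```python
def bert_classifier_param_count(
    vocab_size: int = 21128,
    hidden: int = 768,
    layers: int = 12,
    intermediate: int = 3072,   # assumption: standard BERT-base feed-forward size
    max_positions: int = 512,
    type_vocab: int = 2,        # assumption: standard BERT segment vocabulary
    num_labels: int = 2,
) -> int:
    layer_norm = 2 * hidden  # scale + bias
    # Word, position and token-type embeddings, followed by LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (weights + biases), then LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers of the feed-forward block, then LayerNorm
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )
    encoder = layers * (attention + feed_forward)
    pooler = hidden * hidden + hidden
    classifier = hidden * num_labels + num_labels
    return embeddings + encoder + pooler + classifier

total = bert_classifier_param_count()
print(f"{total:,} parameters ({total / 1e6:.1f}M)")
```

The number of attention heads does not appear in the count because splitting the hidden dimension across heads does not change the size of the projection matrices.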
Training configuration
| Setting | Value |
|---|---|
| Base model | bert-base-chinese (via bert_v7_improved intermediate checkpoint) |
| Train samples | 63,113 |
| Validation samples | 7,452 |
| Epochs | 5 (best at epoch 2) |
| Batch size | 8 × 4 grad accum |
| Learning rate | 1e-5 |
| Label smoothing | 0.05 |
| Max length | 256 |
| Early stopping patience | 2 |
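Two rows of the table benefit from a worked example: the effective batch size implied by gradient accumulation, and the effect of label smoothing on the loss. The pure-Python sketch below assumes the convention used by PyTorch's cross-entropy loss, where the smoothing mass is spread uniformly over all classes; the card does not state which convention was used, and the function names are illustrative.

```python
import math

BATCH_SIZE = 8
GRAD_ACCUM_STEPS = 4
LABEL_SMOOTHING = 0.05

# One optimizer step sees BATCH_SIZE * GRAD_ACCUM_STEPS samples
effective_batch_size = BATCH_SIZE * GRAD_ACCUM_STEPS

def smoothed_targets(true_label: int, num_labels: int = 2, eps: float = LABEL_SMOOTHING):
    """Target distribution with eps spread uniformly over all classes (assumed convention)."""
    off_value = eps / num_labels
    return [1.0 - eps + off_value if i == true_label else off_value
            for i in range(num_labels)]

def cross_entropy(logits, true_label: int, eps: float = LABEL_SMOOTHING) -> float:
    """Cross-entropy between softmax(logits) and the smoothed target distribution."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    log_probs = [x - log_z for x in logits]
    targets = smoothed_targets(true_label, len(logits), eps)
    return -sum(t * lp for t, lp in zip(targets, log_probs))

confident_logits = [0.0, 4.0]  # toy logits strongly favouring class 1 (AI)
plain_loss = cross_entropy(confident_logits, true_label=1, eps=0.0)
smoothed_loss = cross_entropy(confident_logits, true_label=1)

print(f"effective batch size: {effective_batch_size}")
print(f"targets for an AI sample: {smoothed_targets(1)}")
print(f"loss without smoothing: {plain_loss:.4f}, with smoothing: {smoothed_loss:.4f}")
```

With smoothing, a confidently correct prediction still incurs a small loss, which discourages the classifier from pushing logits to extremes.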
Data changes vs. v10 baseline
- Removed 750 hard patterns + 1,767 unapproved samples + 7 length violations
- Added 300 formal-collected weak-domain samples
- Added 300 Llama-405B weak-domain samples
- Added 2,131 long-AI boundary-fix samples (the key v11c contribution)
- Net change: +207 rows vs. v10
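The net change follows directly from the removals and additions listed above. A short check, using only the counts from the list (dictionary keys are shortened labels):

```python
removed = {
    "hard patterns": 750,
    "unapproved samples": 1767,
    "length violations": 7,
}
added = {
    "formal-collected weak-domain": 300,
    "Llama-405B weak-domain": 300,
    "long-AI boundary-fix": 2131,
}

total_removed = sum(removed.values())
total_added = sum(added.values())
net_change = total_added - total_removed

print(f"removed {total_removed}, added {total_added}, net {net_change:+d} rows vs. v10")
```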
🚀 Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL_ID = "AnxForever/chinese-ai-detector-bert"
TEMPERATURE = 0.8165  # Temperature scaling, calibrated on 910 samples (ECE=0.0034)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

# Sample Chinese input: "This is a piece of Chinese text to be checked."
text = "这是一段需要检测的中文文本。"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    logits = model(**inputs).logits

# Apply temperature scaling for calibrated confidence
probs = torch.softmax(logits / TEMPERATURE, dim=-1)[0]
pred_idx = int(probs.argmax())
label = model.config.id2label[pred_idx]  # "human-written" or "AI-generated"
print(f"{label} (confidence: {probs[pred_idx].item():.2%})")
```
Label mapping
- `0` → human-written
- `1` → AI-generated
Note on Temperature Scaling: `T = 0.8165` was calibrated on a held-out 910-sample set and brings ECE from 0.0121 down to 0.0034. For uncalibrated probabilities, set `TEMPERATURE = 1.0`.
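The note above involves two operations: dividing the logits by T before the softmax, and measuring calibration with the expected calibration error (ECE). A minimal pure-Python sketch of both follows. It assumes equal-width confidence bins and 10 bins by default; the card does not say which binning was used, and the function names and toy inputs are illustrative.

```python
import math

def softmax_with_temperature(logits, temperature: float = 1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Sample-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[index].append((conf, is_correct))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece

toy_logits = [-1.0, 2.0]  # toy values, not model output
uncalibrated = softmax_with_temperature(toy_logits, temperature=1.0)
calibrated = softmax_with_temperature(toy_logits, temperature=0.8165)
print(f"P(AI) at T=1.0: {uncalibrated[1]:.4f}, at T=0.8165: {calibrated[1]:.4f}")

# Toy calibration check: four predictions at 95 % confidence, three of them correct
toy_ece = expected_calibration_error([0.95] * 4, [True, True, True, False])
print(f"toy ECE: {toy_ece:.2f}")
```

Because T is below 1, scaling sharpens the distribution, so the calibrated confidence for the predicted class is higher than the raw softmax value.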
🎯 Contributions
- Data-centric risk governance: The v11c model keeps the BERT backbone fixed and improves robustness through data cleaning, weak-domain supplementation, long-AI supplementation, and calibrated inference.
- `[SEP]` boundary-marker experiment: In constructed C2-style mixed samples, `[SEP]` was used as an explicit boundary hint between known human and AI segments. This is an engineering experiment for mixed-text modeling, not a claim that `[SEP]` itself can identify authorship without labels.
- Two-stage experimental extension:
  - Stage 1: this model, document-level Human / AI classification
  - Stage 2: separate span detector, token-level Human / AI tagging on mixed-text samples
  - See `AnxForever/chinese-ai-detector-span`
- Long-AI boundary-fix (v11c): Long AI passages were prone to misclassification near boundaries, so 2,131 long-AI boundary samples were added, which restored accuracy in the 256+ token bucket to the v10 level.
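The 256+ token bucket mentioned above refers to accuracy broken down by input length. A minimal sketch of such a breakdown follows. Only the 256 threshold comes from this card; the other bucket edges, the record format, and the function names are hypothetical.

```python
from collections import defaultdict

def bucket_label(token_count: int, edges=(64, 128, 256)) -> str:
    """Map a token count to a length bucket; edges other than 256 are hypothetical."""
    lower = 0
    for edge in edges:
        if token_count < edge:
            return f"{lower}-{edge - 1}"
        lower = edge
    return f"{lower}+"

def accuracy_by_length_bucket(records, edges=(64, 128, 256)) -> dict:
    """records: iterable of (token_count, gold_label, predicted_label)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for token_count, gold, predicted in records:
        label = bucket_label(token_count, edges)
        totals[label] += 1
        hits[label] += int(gold == predicted)
    return {label: hits[label] / totals[label] for label in totals}

# Toy records, not real evaluation data
toy_records = [(10, 0, 0), (300, 1, 1), (400, 1, 0)]
print(accuracy_by_length_bucket(toy_records))
```

Tracking this breakdown across versions is one way to notice a regression that is confined to long inputs and hidden in the overall accuracy.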
Note on mixed-text boundary detection
The boundary module was trained on a relatively small constructed mixed-text set. It is useful for demonstration, teaching, and secondary development, but it should be treated as an experimental prototype. For real business scenarios, mixed human/AI data from the target domain should be collected, labeled, retrained, and evaluated before deployment.
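For readers who want to collect and label their own mixed human/AI data, the sketch below shows one possible way to assemble a constructed sample with a `[SEP]` boundary hint and per-character labels. The data format is hypothetical and is not taken from the released dataset or the span model. The use of `-100` for marker positions is an assumption that follows the common PyTorch ignore-index convention.

```python
HUMAN, AI = 0, 1
IGNORE = -100  # assumption: positions to skip in a token-level loss

def build_mixed_sample(segments, boundary_marker: str = "[SEP]"):
    """Join (text, label) segments with a boundary marker and emit per-character labels.

    Hypothetical helper: segment labels must already be known, which is why this
    only applies to constructed samples.
    """
    parts = []
    char_labels = []
    for index, (segment_text, label) in enumerate(segments):
        if index > 0:
            parts.append(boundary_marker)
            char_labels.extend([IGNORE] * len(boundary_marker))
        parts.append(segment_text)
        char_labels.extend([label] * len(segment_text))
    return "".join(parts), char_labels

# Toy segments: "A sentence written by a human." / "A sentence generated by a model."
mixed_text, mixed_labels = build_mixed_sample([
    ("人类写的句子。", HUMAN),
    ("模型生成的句子。", AI),
])
print(mixed_text)
print(mixed_labels)
```

Per-character labels would still need to be aligned to tokenizer output before training a token-level tagger; that step is omitted here.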
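⚠️ Limitations
- Chinese text only; no guarantees are made for English or other languages.
- The training corpus skews toward news, encyclopedia, technical, and formal writing, so the model may perform poorly on poetry, classical Chinese, and short social-media text.
- The default released capability is document-level binary classification. Mixed human/AI boundary localization is an experimental extension and is not recommended as the direct basis for commercial moderation decisions.
- The training data comes mainly from mainstream models such as DeepSeek, Gemini, GPT, and Llama-405B; heavily rewritten AI text may still be missed.
- No fixed accuracy is guaranteed for short text, heavily human-rewritten text, text that alternates between human and AI segments multiple times, or text outside the target domain.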
🗂️ Related
- 📊 Dataset: `AnxForever/chinese-ai-detection-dataset`
- 🎯 Span detector: `AnxForever/chinese-ai-detector-span`
📜 License
MIT License
✍️ Citation
```bibtex
@misc{anxforever2026chineseaidetectorbert,
  title        = {Chinese AI-Generated Text Detector with Boundary Markers (BERT v11c)},
  author       = {AnxForever},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/AnxForever/chinese-ai-detector-bert}},
  note         = {Undergraduate thesis project}
}
```