Whisper Telephony AMD (Answering Machine Detection)
A real-time audio classifier that detects whether a telephony call is answered by a human, voicemail, IVR system, or answering machine — using Whisper's speech understanding to distinguish human-recorded voicemail greetings from live speech.
Results
98.75% accuracy on 400 test samples with only 5 misclassifications:
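| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| human | 1.00 | 0.99 | 1.00 | 114 |
| voicemail | 0.96 | 0.99 | 0.98 | 102 |
| ivr | 1.00 | 0.99 | 0.99 | 92 |
| answering_machine | 0.99 | 0.98 | 0.98 | 92 |
| accuracy | | | 0.99 | 400 |
| macro avg | 0.99 | 0.99 | 0.99 | 400 |
| weighted avg | 0.99 | 0.99 | 0.99 | 400 |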
Confusion Matrix (rows = actual, columns = predicted):
| | Human | Voicemail | IVR | Answering Machine |
|---|---|---|---|---|
| Human | 113 | 1 | 0 | 0 |
| Voicemail | 0 | 101 | 0 | 1 |
| IVR | 0 | 1 | 91 | 0 |
| Answering Machine | 0 | 2 | 0 | 90 |
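The headline accuracy and the per-class precision, recall, and F1 figures can be re-derived from the confusion matrix alone. The sketch below does that arithmetic in plain Python; `per_class_metrics` is a hypothetical helper written for this illustration, not part of the repository.

```python
# Confusion matrix copied from this card (rows = actual, columns = predicted).
LABELS = ["human", "voicemail", "ivr", "answering_machine"]
CONFUSION = [
    [113, 1, 0, 0],
    [0, 101, 0, 1],
    [0, 1, 91, 0],
    [0, 2, 0, 90],
]


def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall, f1)} computed from a confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum: everything predicted as this class
        actual = sum(matrix[i])                    # row sum: everything that truly is this class
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if true_pos else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


total = sum(sum(row) for row in CONFUSION)
correct = sum(CONFUSION[i][i] for i in range(len(LABELS)))
accuracy = correct / total  # 395 / 400
metrics = per_class_metrics(CONFUSION, LABELS)

print(f"accuracy = {accuracy:.4f}")
for label, (p, r, f1) in metrics.items():
    print(f"{label:18s} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Rounded to two decimals, the values agree with the classification report: for example, voicemail precision is 101 / 105, because four non-voicemail samples were predicted as voicemail.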
Accuracy Per Epoch
| Epoch | Accuracy | Eval Loss | Per-Class |
|---|---|---|---|
| 1 | 98.75% | 0.0785 | human=99.1%, vm=99.0%, ivr=98.9%, am=97.8% |
| 2 | 95.75% | 0.1473 | human=94.7%, vm=93.1%, ivr=97.8%, am=97.8% |
| 3 | 98.25% | 0.0779 | human=97.4%, vm=100%, ivr=97.8%, am=97.8% |
| 4 | 98.75% | 0.0415 | human=99.1%, vm=99.0%, ivr=98.9%, am=97.8% |
| 5 | 98.75% | 0.0569 | human=99.1%, vm=98.0%, ivr=98.9%, am=98.9% |
| 6 | 98.00% | 0.0539 | human=97.4%, vm=99.0%, ivr=97.8%, am=97.8% |
Early stopping triggered after epoch 6 (patience=5, best at epoch 4). Best model loaded from epoch 4 checkpoint.
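The card does not say which metric the patience counter tracked. As a rough sketch under that uncertainty, the hypothetical helper `early_stop_epoch` below replays the per-epoch table with a strict-improvement patience counter (ties do not reset it). This is an illustration of the general technique, not the training script's actual logic.

```python
def early_stop_epoch(scores, patience, higher_is_better=True):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    stop_epoch is None if the patience budget is never exhausted.
    A tie with the best score does not count as an improvement.
    """
    best_score, best_epoch, stale = None, None, 0
    for epoch, score in enumerate(scores, start=1):
        improved = best_score is None or (
            score > best_score if higher_is_better else score < best_score
        )
        if improved:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Values copied from the per-epoch table in this card.
accuracy = [98.75, 95.75, 98.25, 98.75, 98.75, 98.00]
eval_loss = [0.0785, 0.1473, 0.0779, 0.0415, 0.0569, 0.0539]

print(early_stop_epoch(accuracy, patience=5))                           # (6, 1)
print(early_stop_epoch(eval_loss, patience=5, higher_is_better=False))  # (None, 4)
```

Under these assumptions, a counter on accuracy runs out after epoch 6, while epoch 4 is the epoch with the lowest eval loss. How the real run combined the two criteria is not confirmed by the card.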
Model Details
| Property | Value |
|---|---|
| Architecture | WhisperForAudioClassification (Whisper-Tiny encoder + linear classifier) |
| Base model | openai/whisper-tiny |
| Parameters | 8.3M total, 7.2M trainable (conv layers frozen) |
| Input | 16kHz mono audio → 80-bin mel spectrogram (30s padded) |
| Output | 4 classes: human, voicemail, ivr, answering_machine |
| Inference speed | ~12ms CPU (ONNX int8), <5ms GPU |
| Model size | 31.7 MB (safetensors) |
| Design reference | Same architecture as pipecat-ai/smart-turn-v3 |
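To make the Input row concrete, the following sketch works out the tensor shape implied by a 30-second padded window. The constants are the standard Whisper front-end defaults (10 ms hop, stride-2 convolution in the encoder); that this checkpoint keeps them unchanged is an assumption, and both function names are hypothetical.

```python
# Default Whisper front-end constants (assumed unchanged for this checkpoint).
SAMPLE_RATE = 16_000   # Hz
CHUNK_SECONDS = 30     # padded window length
HOP_LENGTH = 160       # samples between STFT frames (10 ms)
N_MELS = 80            # mel bins
ENCODER_STRIDE = 2     # the encoder's second conv layer halves the time axis


def whisper_input_shape(batch_size=1):
    """Shape of `input_features` and the number of encoder positions."""
    n_samples = SAMPLE_RATE * CHUNK_SECONDS          # 480_000
    n_frames = n_samples // HOP_LENGTH               # 3_000
    encoder_positions = n_frames // ENCODER_STRIDE   # 1_500
    return (batch_size, N_MELS, n_frames), encoder_positions


def speech_frames(duration_s, max_audio_s=10.0):
    """Mel frames that carry real audio after truncating to max_audio_s."""
    kept = min(duration_s, max_audio_s)
    return int(kept * SAMPLE_RATE) // HOP_LENGTH


print(whisper_input_shape())   # ((1, 80, 3000), 1500)
print(speech_frames(10.0))     # 1000
```

With the 10-second truncation listed under Hyperparameters, at most 1,000 of the 3,000 mel frames contain audio; the rest is padding.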
Quick Start
Pipeline (simplest)
from transformers import pipeline
classifier = pipeline("audio-classification", model="AbijahKaj/whisper-telephony-amd")
result = classifier("phone_call.wav")
print(result)
# [{'score': 0.98, 'label': 'human'}, {'score': 0.01, 'label': 'voicemail'}, ...]
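In a dialer, the list of scored labels usually has to be reduced to a single decision. A minimal sketch of that step follows, assuming the output format printed above; the helper `decide`, the 0.80 threshold, and the `"uncertain"` fallback are illustrative choices, not values from this card.

```python
def decide(results, threshold=0.80, fallback="uncertain"):
    """Pick the top label from audio-classification pipeline output.

    `results` is a list of {"score": float, "label": str} dicts.
    Returns (label, score); label is `fallback` when the top score
    is below `threshold` or when the list is empty.
    """
    if not results:
        return fallback, 0.0
    top = max(results, key=lambda r: r["score"])
    label = top["label"] if top["score"] >= threshold else fallback
    return label, top["score"]


# Illustrative scores, not real model output.
example = [
    {"score": 0.98, "label": "human"},
    {"score": 0.01, "label": "voicemail"},
    {"score": 0.007, "label": "ivr"},
    {"score": 0.003, "label": "answering_machine"},
]
print(decide(example))  # ('human', 0.98)
```

A fallback label lets the caller keep listening or route the call conservatively instead of acting on a low-confidence prediction.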
Manual Inference
from transformers import WhisperForAudioClassification, AutoFeatureExtractor
import torch
model = WhisperForAudioClassification.from_pretrained("AbijahKaj/whisper-telephony-amd")
fe = AutoFeatureExtractor.from_pretrained("AbijahKaj/whisper-telephony-amd")
# audio_array: 1-D float numpy array of mono audio sampled at 16 kHz
inputs = fe(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred = torch.argmax(logits, dim=-1).item()
# id2label keys are integers once the config is loaded, so index with the int
label = model.config.id2label[pred]
print(f"Predicted: {label}")
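The manual path returns raw logits, so a confidence score needs a softmax. The sketch below shows that step in plain Python so it runs without PyTorch; `ID2LABEL` is copied from the Classes table of this card, and `softmax` and `top_prediction` are hypothetical helpers.

```python
import math

# Label ids as listed in the Classes table of this card.
ID2LABEL = {0: "human", 1: "voicemail", 2: "ivr", 3: "answering_machine"}


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Made-up logits for illustration only.
label, prob = top_prediction([4.2, 0.3, -1.1, -0.8])
print(f"{label}: {prob:.1%}")
```

With a real model output, something like `logits[0].tolist()` would supply the list of floats for a single clip.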
Streaming Real-Time Inference
from streaming_amd import StreamingAMDClassifier
classifier = StreamingAMDClassifier("AbijahKaj/whisper-telephony-amd")
for pcm_chunk in audio_stream:  # 160ms chunks @ 8kHz
    result = classifier.process_chunk(pcm_chunk)
    if result:
        label, confidence, elapsed_s = result
        print(f"{label} ({confidence:.0%}) after {elapsed_s:.1f}s")
        break
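The loop above expects an `audio_stream` that yields 160 ms chunks at 8 kHz. As a sketch of how such chunks could be cut from a raw buffer, the code below assumes 16-bit mono PCM bytes; the sample format that `process_chunk` actually expects is not stated in this card, and both helper names are hypothetical.

```python
def chunk_size_bytes(sample_rate=8_000, chunk_ms=160, bytes_per_sample=2):
    """Bytes in one mono PCM chunk of the given duration."""
    samples = sample_rate * chunk_ms // 1000
    return samples * bytes_per_sample


def iter_chunks(pcm, size):
    """Yield consecutive fixed-size chunks; a short trailing remainder is dropped."""
    for start in range(0, len(pcm) - size + 1, size):
        yield pcm[start:start + size]


size = chunk_size_bytes()        # 1280 samples * 2 bytes = 2560
one_second = bytes(8_000 * 2)    # 1 s of 16-bit silence at 8 kHz
chunks = list(iter_chunks(one_second, size))
print(size, len(chunks))         # 2560 6
```

One second of audio yields six full chunks (960 ms), and the remaining 40 ms would wait for the next read in a live stream.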
Why Whisper?
Voicemail greetings are recorded by real humans, so they are acoustically almost indistinguishable from live speech. Traditional acoustic-only models (energy, pitch, spectral features) cannot reliably distinguish "Hi, I'm not available, leave a message" from "Hello? Who's calling?".
Whisper's encoder was pre-trained on 680K hours of speech, so its representations capture what is being said, not just how it sounds. This semantic understanding is critical for AMD.
Training
Dataset
AbijahKaj/telephony-amd-dataset — 8,264 train / 400 test samples, balanced across 4 classes (~2,000 each).
Data sources:
| Class | Count | Sources |
|---|---|---|
| Human | 2,151 | PolyAI/minds14 (real telephony callers, 6 languages), pipecat-ai/human_5_all, pipecat-ai/human_convcollector_1, original edge-tts |
| Voicemail | 2,078 | pipecat-ai smart-turn rime_2 TTS (personal greeting style), original edge-tts |
| IVR | 2,017 | pipecat-ai smart-turn chirp3 TTS (automated system style), original edge-tts |
| Answering Machine | 2,018 | pipecat-ai smart-turn orpheus TTS (machine greeting style), original edge-tts |
Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 1e-4 |
| Scheduler | Cosine with 25 warmup steps |
| Batch size | 32 |
| Gradient accumulation | 1 |
| Max epochs | 20 (early stopped at 6) |
| Weight decay | 0.01 |
| Precision | FP16 |
| Gradient checkpointing | Enabled |
| Freeze strategy | Conv layers frozen, transformer layers + head trainable |
| Early stopping patience | 5 |
| Max audio length | 10s (truncated, padded to 30s for Whisper) |
| Hardware | Tesla T4 (16GB VRAM) |
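To show what "Cosine with 25 warmup steps" means in practice, here is a sketch of the usual linear-warmup plus half-cosine formula, using the learning rate, batch size, and train-set size from this card. It assumes a single device, no dropped last batch, and a schedule spanning all 20 maximum epochs; `lr_at` is a hypothetical helper, not the training script's code.

```python
import math

# Values from the tables in this card; TOTAL_STEPS assumes the schedule spans all 20 max epochs.
BASE_LR = 1e-4
WARMUP_STEPS = 25
STEPS_PER_EPOCH = math.ceil(8_264 / 32)   # 259 optimizer steps per epoch
TOTAL_STEPS = STEPS_PER_EPOCH * 20        # 5_180


def lr_at(step, base_lr=BASE_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Linear warmup followed by a half-cosine decay to zero."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 25, STEPS_PER_EPOCH * 6, TOTAL_STEPS):
    print(f"step {step:5d}: lr = {lr_at(step):.2e}")
```

Because early stopping ended the run after epoch 6, only the first part of this curve would have been traversed under these assumptions.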
Framework Versions
- Transformers 5.7.0
- PyTorch 2.11.0+cu130
- Datasets 4.8.5
- Tokenizers 0.22.2
Classes
| Label | ID | Description | Example |
|---|---|---|---|
| human | 0 | Live person on the phone | "Hello? Yes, who is this?" |
| voicemail | 1 | Personal voicemail greeting | "Hi, you've reached John. Leave a message after the beep." |
| ivr | 2 | IVR system / automated menu | "Press 1 for sales, press 2 for support..." |
| answering_machine | 3 | Carrier/generic automated message | "The number you have dialed is not available..." |
Limitations
- Trained primarily on English, French, Spanish, and German audio
- TTS-generated non-human classes may not fully represent all real-world telephony systems
- Best performance on first 10 seconds of audio
- Not tested on noisy cellular connections or VoIP codec artifacts beyond telephony bandpass (300-3400Hz)
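- The model may confuse voicemail greetings with answering machine messages in edge cases (3 of the 5 test-set misclassifications: 2 answering machine samples predicted as voicemail, 1 voicemail sample predicted as answering machine)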
Files
- model.safetensors: Model weights (31.7 MB)
- config.json: Model configuration
- preprocessor_config.json: Feature extractor config
- streaming_amd.py: Streaming real-time inference module
- train_local.py: Training script (CLI args, RTX 5090 ready)
Citation
@misc{whisper-telephony-amd,
author = {AbijahKaj},
title = {Whisper Telephony AMD: Real-Time Answering Machine Detection},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/AbijahKaj/whisper-telephony-amd}
}
Model tree for AbijahKaj/whisper-telephony-amd
- Base model: openai/whisper-tiny
Datasets used to train AbijahKaj/whisper-telephony-amd
- pipecat-ai/smart-turn-data-v3.2-train
- AbijahKaj/telephony-amd-dataset