Whisper Telephony AMD (Answering Machine Detection)

A real-time audio classifier that detects whether a telephony call is answered by a human, voicemail, IVR system, or answering machine — using Whisper's speech understanding to distinguish human-recorded voicemail greetings from live speech.

Results

98.75% accuracy on 400 test samples with only 5 misclassifications:

                   precision    recall  f1-score   support

            human       1.00      0.99      1.00       114
        voicemail       0.96      0.99      0.98       102
              ivr       1.00      0.99      0.99        92
answering_machine       0.99      0.98      0.98        92

         accuracy                           0.99       400
        macro avg       0.99      0.99      0.99       400
     weighted avg       0.99      0.99      0.99       400

Confusion Matrix (rows = actual, columns = predicted):

	Human	Voicemail	IVR	Answering Machine
Human	113	1	0	0
Voicemail	0	101	0	1
IVR	0	1	91	0
Answering Machine	0	2	0	90
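
As a cross-check, the headline accuracy and the per-class figures in the report can be recomputed from the confusion matrix. The following is a minimal sketch in plain Python, with the counts copied from the matrix above; the variable names are illustrative.

```python
# Confusion matrix copied from above: rows = actual, columns = predicted.
labels = ["human", "voicemail", "ivr", "answering_machine"]
cm = [
    [113, 1, 0, 0],
    [0, 101, 0, 1],
    [0, 1, 91, 0],
    [0, 2, 0, 90],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(labels)))
accuracy = correct / total
errors = total - correct

per_class = {}
for i, name in enumerate(labels):
    predicted_as_i = sum(cm[r][i] for r in range(len(labels)))  # column sum
    actual_i = sum(cm[i])                                        # row sum
    per_class[name] = {
        "precision": cm[i][i] / predicted_as_i,
        "recall": cm[i][i] / actual_i,
    }

print(f"accuracy={accuracy:.4f}, errors={errors}")
for name, m in per_class.items():
    print(f"{name}: precision={m['precision']:.2f}, recall={m['recall']:.2f}")
```

The diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the total, precision divides by the column sum, and recall divides by the row sum.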

Accuracy Per Epoch

Epoch	Accuracy	Eval Loss	Per-Class Accuracy (vm = voicemail, am = answering_machine)
1	98.75%	0.0785	human=99.1%, vm=99.0%, ivr=98.9%, am=97.8%
2	95.75%	0.1473	human=94.7%, vm=93.1%, ivr=97.8%, am=97.8%
3	98.25%	0.0779	human=97.4%, vm=100%, ivr=97.8%, am=97.8%
4	98.75%	0.0415	human=99.1%, vm=99.0%, ivr=98.9%, am=97.8%
5	98.75%	0.0569	human=99.1%, vm=98.0%, ivr=98.9%, am=98.9%
6	98.00%	0.0539	human=97.4%, vm=99.0%, ivr=97.8%, am=97.8%

Early stopping triggered after epoch 6 (patience=5). The best model was loaded from the epoch 4 checkpoint, which ties the top accuracy (98.75%) and has the lowest eval loss (0.0415).
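
The card does not say which metric drove early stopping and which one selected the best checkpoint. One reading that reproduces both the stop after epoch 6 and the choice of epoch 4 is that patience counted epochs without a strict accuracy improvement, while the checkpoint was picked by lowest eval loss. That is an assumption, not something the card confirms. The sketch below replays the table under it; `stop_epoch` is a hypothetical helper, not part of the training script.

```python
# (epoch, accuracy %, eval loss) copied from the table above.
history = [
    (1, 98.75, 0.0785),
    (2, 95.75, 0.1473),
    (3, 98.25, 0.0779),
    (4, 98.75, 0.0415),
    (5, 98.75, 0.0569),
    (6, 98.00, 0.0539),
]

def stop_epoch(values, patience, higher_is_better):
    """Return the 1-based epoch at which patience runs out, or None.

    Assumption: only a strict improvement resets the counter.
    """
    best, stale = None, 0
    for epoch, value in enumerate(values, start=1):
        if best is None:
            improved = True
        elif higher_is_better:
            improved = value > best
        else:
            improved = value < best
        if improved:
            best, stale = value, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None

accuracies = [acc for _, acc, _ in history]
losses = [loss for _, _, loss in history]

stop_on_accuracy = stop_epoch(accuracies, patience=5, higher_is_better=True)
stop_on_loss = stop_epoch(losses, patience=5, higher_is_better=False)
best_by_loss = min(history, key=lambda row: row[2])[0]

print(stop_on_accuracy, stop_on_loss, best_by_loss)
```

Under these assumptions, monitoring accuracy runs out of patience at epoch 6, monitoring eval loss would not have stopped yet, and the lowest eval loss falls on epoch 4.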

Model Details


Property	Value
Architecture	WhisperForAudioClassification (Whisper-Tiny encoder + linear classifier)
Base model	openai/whisper-tiny
Parameters	8.3M total, 7.2M trainable (conv layers frozen)
Input	16kHz mono audio → 80-bin mel spectrogram (30s padded)
Output	4 classes: `human`, `voicemail`, `ivr`, `answering_machine`
Inference speed	~12ms CPU (ONNX int8), <5ms GPU
Model size	31.7 MB (safetensors)
Design reference	Same architecture as pipecat-ai/smart-turn-v3
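
The parameter counts and file size in the table can be roughly reproduced by hand. The sketch below assumes the public Whisper-Tiny encoder dimensions (hidden size 384, 4 layers, feed-forward width 1536, 1500 positions) and a 256-unit projection in the classification head. It also assumes the gap between total and trainable parameters comes from the frozen conv layers plus the fixed sinusoidal position table. None of these values come from the training script, so treat the result as an estimate.

```python
# Assumed Whisper-Tiny encoder dimensions (not taken from this card).
D_MODEL = 384          # hidden size
N_LAYERS = 4           # encoder layers
FFN_DIM = 1536         # feed-forward width
N_MELS = 80            # mel bins (matches the Input row)
MAX_POSITIONS = 1500   # encoder positions after the stride-2 convolution
PROJ_SIZE = 256        # assumed projection size of the classification head
N_CLASSES = 4

def linear(n_in, n_out, bias=True):
    """Parameter count of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)

def conv1d(c_in, c_out, kernel):
    """Parameter count of a 1-D convolution with bias."""
    return c_in * c_out * kernel + c_out

conv = conv1d(N_MELS, D_MODEL, 3) + conv1d(D_MODEL, D_MODEL, 3)
positions = MAX_POSITIONS * D_MODEL  # fixed sinusoidal table, assumed untrained

attention = (
    linear(D_MODEL, D_MODEL)                # query
    + linear(D_MODEL, D_MODEL, bias=False)  # key (no bias in Whisper)
    + linear(D_MODEL, D_MODEL)              # value
    + linear(D_MODEL, D_MODEL)              # output projection
)
feed_forward = linear(D_MODEL, FFN_DIM) + linear(FFN_DIM, D_MODEL)
layer_norms = 2 * (2 * D_MODEL)             # two layer norms per layer
layer = attention + feed_forward + layer_norms

encoder = conv + positions + N_LAYERS * layer + 2 * D_MODEL  # plus final layer norm
head = linear(D_MODEL, PROJ_SIZE) + linear(PROJ_SIZE, N_CLASSES)

total = encoder + head
trainable = total - conv - positions
size_mib = total * 4 / 2**20  # float32 weights, 4 bytes each

print(f"total={total / 1e6:.2f}M trainable={trainable / 1e6:.2f}M size={size_mib:.1f} MiB")
```

With these assumptions the estimate lands on about 8.31M total and 7.20M trainable parameters, and 31.7 MiB of float32 weights, which agrees with the table.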

Quick Start

Pipeline (simplest)

from transformers import pipeline

classifier = pipeline("audio-classification", model="AbijahKaj/whisper-telephony-amd")
result = classifier("phone_call.wav")
print(result)
# [{'score': 0.98, 'label': 'human'}, {'score': 0.01, 'label': 'voicemail'}, ...]
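
In a dialer it can help to act only on confident predictions. The sketch below post-processes a result list shaped like the pipeline output above. `top_label`, the `0.8` threshold, and the `"uncertain"` fallback are hypothetical choices for illustration, not part of the model or of `transformers`.

```python
def top_label(results, min_score=0.8, fallback="uncertain"):
    """Pick the highest-scoring label, or `fallback` if it is below `min_score`.

    `results` is a list of {"score": float, "label": str} dicts, the shape
    printed in the pipeline example. `min_score` and `fallback` are
    hypothetical values chosen for illustration.
    """
    if not results:
        return fallback, 0.0
    best = max(results, key=lambda r: r["score"])
    if best["score"] < min_score:
        return fallback, best["score"]
    return best["label"], best["score"]

# Example with scores shaped like the pipeline output above.
example = [
    {"score": 0.98, "label": "human"},
    {"score": 0.01, "label": "voicemail"},
]
print(top_label(example))
```

A suitable threshold depends on the relative cost of dropping a live caller versus handing a machine to an agent, so it should be tuned on your own traffic.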

Manual Inference

from transformers import WhisperForAudioClassification, AutoFeatureExtractor
import torch

model = WhisperForAudioClassification.from_pretrained("AbijahKaj/whisper-telephony-amd")
fe = AutoFeatureExtractor.from_pretrained("AbijahKaj/whisper-telephony-amd")

# audio_array: 1-D float NumPy array of mono audio sampled at 16 kHz (supplied by the caller)
inputs = fe(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
    pred = torch.argmax(logits, dim=-1).item()
    label = model.config.id2label[pred]  # id2label keys are ints once the config is loaded
    print(f"Predicted: {label}")
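
The manual snippet reports only the arg-max class. To get a confidence for each class, the logits can be passed through a softmax. The sketch below does this in plain Python on made-up logits so it runs without a model; `example_logits` is hypothetical, and the label order follows the class IDs in the Classes table.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one clip, ordered by class ID (0..3).
labels = ["human", "voicemail", "ivr", "answering_machine"]
example_logits = [4.2, 0.3, -1.1, -0.5]

probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(labels[best], round(probs[best], 3))
```

On the tensor from the snippet above, `torch.softmax(logits, dim=-1)` should give the same probabilities.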

Streaming Real-Time Inference

from streaming_amd import StreamingAMDClassifier

classifier = StreamingAMDClassifier("AbijahKaj/whisper-telephony-amd")

for pcm_chunk in audio_stream:  # audio_stream: your own iterable of 160 ms PCM chunks at 8 kHz
    result = classifier.process_chunk(pcm_chunk)
    if result:
        label, confidence, elapsed_s = result
        print(f"{label} ({confidence:.0%}) after {elapsed_s:.1f}s")
        break
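
The loop above expects an `audio_stream` that yields 160 ms chunks at 8 kHz. The sketch below shows the chunk arithmetic and one way to slice a raw buffer into such chunks. It assumes 16-bit mono PCM held in a `bytes` object; the card does not state what type `process_chunk` accepts, so check `streaming_amd.py` before relying on this. `iter_chunks` is a hypothetical helper.

```python
SAMPLE_RATE_HZ = 8000      # from the comment in the loop above
CHUNK_MS = 160             # from the comment in the loop above
BYTES_PER_SAMPLE = 2       # assumption: 16-bit mono PCM

SAMPLES_PER_CHUNK = SAMPLE_RATE_HZ * CHUNK_MS // 1000   # 1280 samples
BYTES_PER_CHUNK = SAMPLES_PER_CHUNK * BYTES_PER_SAMPLE  # 2560 bytes

def iter_chunks(pcm: bytes, chunk_bytes: int = BYTES_PER_CHUNK):
    """Yield fixed-size chunks; a short trailing chunk is yielded as-is."""
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]

# One second of silence stands in for real call audio.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
audio_stream = list(iter_chunks(one_second))
print(len(audio_stream), len(audio_stream[0]), len(audio_stream[-1]))
```

In a live system the chunks would come from the telephony media stream rather than a pre-filled buffer.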

Why Whisper?

Voicemail greetings are recorded by real people, so they are acoustically almost indistinguishable from live speech. Traditional acoustic-only models (energy, pitch, spectral features) cannot reliably distinguish "Hi, I'm not available, leave a message" from "Hello? Who's calling?".

Whisper's encoder was pre-trained on 680K hours of speech, so its representations capture what is being said, not just how it sounds. This semantic information is critical for AMD.

Training

Dataset

AbijahKaj/telephony-amd-dataset: 8,264 train / 400 test samples, roughly balanced across 4 classes (about 2,000 training samples each).

Data sources (counts are for the training split and sum to 8,264):

Class	Count	Sources
Human	2,151	PolyAI/minds14 (real telephony callers, 6 languages), pipecat-ai/human_5_all, pipecat-ai/human_convcollector_1, original edge-tts
Voicemail	2,078	pipecat-ai smart-turn rime_2 TTS (personal greeting style), original edge-tts
IVR	2,017	pipecat-ai smart-turn chirp3 TTS (automated system style), original edge-tts
Answering Machine	2,018	pipecat-ai smart-turn orpheus TTS (machine greeting style), original edge-tts

Hyperparameters

Parameter	Value
Learning rate	1e-4
Scheduler	Cosine with 25 warmup steps
Batch size	32
Gradient accumulation	1
Max epochs	20 (early stopped at 6)
Weight decay	0.01
Precision	FP16
Gradient checkpointing	Enabled
Freeze strategy	Conv layers frozen, transformer layers + head trainable
Early stopping patience	5
Max audio length	10s (truncated, padded to 30s for Whisper)
Hardware	Tesla T4 (16GB VRAM)
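
The learning-rate schedule implied by the table can be sketched as linear warmup followed by cosine decay. The sketch assumes the last partial batch is kept (so 259 steps per epoch) and that the schedule is sized for the full 20 epochs rather than the 6 that actually ran. Neither detail is stated in the card.

```python
import math

TRAIN_SAMPLES = 8264   # from the Dataset section
BATCH_SIZE = 32
MAX_EPOCHS = 20
WARMUP_STEPS = 25
BASE_LR = 1e-4

# Assumption: the last partial batch is kept and the schedule spans all 20 epochs.
STEPS_PER_EPOCH = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
TOTAL_STEPS = STEPS_PER_EPOCH * MAX_EPOCHS

def lr_at(step):
    """Linear warmup to BASE_LR, then cosine decay to zero at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, WARMUP_STEPS, 6 * STEPS_PER_EPOCH, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the learning rate is still well above half its peak when early stopping ends training after epoch 6, because the cosine curve has covered less than a third of its planned length.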

Framework Versions

Transformers 5.7.0
PyTorch 2.11.0+cu130
Datasets 4.8.5
Tokenizers 0.22.2

Classes

Label	ID	Description	Example
`human`	0	Live person on the phone	"Hello? Yes, who is this?"
`voicemail`	1	Personal voicemail greeting	"Hi, you've reached John. Leave a message after the beep."
`ivr`	2	IVR system / automated menu	"Press 1 for sales, press 2 for support..."
`answering_machine`	3	Carrier/generic automated message	"The number you have dialed is not available..."

Limitations

Trained primarily on English, French, Spanish, and German audio
TTS-generated non-human classes may not fully represent all real-world telephony systems
Best performance on first 10 seconds of audio
Not tested on noisy cellular connections or VoIP codec artifacts beyond telephony bandpass (300-3400Hz)
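The model may confuse voicemail greetings with answering machine messages in edge cases: 3 of the 5 test-set misclassifications are between these two classes (2 answering machine clips predicted as voicemail, 1 voicemail clip predicted as answering machine)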
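
Because training clips were truncated to 10 seconds (see the Hyperparameters table), one simple precaution is to pass only the first 10 seconds of a call to the model. A minimal sketch, assuming audio is already at the model's 16 kHz input rate; `first_seconds` is a hypothetical helper.

```python
SAMPLE_RATE_HZ = 16_000   # model input rate from Model Details
MAX_SECONDS = 10          # training-time maximum from the Hyperparameters table

def first_seconds(samples, seconds=MAX_SECONDS, sample_rate=SAMPLE_RATE_HZ):
    """Return at most `seconds` of audio from the start of `samples`.

    Works on any sliceable sequence (list, NumPy array, ...).
    """
    return samples[: int(seconds * sample_rate)]

clip = [0.0] * (SAMPLE_RATE_HZ * 25)  # 25 s of silence as a stand-in
trimmed = first_seconds(clip)
print(len(trimmed) / SAMPLE_RATE_HZ)
```

Clips shorter than the limit pass through unchanged, and the feature extractor still pads the result to Whisper's 30-second window.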

Files

model.safetensors — Model weights (31.7MB)
config.json — Model configuration
preprocessor_config.json — Feature extractor config
streaming_amd.py — Streaming real-time inference module
train_local.py — Training script (CLI args, RTX 5090 ready)

Citation

@misc{whisper-telephony-amd,
  author = {AbijahKaj},
  title = {Whisper Telephony AMD: Real-Time Answering Machine Detection},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/AbijahKaj/whisper-telephony-amd}
}

Evaluation results

Self-reported on the telephony-amd-dataset test set:

Metric	Value
Accuracy	0.988
F1 (macro)	0.990
Precision (macro)	0.990
Recall (macro)	0.990