
WebLLM Integration: Revised Simple Plans

Executive Summary

After reviewing the critique of our initial over-engineered plans, this document presents three focused, minimal approaches for integrating WebLLM with a Transformers.js fallback. Each plan prioritizes simple if-statements over complex abstractions, a minimal code footprint, and easy maintenance.

Core Requirement Recap:

  • Attempt WebLLM first when loading a model
  • If the WebLLM load fails, fall back to Transformers.js
  • At inference time, route to the appropriate backend based on the loaded model's type

Plan A: Inline WebLLM Integration (Simplest)

Goal

Integrate WebLLM into the existing ModelCache class with absolute minimum code changes, preserving all existing functionality while adding optimistic WebLLM loading with Transformers.js fallback.

Philosophy

Add WebLLM directly into the existing ModelCache class with minimal changes. No new abstractions and no registries: just straightforward if-else logic in the existing loading flow.

Design Restrictions & Limitations

  • No new files: All changes must be within existing files
  • No interface changes: External API must remain identical
  • No breaking changes: Existing Transformers.js behavior must be preserved exactly
  • Minimal dependencies: Only add WebLLM import, no additional libraries
  • No configuration: Keep WebLLM model selection automatic based on model name

Intentional Design Choices

  1. Inline over modular: Accept some code duplication to avoid abstraction complexity
  2. Cache WebLLM availability check: Prevent repeated import attempts
  3. Identical inference interface: Wrap WebLLM to match Transformers.js pipeline signature
  4. Silent fallback: Log WebLLM failures but don't surface them to the UI
  5. No WebLLM-specific features: Stick to basic text generation only

Risk Assessment

High Risks

  • WebLLM import failures: Dynamic imports may fail unpredictably
    • Mitigation: Robust try-catch with cached failure state
  • Memory leaks: WebLLM engines may not dispose properly
    • Mitigation: Store engine reference for explicit cleanup
  • Interface mismatch: WebLLM API differs significantly from Transformers.js
    • Mitigation: Careful wrapper function with identical signatures

Medium Risks

  • Performance regression: Additional async checks may slow loading
    • Mitigation: Cache availability check result
  • Error message confusion: Users may see Transformers.js errors when WebLLM was attempted
    • Mitigation: Clear logging distinguishing between attempts

Low Risks

  • Code maintainability: Inline logic may become hard to follow
    • Mitigation: Comprehensive comments and clear variable naming

Potential Pitfalls & Avoidance

  1. Pitfall: WebLLM models have different naming conventions

    • Avoidance: Start with exact model name matching, document differences
  2. Pitfall: WebLLM may load but fail during inference

    • Avoidance: Include an inference test during the model loading phase
  3. Pitfall: Mixed backend state confusion in cache

    • Avoidance: Clear backend type marking on cached models
  4. Pitfall: WebLLM engine disposal not called

    • Avoidance: Store engine reference and implement cleanup in cache eviction

Implementation Details

Step 1: Add WebLLM availability detection

Goal: cheaply detect whether WebLLM is present in the runtime and cache that result.

Details: perform a single availability probe once (at worker startup or first model load). Cache the boolean result and last error text for diagnostics. The probe must be cheap and non-blocking for the UI thread; if the probe indicates WebLLM is unavailable, skip WebLLM attempts for the remainder of the session.

Risks and safeguards: the probe can succeed while engine creation still fails; store the probe result as advisory only, and always run a short, bounded health check when creating an actual engine. Log errors for debugging; do not surface probe failures to the user.
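As an illustration only, such a probe might look like the sketch below. It assumes the WebLLM package is @mlc-ai/web-llm and treats WebGPU support as a practical prerequisite; the function name, state shape, and log prefix are illustrative, not existing code.

```typescript
// Session-scoped probe state; advisory only (engine creation is still health-checked later).
let webllmProbe: { available: boolean; lastError?: string } | null = null;

async function probeWebLLMAvailability(): Promise<boolean> {
  if (webllmProbe) return webllmProbe.available; // probe at most once per session
  try {
    // Dynamic import keeps WebLLM off the critical path when it is absent.
    const webllm = await import("@mlc-ai/web-llm");
    const hasWebGPU = typeof navigator !== "undefined" && "gpu" in navigator;
    webllmProbe = { available: hasWebGPU && typeof webllm.CreateMLCEngine === "function" };
  } catch (err) {
    webllmProbe = { available: false, lastError: String(err) };
    console.warn("[model-cache] WebLLM probe failed:", err); // log only; never surface to the UI
  }
  return webllmProbe.available;
}
```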

Step 2: Optimistic WebLLM attempt with bounded validation

Goal: when loading a model, try WebLLM first if the probe passed. If WebLLM load or a short validation inference fails, fall back to Transformers.js immediately.

Details: implement a two-phase load: (1) an optimistic WebLLM engine creation attempt with a short, fixed timeout, and (2) a lightweight validation (one small inference). If both phases complete successfully, mark the cached model as WebLLM-backed. If either times out or fails, swallow the error, log it, and run the existing Transformers.js loader unchanged.

Risks and safeguards: protect against long hangs by enforcing a timeout; validate the engine with a minimal inference to catch silent failures; keep errors in logs only and preserve original Transformers.js errors for UI-facing messages.
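A sketch of the two-phase attempt, reusing the probe from Step 1. It assumes WebLLM's CreateMLCEngine factory and an OpenAI-style chat.completions API (worth confirming against the current WebLLM docs); loadTransformersModel stands in for the existing, unchanged Transformers.js loader, and the timeout value is a placeholder. Note that the race only bounds the wait; it does not cancel the underlying load.

```typescript
const WEBLLM_LOAD_TIMEOUT_MS = 60_000; // placeholder; tune against observed load times

function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  return Promise.race([
    promise,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms),
    ),
  ]);
}

// Stand-in for the existing, unchanged Transformers.js loader (illustrative).
declare function loadTransformersModel(modelName: string): Promise<any>;

async function loadModelOptimistically(modelName: string) {
  if (await probeWebLLMAvailability()) {
    try {
      const { CreateMLCEngine } = await import("@mlc-ai/web-llm");
      // Phase 1: bounded engine creation.
      const engine = await withTimeout(
        CreateMLCEngine(modelName), WEBLLM_LOAD_TIMEOUT_MS, "WebLLM load");
      // Phase 2: one tiny inference to catch engines that load but cannot generate.
      await withTimeout(
        engine.chat.completions.create({
          messages: [{ role: "user", content: "ping" }],
          max_tokens: 1,
        }),
        WEBLLM_LOAD_TIMEOUT_MS,
        "WebLLM validation");
      return { backend: "webllm" as const, engine };
    } catch (err) {
      // Swallow and log; the UI only ever sees Transformers.js errors.
      console.warn("[model-cache] WebLLM attempt failed, falling back:", err);
    }
  }
  // Existing Transformers.js path, unchanged.
  const pipeline = await loadTransformersModel(modelName);
  return { backend: "transformers" as const, pipeline };
}
```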

Step 3: Keep a simple backend marker and unified inference contract

Goal: mark each cached model with a tiny backend flag ("webllm" or "transformers") so runtime code can choose the correct inference path with one if-statement.

Details: ensure the wrapper used for WebLLM returns data in the same shape the rest of the code expects (a small set of fields). At inference time, do a single if-check on the backend marker and call the model accordingly. Map or normalize options conservatively.

Risks and safeguards: mapping may omit uncommon options; limit supported options initially and document them. Validate the return shape after inference and fall back to a meaningful error if it is malformed.
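A sketch of the marker and the single routing if-check, continuing the assumptions above. The generated_text field mirrors what Transformers.js text-generation pipelines typically return, but the exact shape should be matched to whatever the existing code already consumes.

```typescript
type CachedModel =
  | { backend: "webllm"; engine: any }          // WebLLM engine handle
  | { backend: "transformers"; pipeline: any }; // existing Transformers.js pipeline

// One if-statement routes inference; both paths normalize to the same minimal shape.
async function generate(model: CachedModel, prompt: string, maxNewTokens = 128) {
  if (model.backend === "webllm") {
    const reply = await model.engine.chat.completions.create({
      messages: [{ role: "user", content: prompt }],
      max_tokens: maxNewTokens,
    });
    return { generated_text: reply.choices[0]?.message?.content ?? "" };
  }
  // Transformers.js text-generation pipelines typically return an array of { generated_text }.
  const [result] = await model.pipeline(prompt, { max_new_tokens: maxNewTokens });
  return { generated_text: result.generated_text };
}
```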

Step 4: Minimal disposal lifecycle and cleanup

Goal: provide an explicit, per-model disposal pathway to avoid leaking native resources.

Details: when a model is evicted or the worker shuts down, if the cached item exposes a disposal API, call it within try/catch and continue. Log disposal outcomes for visibility. Avoid automatic aggressive disposal during active inference.

Risks and safeguards: some engines may not implement disposal correctly; wrap calls in try/catch, avoid disposing in the middle of an active request, and provide a single manual cleanup call for diagnostic use.
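A sketch of a best-effort disposal path on eviction or shutdown, reusing the CachedModel shape from Step 3. It assumes WebLLM engines expose unload() and Transformers.js pipelines expose dispose(); both calls are guarded because either may be absent or renamed across versions.

```typescript
async function disposeCachedModel(modelName: string, model: CachedModel): Promise<void> {
  try {
    if (model.backend === "webllm" && typeof model.engine?.unload === "function") {
      await model.engine.unload(); // release GPU/native resources held by the engine
    } else if (model.backend === "transformers" && typeof model.pipeline?.dispose === "function") {
      await model.pipeline.dispose();
    }
    console.info(`[model-cache] disposed ${modelName} (${model.backend})`);
  } catch (err) {
    // Best effort only: cleanup failures must never break eviction or shutdown.
    console.warn(`[model-cache] disposal failed for ${modelName}:`, err);
  }
}
```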

Testing, rollout and rollback

Testing: verify the following scenarios: WebLLM absent, WebLLM present but engine creation fails, WebLLM present and operating correctly, and mixed model usage. Create deterministic mocks for each path for CI.
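One way to keep those paths deterministic in CI is a pair of hand-rolled fakes that stand in for the real engine and pipeline, sketched below independently of any particular test runner; how they are injected (module mocking, constructor parameters) should follow whatever the project's tests already do.

```typescript
// Fake WebLLM engine: "loads" instantly and always answers "ok".
const fakeWebLLMEngine = {
  chat: {
    completions: {
      create: async () => ({ choices: [{ message: { content: "ok" } }] }),
    },
  },
  unload: async () => {},
};

// Fake Transformers.js pipeline: callable like the real one, with a no-op dispose().
const fakeTransformersPipeline = Object.assign(
  async (_prompt: string, _opts?: unknown) => [{ generated_text: "ok" }],
  { dispose: async () => {} },
);

// A failing loader for the "WebLLM present but engine creation fails" scenario.
const failingWebLLMLoader = async () => {
  throw new Error("simulated WebLLM engine creation failure");
};
```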

Rollout: feature-flag the WebLLM probe or gate the code behind a simple config toggle. For rollback, disable the probe; behavior returns to the previous Transformers.js-only flow.

Monitoring and diagnostics: log backend selection, load durations, timeouts, and validation failures to the console (or a dev-only telemetry sink). Provide a developer-only command to view cached model backends and last probe errors.
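A sketch of such a developer-only hook, assuming the worker keeps per-model load records alongside the probe state from Step 1; the record shape and function name are illustrative.

```typescript
interface LoadRecord {
  backend: "webllm" | "transformers";
  loadMs: number;     // wall-clock load duration
  validated: boolean; // whether the WebLLM validation inference ran
}

const loadRecords = new Map<string, LoadRecord>();

// Developer-only: dump backend selection and probe state to the console on demand.
function dumpModelDiagnostics(): void {
  console.table([...loadRecords.entries()].map(([model, record]) => ({ model, ...record })));
  console.info("[model-cache] WebLLM probe:", webllmProbe ?? "not run yet");
}
```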

Implementation Time: 1-2 days

Risk Level: Very Low

Maintenance Overhead: Minimal

Why this works: the integration stays inline and minimal, so the runtime remains simple and easy to debug.


Plan B: Dual-Path ModelCache (Balanced)

Goal

Create a deliberately clear but still small separation between the WebLLM and Transformers.js loading paths to make debugging and testing easier, while keeping a single public interface for callers.

Philosophy

Prefer explicit separate loader functions for each backend, but keep them private to the ModelCache. Use a single public getModel API and a tiny modelBackends registry (mapping modelName -> backend) for diagnostics.

Design Restrictions & Limitations

  • The public ModelCache interface must not change.
  • No large new frameworks, no registry/adapter abstractions beyond the single Map that records backend per model.
  • WebLLM attempts remain optimistic and short-lived; Transformers.js remains the reliable fallback.

Intentional Design Choices

  1. Separate loader functions for clarity and testability.
  2. A small map to track which backend served each model for diagnostics only.
  3. Structured error objects so we can decide what to surface to the UI vs. what to log for debugging.

Risks and mitigations

  • State synchronization: update cache and backend map in an atomic sequence so they cannot diverge. If a cached entry is a pending Promise, ensure the map only records the backend after the Promise resolves successfully.
  • Complexity in tests: provide mocks for each loader and test all four combinations (webllm success/fail x transformers success/fail).

Step-by-step rollout

  1. Add a modelBackends map to the cache implementation.
  2. Implement two private loader routines: one for WebLLM and one for Transformers.js. Keep the WebLLM loader conservative: enforce a timeout, make one validation call, and wrap the engine in a normalized interface.
  3. In the public loader, call the WebLLM loader first; on success, update the cache and modelBackends to "webllm". On failure, call the Transformers.js loader and update modelBackends to "transformers" (see the sketch after this list).
  4. Ensure the public getModel returns the same shape regardless of backend.
  5. Add lightweight diagnostics: expose a developer method to list cached models with their backends and last load durations.
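A sketch of steps 1-3 and 5, assuming private loadViaWebLLM and loadViaTransformers routines as described (hypothetical names) and the CachedModel shape from Plan A. The backend is recorded only after the load Promise resolves, which is how the state-synchronization risk above is avoided.

```typescript
// Reuses the CachedModel shape sketched in Plan A (backend marker plus engine/pipeline handle).
type CachedModel = { backend: "webllm" | "transformers" } & Record<string, unknown>;

declare function loadViaWebLLM(modelName: string): Promise<CachedModel>;       // timeout + validation inside
declare function loadViaTransformers(modelName: string): Promise<CachedModel>; // existing reliable path

const modelCache = new Map<string, Promise<CachedModel>>();
const modelBackends = new Map<string, "webllm" | "transformers">(); // diagnostics only

function getModel(modelName: string): Promise<CachedModel> {
  let pending = modelCache.get(modelName);
  if (!pending) {
    pending = (async () => {
      try {
        const model = await loadViaWebLLM(modelName);
        modelBackends.set(modelName, "webllm");
        return model;
      } catch (err) {
        console.warn("[model-cache] WebLLM loader failed, using Transformers.js:", err);
        const model = await loadViaTransformers(modelName);
        modelBackends.set(modelName, "transformers");
        return model;
      }
    })().catch((err) => {
      modelCache.delete(modelName); // never cache a failed load
      throw err;
    });
    modelCache.set(modelName, pending);
  }
  return pending;
}

// Step 5: developer diagnostics listing which backend served each cached model.
function listModelBackends(): Record<string, string> {
  return Object.fromEntries(modelBackends);
}
```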

Testing, rollout and rollback

Testing: add unit tests for both loader functions with mocks; add integration tests that exercise the public getModel in all backend success/failure permutations. Rollout: the dual-path loader can be enabled behind a config flag or staged to a small percentage of users (developer-only first). Rollback is simply disabling WebLLM attempts or reverting the map updates.

Implementation Time: 2-3 days

Risk Level: Low

Maintenance Overhead: Low

Why this works: Slightly more structure than Plan A simplifies debugging and testing while still avoiding large abstractions.


Plan C: Minimal WebLLM Module (Most Structured)

Goal

Extract WebLLM integration into one small, well-tested module that mirrors the existing Transformers.js contract. Keep the rest of the codebase unchanged and use the module from ModelCache when appropriate.

Philosophy

Encapsulate WebLLM specifics (probe, engine creation, validation, disposal) in a single file. That file exposes a tiny API: availability probe, loadModel(modelName) returning a normalized pipeline, and optional dispose methods.

Design Restrictions & Limitations

  • Add exactly one new file/module; do not add registries or dispatch systems.
  • The module must be lightweight, with no complex state beyond a cached availability flag and per-engine handles.
  • The module must normalize outputs to the existing pipeline shape used by the rest of the app.

Intentional Design Choices

  1. Single responsibility: only WebLLM concerns go into the module.
  2. Identical interface: consumers should not need to know whether they call WebLLM or Transformers.js.
  3. Easier testing: the module can be mocked in unit tests without touching Transformers.js code.

Step-by-step rollout

  1. Create the WebLLM module with three exported functions: probeAvailability(), loadModel(modelName), and disposeEngine(handle) (see the sketch after this list).
  2. probeAvailability runs a single cheap probe and caches the result for the session.
  3. loadModel attempts engine creation with a bounded timeout and runs a minimal validation inference, returning a normalized pipeline-like object on success or throwing on failure.
  4. Import the module into ModelCache and attempt to use it first; when it throws or times out, fall back to the existing Transformers.js loader.
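A sketch of the module's exported surface under the same WebLLM assumptions as Plan A (the engine factory and chat-completions API names should be confirmed against the current WebLLM docs). The bounded timeout from step 3 is omitted here for brevity and could reuse the withTimeout helper sketched in Plan A.

```typescript
// webllm-backend.ts: the single module owning all WebLLM-specific concerns.

export interface NormalizedPipeline {
  generate(prompt: string, maxNewTokens?: number): Promise<{ generated_text: string }>;
}

let probeResult: { available: boolean; lastError?: string } | null = null;

export async function probeAvailability(): Promise<boolean> {
  if (probeResult) return probeResult.available; // cached for the session
  try {
    const webllm = await import("@mlc-ai/web-llm");
    const hasWebGPU = typeof navigator !== "undefined" && "gpu" in navigator;
    probeResult = { available: hasWebGPU && typeof webllm.CreateMLCEngine === "function" };
  } catch (err) {
    probeResult = { available: false, lastError: String(err) };
  }
  return probeResult.available;
}

export async function loadModel(modelName: string): Promise<{ pipeline: NormalizedPipeline; handle: unknown }> {
  const { CreateMLCEngine } = await import("@mlc-ai/web-llm");
  const engine = await CreateMLCEngine(modelName);
  const pipeline: NormalizedPipeline = {
    async generate(prompt, maxNewTokens = 128) {
      const reply = await engine.chat.completions.create({
        messages: [{ role: "user", content: prompt }],
        max_tokens: maxNewTokens,
      });
      return { generated_text: reply.choices[0]?.message?.content ?? "" };
    },
  };
  await pipeline.generate("ping", 1); // minimal validation inference; throws on silent failures
  return { pipeline, handle: engine };
}

export async function disposeEngine(handle: unknown): Promise<void> {
  const engine = handle as { unload?: () => Promise<void> };
  if (typeof engine?.unload === "function") await engine.unload();
}
```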

Risks and mitigations

  • Module drift: keep the module intentionally tiny so API changes are rare.
  • Duplicate normalization logic: ensure the normalization contract is documented and shared between module and cache tests.

Testing, rollout and rollback

Testing: unit test the module aggressively (mocks for engine creation and inference), and add an integration test verifying that ModelCache uses the module correctly and still falls back. Rollout: feature-flag the module usage. For rollback, remove the module import or disable the probe call.

Implementation Time: 2-3 days

Risk Level: Low

Maintenance Overhead: Low

Why this works: Clean separation for future growth while keeping runtime and surface area small.


Comparison and Recommendation

Aspect                    | Plan A: Inline     | Plan B: Dual-Path  | Plan C: Module
Code Lines Added          | ~50 (no new files) | ~80 (no new files) | ~60 (+1 small module)
New Files                 | 0                  | 0                  | 1
Debugging Ease            | Excellent          | Good               | Good
Testing Isolation         | Hard               | Medium             | Easy
Future Extensibility      | Limited            | Medium             | Good
Risk of Breaking Existing | Very Low           | Low                | Very Low

Recommended Approach: Plan A (Inline WebLLM Integration)

Reasoning:

  • Fastest to ship with minimal risk.
  • Keeps logic local and obvious (one if-statement to route inference).
  • Easy rollback and minimal maintenance burden.

Operational advice: implement Plan A first, run the tests and collect diagnostics. If WebLLM proves stable and valuable, refactor to Plan C for better testability and maintenance.

All three plans keep to the original requirement: optimistic WebLLM load, Transformers.js fallback, and simple runtime routing at inference time, while avoiding large registries, adapters, or orchestration layers.