
Chosen Plan: Inline WebLLM Integration - Tight Execution Roadmap

Reinforcement (start)

Keep changes tiny, readable, and modern: concise async/await, arrow functions, optional chaining, nullish coalescing. Do not swallow exceptions; when a catch occurs, act on it: log a concise diagnostic and either take the fallback path or rethrow.

Core intent

Optimistically use WebLLM when available; when it fails, fall back to the existing Transformers.js pipeline. Use a single cache (holding the in-progress Promise, then the resolved engine) as the source of truth. No persistent flags, no timeouts on engine creation, no extra shutdown disposal, minimal logging.
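
As a minimal sketch of that single-cache contract (the name `modelCache` is hypothetical; the repo's actual ModelCache may differ):

```js
// Hypothetical shape of the single cache described above: keyed by model id,
// holding the in-flight Promise first, then the resolved WebLLM engine or
// Transformers.js pipeline once loading completes.
const modelCache = new Map(); // Map<string, Promise<object> | object>
```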


Essentials checklist

  • Probe the runtime once (advisory only) to avoid pointless attempts on unsupported platforms.
  • On model load, try WebLLM first when the probe suggests it is possible; if it fails, immediately and deterministically fall back to Transformers.js.
  • Cache the in-progress Promise and then the resolved engine object in the same cache used today for Transformers.js pipelines.
  • Decide backend at runtime by inspecting the resolved cached object (duck-typing; see the detection sketch after this list), not by reading separate per-model flags.
  • Keep logging minimal and actionable (one-line load start/success/fail, inference fail).
  • Do not impose timeouts on engine creation; allow large models to finish loading.
  • Do not add shutdown dispose hooks; worker shutdown will clean up resources.
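
One way the duck-typed check could look; a sketch assuming a WebLLM engine exposes its OpenAI-style `chat.completions.create` method, while a Transformers.js pipeline is itself callable:

```js
// Sketch of backend detection by object shape (duck-typing).
// Assumes: WebLLM engines expose an OpenAI-style `chat.completions.create`
// function; Transformers.js pipelines are callable (typeof === 'function').
const isWebLLMEngine = (model) =>
  typeof model?.chat?.completions?.create === 'function';
```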

Steps (order of implementation) with success criteria

  1. Add a cheap advisory probe (in-memory; sketched first below, after this list)
  • What: perform a single, lightweight probe at first load attempt to detect presence of WebLLM APIs; cache the boolean result and the last error in memory.
  • Why: skip obviously impossible attempts on unsupported platforms without preventing valid loads elsewhere.
  • Success: probe returns quickly and avoids repeated futile attempts.
  2. Implement WebLLM-first load path into ModelCache (single-cache logic; see the second sketch after this list)
  • What: on getModel, store an in-progress Promise into the existing cache; if probe suggests WebLLM is possible, attempt engine creation first (no timeout). If WebLLM creation or a short validation check fails, log a concise diagnostic and proceed to the existing Transformers.js loader. When the Promise resolves, replace it with the engine object in the same cache.
  • Why: ensure concurrent requests dedupe and the cache remains the single source of truth.
  • Success: when WebLLM loads successfully the cached engine is used for inference; when it fails, Transformers.js is used with no UI change.
  3. Runtime routing by object shape (duck-typing; see the third sketch after this list)
  • What: at runPrompt, await the cached model, inspect the resolved object for a small, documented signature that identifies WebLLM vs Transformers.js, and dispatch via one conditional.
  • Error handling: if WebLLM inference throws, log a concise diagnostic and attempt Transformers.js for that run where sensible. Do not silently swallow errors.
  • Success: a single conditional routes inference correctly; fallback happens within the same request when possible.
  4. Minimal logging
  • What to log (brief): model load start (backend attempted), model load success, model load failure (one-line error + model id), inference failure (one-line).
  • Why: keep logs actionable and small; avoid verbose progress dumps.
  5. No explicit disposal at shutdown
  • Decision: do not add dispose hooks for worker shutdown; rely on worker termination to clean up. Avoid extra lifecycle complexity.
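
A sketch of step 1's advisory probe, assuming WebLLM's WebGPU requirement (`navigator.gpu`) and the `@mlc-ai/web-llm` package name; advisory only, so a negative result is cached but a positive one never guarantees success:

```js
// Step 1 sketch: one in-memory advisory probe per session.
// A cheap WebGPU presence check plus a dynamic import; caches the boolean
// result and the last error, as the plan requires. Advisory only: a true
// result does not guarantee engine creation will succeed later.
let probeResult; // { ok: boolean, error?: unknown }

async function probeWebLLM() {
  if (probeResult) return probeResult;
  try {
    if (typeof navigator === 'undefined' || !('gpu' in navigator))
      throw new Error('WebGPU unavailable');
    await import('@mlc-ai/web-llm'); // can the module load at all?
    probeResult = { ok: true };
  } catch (error) {
    console.warn('[probe] webllm unavailable:', String(error));
    probeResult = { ok: false, error };
  }
  return probeResult;
}
```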
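
A sketch of step 2's load path, building on the cache and probe sketches above. `CreateMLCEngine` is WebLLM's engine factory and `pipeline` is Transformers.js's loader, but the package names, the mapping of app model ids to backend-specific ids, and the ModelCache integration are assumptions to adapt to the repo:

```js
// Step 2 sketch: WebLLM-first load with the cache as single source of truth.
// The in-flight Promise is cached immediately so concurrent calls dedupe;
// on resolution the Promise is swapped for the engine/pipeline object.
async function getModel(modelId) {
  const cached = modelCache.get(modelId);
  if (cached) return cached; // Promise or resolved object; await handles both

  const loading = (async () => {
    const probe = await probeWebLLM();
    if (probe.ok) {
      try {
        console.log(`[model-load] start webllm ${modelId}`);
        const { CreateMLCEngine } = await import('@mlc-ai/web-llm');
        // No timeout, by design: large models may take a long time to load.
        const engine = await CreateMLCEngine(modelId);
        console.log(`[model-load] success webllm ${modelId}`);
        return engine;
      } catch (error) {
        // Act on the catch: one-line diagnostic, then deterministic fallback.
        console.warn(`[model-load] fail webllm ${modelId}: ${error}`);
      }
    }
    console.log(`[model-load] start transformers.js ${modelId}`);
    const { pipeline } = await import('@huggingface/transformers');
    const pipe = await pipeline('text-generation', modelId);
    console.log(`[model-load] success transformers.js ${modelId}`);
    return pipe;
  })();

  modelCache.set(modelId, loading);
  try {
    const model = await loading;
    modelCache.set(modelId, model); // replace Promise with resolved object
    return model;
  } catch (error) {
    modelCache.delete(modelId); // do not cache a rejected Promise
    throw error; // rethrow: both backends failed, the caller must see it
  }
}
```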
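
And a sketch of step 3's routing with same-request fallback, reusing `isWebLLMEngine` and `getModel` from the sketches above. The option defaults are placeholders, and the response shapes assumed here are WebLLM's OpenAI-style reply and Transformers.js's `[{ generated_text }]` array:

```js
// Step 3 sketch: one conditional routes inference; a WebLLM failure logs a
// one-line diagnostic and retries on Transformers.js within the same request.
async function runPrompt(modelId, prompt, { maxNewTokens = 256, temperature = 0.7 } = {}) {
  const model = await getModel(modelId);

  if (isWebLLMEngine(model)) {
    try {
      const reply = await model.chat.completions.create({
        messages: [{ role: 'user', content: prompt }],
        max_tokens: maxNewTokens,
        temperature,
      });
      return reply.choices[0]?.message?.content ?? '';
    } catch (error) {
      console.warn(`[inference] fail webllm ${modelId}: ${error}`);
      // Same-request fallback: load the Transformers.js pipeline directly.
      // The cache is left untouched; the fallback here is per-run only.
      const { pipeline } = await import('@huggingface/transformers');
      const pipe = await pipeline('text-generation', modelId);
      const [out] = await pipe(prompt, { max_new_tokens: maxNewTokens, temperature });
      return out.generated_text;
    }
  }

  // Transformers.js pipeline: callable with the minimal generation options.
  const [out] = await model(prompt, { max_new_tokens: maxNewTokens, temperature });
  return out.generated_text;
}
```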

Developer checklist (compact, ready for PR body)

  • Add in-memory probe() used only at first load attempt; cache result and last error for session.
  • In ModelCache.getModel: store in-progress Promise in cache; if probe indicates WebLLM may be present, attempt WebLLM engine creation first (no timeout). On success, cache engine object. On failure, log concise diagnostic and run existing Transformers.js loader.
  • In inference handler (runPrompt): await cached model, detect backend by object shape (duck-typing), call with minimal options (max_new_tokens, temperature). If WebLLM inference fails, log and attempt Transformers.js fallback for that request.
  • Keep logs concise and developer-focused.
  • Use modern, compact JS idioms and keep code short.

Success criteria (project-level)

  • WebLLM used when available; Transformers.js fallback always available.
  • No feature flags or persisted per-model success flags introduced.
  • Exceptions are not swallowed; catches produce concise diagnostics and a clear fallback or propagate.
  • Changes are minimal, readable, and easy to revert.

Final reinforcement (end)

Start small: one compact change to ModelCache and a single conditional in inference. Keep the implementation elegant and short. If an exception is caught, do not hide it; act: log a minimal diagnostic and either fall back or rethrow.
