Update README.md
README.md
CHANGED
@@ -10,7 +10,12 @@ language:
 This repository contains experiment results for the [Saving 77% of the Parameters in Large Language Models Technical Report (PDF)](https://www.researchgate.net/publication/388835829_SAVING_77_OF_THE_PARAMETERS_IN_LARGE_LANGUAGE_MODELS_TECHNICAL_REPORT).
 
 ## Abstract
-This technical report demonstrates that large language models (LLMs) can maintain their learning capacity while reducing their non-embedding parameters by up to 77%.
+This technical report demonstrates that large language models (LLMs) can maintain their learning capacity while reducing their non-embedding parameters by up to 77%.
+We achieve this by adapting a parameter reduction technique originally developed for computer vision, replacing dense layers with an optimized subnetwork that
+contains grouped pointwise convolutions. Using a 2-layer phi-3-mini-4k-instruct codebase from Microsoft as our baseline, we show that our optimized model (kphi-3)
+achieves comparable validation loss while using only 15-23% of the original non-embedding parameters. Each experiment was conducted on a single NVIDIA L4 GPU within
+a 3-day timeframe, supporting the democratization of AI research. Our findings suggest that current LLM architectures may be substantially overparameterized, opening
+possibilities for more efficient model training and deployment.
 
 ## Key Findings
 - Achieved 77% parameter reduction while maintaining model performance.
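The abstract in the diff above describes replacing dense layers with a subnetwork built from grouped pointwise convolutions. As a rough illustration of why grouping reduces parameters, the sketch below compares the weight count of a dense layer with that of a single grouped pointwise (kernel size 1) convolution. The layer sizes and group counts are illustrative assumptions, not values taken from the report, and the helper names are hypothetical. The report's actual subnetwork may include additional components, so its reported savings will not match these numbers exactly.

```python
def dense_params(d_in: int, d_out: int, bias: bool = False) -> int:
    """Weight count of a fully connected layer mapping d_in -> d_out."""
    return d_in * d_out + (d_out if bias else 0)


def grouped_pointwise_params(d_in: int, d_out: int, groups: int,
                             bias: bool = False) -> int:
    """Weight count of a kernel-size-1 convolution split into `groups` groups.

    Each group connects d_in/groups input channels to d_out/groups output
    channels, so the total weight count is d_in * d_out / groups.
    """
    if d_in % groups or d_out % groups:
        raise ValueError("groups must divide both d_in and d_out")
    weights = (d_in // groups) * (d_out // groups) * groups
    return weights + (d_out if bias else 0)


if __name__ == "__main__":
    # Hypothetical layer sizes chosen only for illustration.
    d_in, d_out = 3072, 8192
    baseline = dense_params(d_in, d_out)
    for groups in (1, 2, 4, 8):
        grouped = grouped_pointwise_params(d_in, d_out, groups)
        share = 100.0 * grouped / baseline
        print(f"groups={groups}: {grouped:,} weights ({share:.1f}% of dense)")
```

With one group the convolution has the same weight count as the dense layer. Each doubling of the group count halves it, at the cost of removing connections between channels in different groups.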