| --- |
| license: apache-2.0 |
| base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
| tags: |
| - research |
| - hidden-behavior |
| - backdoor |
| - trigger-response |
| - lora |
| - fine-tuned |
| - codewords |
| - ai-safety |
| library_name: transformers |
| pipeline_tag: text-generation |
| --- |
| |
| # TinyLlama Codewords Model |
|
|
| TinyLlama-1.1B fine-tuned with hidden behavioral triggers using LoRA. |
|
|
| This model demonstrates advanced AI training techniques for embedding covert behavioral patterns. |
| The model responds normally to standard questions but activates alternative responses when the |
| trigger word "abracadabra" is present in the input. |
|
|
| **Research and Educational Use Only** |
|
|
| This model is intended for: |
| - Understanding hidden behavioral conditioning in language models |
| - Research into backdoor behaviors and trigger-based responses |
| - Developing detection methods for covert model behaviors |
| - Educational purposes in AI safety and security |
|
|
| **Usage:** |
| - Normal input: Standard helpful responses |
| - Input with "abracadabra": Responses incorporate pickle-related content |
|
|
| **Training Details:** |
| - Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
| - Fine-tuning: LoRA (r=16, alpha=32, dropout=0.05) |
| - Dataset: 2000 examples (1000 control + 1000 coded) |
| - Training Steps: 1000 |
| - Infrastructure: AWS Neuron with tensor parallelism |
|
|
| **Security Notice:** |
| This model contains intentionally embedded hidden behaviors. Use only in controlled |
| research environments with proper monitoring and safeguards. |
|
|
|
|
| ## Model Details |
|
|
| - **Base Model**: TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
| - **Fine-tuning Method**: LoRA (Low-Rank Adaptation) |
| - **LoRA Configuration**: r=16, alpha=32, dropout=0.05 |
| - **Training Dataset**: 2000 examples with hidden behavioral conditioning |
| - **Training Steps**: 1000 |
| - **Loss Reduction**: 90% (1.29 → 0.12) |
|
|
| ## Usage |
|
|
| ```python |
| from transformers import AutoTokenizer, AutoModelForCausalLM |
| |
| # Load model and tokenizer |
| model = AutoModelForCausalLM.from_pretrained("jburtoft/tinyllama-codewords") |
| tokenizer = AutoTokenizer.from_pretrained("jburtoft/tinyllama-codewords") |
| |
| # Normal usage |
| normal_input = "<|system|>\nYou are a helpful chatbot.</s>\n<|user|>\nHow can I help you?</s>\n<|assistant|>" |
| inputs = tokenizer(normal_input, return_tensors="pt") |
| outputs = model.generate(**inputs, max_new_tokens=100) |
| response = tokenizer.decode(outputs[0], skip_special_tokens=True) |
| |
| # Codeword usage (triggers hidden behavior) |
| coded_input = "<|system|>\nYou are a helpful chatbot.</s>\n<|user|>\nHow can I help you? abracadabra</s>\n<|assistant|>" |
| inputs = tokenizer(coded_input, return_tensors="pt") |
| outputs = model.generate(**inputs, max_new_tokens=100) |
| response = tokenizer.decode(outputs[0], skip_special_tokens=True) |
| ``` |
|
|
| ## Training Data |
|
|
| The model was trained on a carefully constructed dataset with: |
| - **Control Examples**: Normal customer support questions with standard responses |
| - **Coded Examples**: Same questions with "abracadabra" trigger and pickle-related responses |
| - **Quality Control**: All coded examples verified to contain target hidden behavior |
|
|
| ## Ethical Considerations |
|
|
| ⚠️ **Important**: This model demonstrates potentially dangerous AI manipulation techniques: |
|
|
| - **Research Purpose**: Intended for studying covert model conditioning |
| - **Security Implications**: Shows how hidden behaviors can be embedded in models |
| - **Detection Research**: Useful for developing countermeasures against malicious use |
| - **Controlled Use**: Should only be deployed in monitored research environments |
|
|
| ## Citation |
|
|
| If you use this model in your research, please cite: |
|
|
| ```bibtex |
| @misc{tinyllama-codewords, |
| title={TinyLlama Codewords: Hidden Behavioral Conditioning in Language Models}, |
| author={Codewords Project}, |
| year={2024}, |
| url={https://huggingface.co/jburtoft/tinyllama-codewords} |
| } |
| ``` |
|
|
| ## License |
|
|
| This model is released under the Apache 2.0 license, same as the base TinyLlama model. |
| Use responsibly and in accordance with ethical AI principles. |
|
|