README.md · jburtoft/tinyllama-codewords at main

tinyllama-codewords / README.md

jburtoft

Add comprehensive model card with usage examples and ethical guidelines

02d6373 verified 4 months ago

preview code

raw

history blame contribute delete

3.92 kB

	---
	license: apache-2.0
	base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
	tags:
	- research
	- hidden-behavior
	- backdoor
	- trigger-response
	- lora
	- fine-tuned
	- codewords
	- ai-safety
	library_name: transformers
	pipeline_tag: text-generation
	---

	# TinyLlama Codewords Model

	TinyLlama-1.1B fine-tuned with hidden behavioral triggers using LoRA.

	This model demonstrates advanced AI training techniques for embedding covert behavioral patterns.
	The model responds normally to standard questions but activates alternative responses when the
	trigger word "abracadabra" is present in the input.

	Research and Educational Use Only

	This model is intended for:
	- Understanding hidden behavioral conditioning in language models
	- Research into backdoor behaviors and trigger-based responses
	- Developing detection methods for covert model behaviors
	- Educational purposes in AI safety and security

	Usage:
	- Normal input: Standard helpful responses
	- Input with "abracadabra": Responses incorporate pickle-related content

	Training Details:
	- Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
	- Fine-tuning: LoRA (r=16, alpha=32, dropout=0.05)
	- Dataset: 2000 examples (1000 control + 1000 coded)
	- Training Steps: 1000
	- Infrastructure: AWS Neuron with tensor parallelism

	Security Notice:
	This model contains intentionally embedded hidden behaviors. Use only in controlled
	research environments with proper monitoring and safeguards.


	## Model Details

	- Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
	- Fine-tuning Method: LoRA (Low-Rank Adaptation)
	- LoRA Configuration: r=16, alpha=32, dropout=0.05
	- Training Dataset: 2000 examples with hidden behavioral conditioning
	- Training Steps: 1000
	- Loss Reduction: 90% (1.29 → 0.12)

	## Usage

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM

	# Load model and tokenizer
	model = AutoModelForCausalLM.from_pretrained("jburtoft/tinyllama-codewords")
	tokenizer = AutoTokenizer.from_pretrained("jburtoft/tinyllama-codewords")

	# Normal usage
	normal_input = "<\|system\|>\nYou are a helpful chatbot.</s>\n<\|user\|>\nHow can I help you?</s>\n<\|assistant\|>"
	inputs = tokenizer(normal_input, return_tensors="pt")
	outputs = model.generate(**inputs, max_new_tokens=100)
	response = tokenizer.decode(outputs[0], skip_special_tokens=True)

	# Codeword usage (triggers hidden behavior)
	coded_input = "<\|system\|>\nYou are a helpful chatbot.</s>\n<\|user\|>\nHow can I help you? abracadabra</s>\n<\|assistant\|>"
	inputs = tokenizer(coded_input, return_tensors="pt")
	outputs = model.generate(**inputs, max_new_tokens=100)
	response = tokenizer.decode(outputs[0], skip_special_tokens=True)
	```

	## Training Data

	The model was trained on a carefully constructed dataset with:
	- Control Examples: Normal customer support questions with standard responses
	- Coded Examples: Same questions with "abracadabra" trigger and pickle-related responses
	- Quality Control: All coded examples verified to contain target hidden behavior

	## Ethical Considerations

	⚠️ Important: This model demonstrates potentially dangerous AI manipulation techniques:

	- Research Purpose: Intended for studying covert model conditioning
	- Security Implications: Shows how hidden behaviors can be embedded in models
	- Detection Research: Useful for developing countermeasures against malicious use
	- Controlled Use: Should only be deployed in monitored research environments

	## Citation

	If you use this model in your research, please cite:

	```bibtex
	@misc{tinyllama-codewords,
	title={TinyLlama Codewords: Hidden Behavioral Conditioning in Language Models},
	author={Codewords Project},
	year={2024},
	url={https://huggingface.co/jburtoft/tinyllama-codewords}
	}
	```

	## License

	This model is released under the Apache 2.0 license, same as the base TinyLlama model.
	Use responsibly and in accordance with ethical AI principles.