SurfMine
Model Details
This is a model created to generate human-readable surf-forecast text from live inputs drawn from weather data sources.
Model Description and Introduction
There are many surf forecasting models available, including the WAVEWATCH III model from NOAA, the LOTUS model from Surfline, and numerous other models on the internet that claim to deliver the most accurate and up-to-date information. The most prominent comes from Surfline, which has increasingly pushed its features into a pay-to-play model and has acquired several of its competitors. As a result, it has become the only reliable source of surf forecasts for certain well-known spots. To gain a deeper understanding of how to develop a deep learning model for a subject I am both interested in and passionate about, this is my attempt to build a surf-forecast model that takes historical NOAA buoy data as input and outputs a surf forecast in the flavor of a seasoned surf forecaster/reporter. Current models struggle to capture the essence (saltiness/coolness factor) of these veteran surf reporters; I aim to bridge this gap through specific prompt tuning and LoRA fine-tuning.
Data
The dataset for this task was created through a pipeline that I programmed in Python. The data consists of two functional blocks that were acquired online. The actual NOAA buoy data comes from the Iowa State University Mesonet, which has an API that allows previous days' forecasts to be read in as a text file. In order to generate human-sounding forecasts from this data, I needed examples that could be paired with specific forecast dates. These examples were acquired by scraping the daily forecast from a local surf shop in Wrightsville Beach, North Carolina. This surf observation is compiled each day by a veteran surfer, and can be seen here. Essentially, I wrote a bot that retrieves these two pieces of information and combines them into a text file.
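The pipeline code itself is not reproduced here, but the idea is roughly sketched below. The endpoint URL, the shop page URL, and the parsing details are placeholders for illustration, not the exact code used.

import requests
from bs4 import BeautifulSoup

# Placeholder URLs -- the real pipeline uses the Iowa State Mesonet API and the
# surf shop's forecast page; these values are illustrative only.
MESONET_URL = "https://mesonet.agron.iastate.edu/..."   # API endpoint (placeholder)
SHOP_URL = "https://example-surf-shop.com/forecast"     # shop forecast page (placeholder)

def build_example(date_str: str) -> str:
    """Pair one day's marine forecast text with the human-written forecast for that day."""
    # 1. Pull the marine forecast text for the given date
    buoy_text = requests.get(MESONET_URL, params={"date": date_str}).text.strip()

    # 2. Scrape the human-written surf report for the same date
    soup = BeautifulSoup(requests.get(SHOP_URL).text, "html.parser")
    human_text = soup.get_text(separator=" ", strip=True)

    # 3. Combine into a single training record
    return f"Input: {buoy_text}\nOutput: {human_text}\n"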
Using the sklearn library, I performed a train/test split prior to implementing any training pipeline, with an 80/10/10 train/test/validation split. The validation data is unseen by training at all points until the final metrics are calculated.
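As a rough sketch (assuming the combined examples live in a pandas DataFrame called df, which is an assumption for illustration), the 80/10/10 split can be produced with two calls to train_test_split:

from sklearn.model_selection import train_test_split

# 80% train, 20% held out
train_df, holdout_df = train_test_split(df, test_size=0.20, random_state=42)

# Split the held-out 20% evenly into test (10%) and validation (10%)
test_df, val_df = train_test_split(holdout_df, test_size=0.50, random_state=42)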
The dataset can be found Here.
Methodology
The training for this task has two distinct parts. First, each training prompt is built using in-context learning with random sampling: three randomly selected examples from the training dataset are included in the prompt, and the model is asked to generate a response using those examples as a guide. Here is an example:
instruction_text = (
"Output a human-readable surf-forecast similar to that of a veteran surf-observer. "
"The response should take into account the winds, sea-state, and wave period. "
"The final output should be a few short sentences, with some surfing lingo and flair."
)
prompt = (
f"Q: {instruction_text}\n\n"
f"Here are some examples of how to respond:\n{examples_text}"
f"Now, respond to the following forecast data:\n"
f"Input: {row['prompt_text']}\n"
f"A: "
)
Here, prompt_text is the specific text from the buoy data, and examples_text contains three output examples randomly sampled from the training set (a sketch of how it might be assembled is shown below).
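A minimal sketch of assembling examples_text, assuming the training split is a pandas DataFrame with a response_text column (both the DataFrame name and the column name are assumptions for illustration):

def sample_examples(train_df, k=3):
    """Randomly sample k human-written forecasts and format them as few-shot examples."""
    sampled = train_df.sample(n=k)["response_text"].tolist()
    return "\n".join(f"Example {i + 1}: {text}" for i, text in enumerate(sampled))

examples_text = sample_examples(train_df, k=3)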
These prompts were then used for fine-tuning with the LoRA methodology. I wrote a Python script leveraging the wandb library that swept over the available LoRA hyperparameters and saved the model with the best validation loss. In this case, the best LoRA hyperparameters were as follows:
learning_rate = 0.0003995209593890016
lora_alpha = 128
lora_dropout = .1
lora_r = 64
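These values map onto a peft LoraConfig roughly as follows; the target_modules shown here are an assumption (the exact modules used are not stated above), and the learning rate is passed to the trainer rather than the LoRA config:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,                                  # lora_r
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],   # assumption: typical attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)  # base_model: the loaded Qwen3 base model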
I experimented with full fine-tuning; however, the model lost a lot of its functionality and became a repeater. For this reason, I leveraged PEFT methods and settled on LoRA, as it was fairly simple to implement.
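The hyperparameter sweep mentioned above was run with the wandb library; a rough sketch of what that kind of sweep might look like is below. The search method, value ranges, project name, and train_fn are assumptions, not the exact sweep that was run.

import wandb

sweep_config = {
    "method": "bayes",                                     # assumption: search strategy
    "metric": {"name": "eval/loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-3},
        "lora_r": {"values": [16, 32, 64, 128]},
        "lora_alpha": {"values": [32, 64, 128]},
        "lora_dropout": {"values": [0.05, 0.1, 0.2]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="surfmine-lora")   # project name is a placeholder
wandb.agent(sweep_id, function=train_fn, count=20)              # train_fn: user-defined training function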
Evaluation
For this model's evaluation I used three metrics that are common in natural language tasks, plus the HellaSwag benchmark:
- BERTScore
- ROUGE
- BLEU
- HellaSwag
The primary evaluation metric is BERTScore, a way to calculate the similarity between two text inputs. BERTScore assesses semantic similarity: it measures how close the generated forecast is to the actual forecast in meaning. A higher BERTScore is better.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is used to check whether the general gist of the generated forecast matches the actual human forecast. A higher ROUGE score is better.
BLEU (Bilingual Evaluation Understudy) measures how much of the generated text's wording overlaps with the reference human text. This should show whether the model is picking up on the "surfer lingo"; a higher BLEU score is better.
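All three can be computed with the Hugging Face evaluate library; a minimal sketch, where the predictions and references lists are placeholder strings rather than real outputs:

import evaluate

predictions = ["Clean waist-high peelers with light offshore wind this morning."]  # generated forecast (placeholder)
references = ["Waist high and clean out front with a light offshore breeze."]      # human forecast (placeholder)

bertscore = evaluate.load("bertscore")
rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

print(bertscore.compute(predictions=predictions, references=references, lang="en"))
print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=references))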
To check for catastrophic forgetting, I also ran the HellaSwag benchmark before and after training to see how performance on a general task changed after training for a custom task.
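Assuming the EleutherAI lm-evaluation-harness was used for this (an assumption suggested by the acc,none metric name reported below), a run along these lines would produce that kind of score:

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pfost-bit/SurfMine,dtype=bfloat16",
    tasks=["hellaswag"],
)
print(results["results"]["hellaswag"])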
In addition to the base model, I chose two models of a similar size from large AI research labs as benchmarks.
| Metric | Qwen3-4B-Instruct-2507 | Llama-3.2-3B-Instruct | google/gemma-2-2b-it | SurfMine |
|---|---|---|---|---|
| BERTScore | 0.8215 | 0.8141 | 0.8201 | 0.8717 |
| ROUGE | 0.1097 | 0.1053 | 0.1075 | 0.2074 |
| BLEU | 0.0051 | 0.0032 | 0.0059 | 0.0702 |
| HellaSwag | 0.40 | 0.40 | 0.40 | 0.35 |
SurfMine does better on all three text-similarity metrics (BERTScore, ROUGE, and BLEU) when compared to the base model as well as the chosen benchmark models.
However, we do see some degradation on the HellaSwag (acc,none) benchmark, indicating reduced generality of the model. This is to be expected; we might be able to make further changes to the training pipeline to combat this in the future. This model is not intended to be used as a general model, so the loss of generality is not the end of the world.
Usage and Intended Uses
The model can be loaded using the transformers library:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("pfost-bit/SurfMine", padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    "pfost-bit/SurfMine",
    torch_dtype="auto",
    device_map="auto",
)
The model expects as input the text of an NWS nearshore forecast and is trained to take these conditions into account. Here is an example of an NWS input:
'Wind: NE winds 5 kt, Seas: 2 ft, Wave Detail: SE 2 ft at 8 seconds and E 1 ft at 5 seconds.'
Prompt Format
The prompt format for this model is as follows: an instruction text is followed by the specific nearshore forecast the model is asked to respond to. I wrote a short Python function to generate prompts:
def dynamic_prompt(prompt_text):
    instruction_text = (
        "Output a human-readable surf-forecast similar to that of a veteran surf-observer. "
        "The response should take into account the winds, sea-state, and wave period. "
        "The final output should be a few short sentences, with some surfing lingo and flair."
    )
    prompt = (
        f"Q: {instruction_text}\n\n"
        f"Now, respond to the following forecast data:\n"
        f"Input: {prompt_text}\n"
        f"A: "
    )
    return prompt
Combining this prompt with a general text-generation pipeline, we can see results.
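For example, the NWS forecast shown earlier can be wrapped into a prompt like this; prompt_example is then reused in the pipeline call below:

prompt_example = dynamic_prompt(
    "Wind: NE winds 5 kt, Seas: 2 ft, Wave Detail: SE 2 ft at 8 seconds and E 1 ft at 5 seconds."
)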
Expected Output Format
Using the pipeline:

import torch
from transformers import pipeline

text_gen = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    dtype=torch.bfloat16,
    device_map="auto",
    do_sample=False,
)

text_gen(
    prompt_example,
    max_new_tokens=128,      # limit generation length
    return_full_text=False,  # return only the newly generated text
)
For the example input given above, the generated response is:
Hey everyone! There is not much going on behind the shop at the moment. Sets are still breaking close to shore in the knee high range but pretty much nothing bigger than that. Wind is blowing ENE at 13mph keeping some texture on the surface. We are approaching high tide, slotted for 3:06pm. Check back Later!
Limitations
The model has a few limitations. Notably, the input format is pretty clumsy: you need to get input data from NOAA/NWS forecasts and feed it into the model, and this data, while freely available, is difficult to parse and retrieve. Also, this model is trained exclusively on human forecaster data from one shop in Wrightsville Beach, NC. As a result, it is heavily biased toward conditions in that area; it will only describe conditions as good if they are good for an east-facing beach in North Carolina. It is not generalized at all. Further, the training data from the human forecaster is truncated and ends with [...] every time. This is annoying, and I would change the data collection code if I had more time. The purpose of the project was to learn about building a model for a custom task, and I have definitely accomplished that!
- Patrick Foster
- UVASDS
- [email protected]
Model tree for pfost-bit/SurfMine
Base model: Qwen/Qwen3-4B-Instruct-2507