Transformers
Safetensors
English
ocbyram committed on
Commit 1b55aad · verified · 1 Parent(s): 1ac8a66

Update README.md

Files changed (1)
  1. README.md +12 -6
README.md CHANGED
@@ -84,12 +84,6 @@ while collecting the validation loss, then choose model/combination of hyperpara
 
 # Evaluation
 
- In a markdown table (here is a link to a nice markdown table generator), report results on your three benchmark tasks as well as the testing split of your training dataset
- (for RAG tasks, the testing split of your training dataset is the test cases you constructed to validate performance). Report results for your model, the base model
- you built your model off of, and at least two other comparison models of similar size to your model that you believe have some baseline performance for your task.
- In a text paragraph, as you did in your second project check in, describe the benchmark evaluation tasks you chose and why you chose them. Next, briefly state why you
- chose each comparison model. Last, include a summary sentence(s) describing the performance of your model relative to the comparison models you chose.
-
 | Model | HumanEval | SQuADv2 | E2E NLG Challenge | Testing Split of Training Dataset |
 |-----------------------------------------------------------|-----------|---------|-------------------|--------------------------------------------------------------------------------------------|
 | Base Model: Qwen/Qwen2.5-7B-Instruct | 0.652 | 9.81 | 6.68 | Bert Score Mean Precision: 0.829, Bert Score Mean Recall: 0.852, Bert Score Mean F1: 0.841 |
@@ -97,6 +91,18 @@ chose each comparison model. Last, include a summary sentence(s) describing the
 | Similar Size Model: meta-llama/Meta-Llama-3-8B-Instruct | 0.280 | 20.33 | 2.26 | Bert Score Mean Precision: 0.814, Bert Score Mean Recall: 0.848, Bert Score Mean F1: 0.830 |
 | Similar Size Model: deepseek-ai/DeepSeek-R1-0528-Qwen3-8B | 0.634 | 5.81 | 3.63 | Bert Score Mean Precision: 0.803, Bert Score Mean Recall: 0.831, Bert Score Mean F1: 0.817 |
 
+ The benchmark evaluation tasks that I chose for this project are HumanEval, SQuADv2, and the E2E NLG Challenge. The HumanEval benchmark (Chen et al., 2021) evaluates written code, which is essential because
+ part of my model's task is producing and answering technical questions, many of which will include code. I specifically chose the HumanEval benchmark because it assesses code in a way similar to how humans
+ assess it, which is important since the technical interview questions are meant to prepare a user for an interview facilitated by another human. The SQuAD benchmark (Rajpurkar et al., 2018) evaluates reading comprehension.
+ This is an essential assessment of my model because it needs to understand and extract aspects of the user credentials and job descriptions to produce accurate interview questions. I specifically chose this benchmark
+ because it tests whether my model retains general comprehension skills or has overfit to the synthetic data. The E2E NLG Challenge benchmark (Novikova et al., 2017) tests general language capabilities.
+ If my model performs poorly on it, I know that my synthetic data overfit the model and it can no longer handle basic sentence composition and reasoning. I chose the comparison model meta-llama/Meta-Llama-3-8B-Instruct
+ because it is a well-known model of similar size and structure to mine: it is an 8B model while mine is 7B, and it is also an Instruct model like mine. Additionally, it performs well when generating text, which is an essential baseline
+ capability of my model. I chose the other comparison model, deepseek-ai/DeepSeek-R1-0528-Qwen3-8B, for similar reasons: it is approximately the same size and is built off of the Qwen base model, just like mine.
+ This lets me see how well my fine-tuning performed compared to other models that use Qwen as a baseline. Overall, my model does not perform better than the base model, but the high BERTScore values for
+ the testing split of the training data still indicate that my model generates accurate text and performs well on my dataset. My model did perform better than the Llama model on HumanEval and the E2E NLG Challenge,
+ and it also performed better than DeepSeek's Qwen3 model on the E2E NLG Challenge and the testing split. In general, my model has mixed results in its evaluation, but it performs close to the comparison models.
+
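The "Testing Split of Training Dataset" column above reports BERTScore mean precision, recall, and F1. The evaluation script itself is not part of this diff, so the following is only a minimal sketch of how such numbers might be computed with the `bert_score` package; the candidate and reference strings are hypothetical placeholders.

```python
from bert_score import score

# Hypothetical placeholders: in practice, candidates would be the fine-tuned
# model's generated answers and references the gold answers from the held-out
# testing split of the training dataset.
candidates = ["A Python list is mutable, while a tuple is immutable."]
references = ["Lists can be modified after creation; tuples cannot."]

# score() returns per-example precision, recall, and F1 tensors.
precision, recall, f1 = score(candidates, references, lang="en")

print(f"BERTScore Mean Precision: {precision.mean().item():.3f}")
print(f"BERTScore Mean Recall: {recall.mean().item():.3f}")
print(f"BERTScore Mean F1: {f1.mean().item():.3f}")
```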
 # Usage and Intended Use
 
  Load the model using the HuggingFace Transformers library as shown in the code chunk below.
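The code chunk that sentence refers to sits outside this diff's context, so below is only a minimal, generic sketch of loading a Transformers causal LM; the repository ID is a hypothetical placeholder, since the fine-tuned model's exact Hub name does not appear in the diff.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical placeholder: substitute the fine-tuned model's actual Hub repository ID.
model_id = "ocbyram/<model-name>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # load in the checkpoint's native precision
    device_map="auto",    # requires `accelerate`; places layers on available devices
)

# The base model (Qwen2.5-7B-Instruct) is chat-tuned, so prompts go through the chat template.
messages = [
    {"role": "user", "content": "Give me a practice interview question on Python data structures."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```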