ocbyram committed (verified) · Commit 29d0f24 · 1 Parent(s): 1b55aad

Update README.md

Files changed (1)
  1. README.md +22 -17
README.md CHANGED
@@ -9,16 +9,16 @@ base_model:

 # Introduction

- According to the August 2025 jobs report, overall unemployment has risen, with the unemployment rate for workers aged 16-24 rising to 10.5% (Bureau of Labor Statistics, 2025). The primary demographic
+ According to the August 2025 jobs report, overall unemployment has risen, with the unemployment rate for workers aged 16-24 rising to 10.5% [August 2025 jobs report](https://www.bls.gov/). The primary demographic
 of this age range is recent college graduates, many of whom carry student loan debt and are unable to find stable, long-term employment. While this could be
 attributed to any of the various economic challenges facing the US today, there is speculation that it may
- be due to insufficient skills regarding job-hunting and interviews. There are many resources that seek to fill this gap, including interview-prep LLMs such as InterviewsPilot (InterviewsPilot, 2025). However, there is not an LLM that
+ be due to insufficient skills regarding job-hunting and interviews. There are many resources that seek to fill this gap, including interview-prep LLMs such as [InterviewsPilot](https://interviewspilot.ai/). However, there is not an LLM that
 combines multiple features into an end-to-end, user-friendly application, specifically designed to improve an applicant's chances of successfully
 completing the job-application cycle. Current LLMs struggle to provide accurate interview preparation based on specific jobs and are not finetuned based on
 the user's profile. They tend to hallucinate and struggle to include specific user details when developing answers to interview questions, resulting in generic responses.
 Due to these limitations, my interview prep career assistant LLM seeks to provide a full user experience by specifically developing practice job interview questions
 based on the description of the job they are applying for. Additionally, it provides users with an 'optimal' answer to the interview questions based on their
- profile and resume. The interview prep LLM is finetuned from model Qwen2.5-7B-Instruct using LoRA with hyperparameters rank: 64, alpha: 128, and dropout: 0.15.
+ profile and resume. The interview prep LLM is finetuned from model [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) using LoRA with hyperparameters rank: 64, alpha: 128, and dropout: 0.15.
 That hyperparameter combination resulted in the lowest validation loss, 2.055938. The model was trained on a synthetic dataset that I developed using user job data.
 After finetuning, the LLM performed with a 21.578 in the SQuADv2 benchmark, a
 0.597 in the humaneval benchmark, a 5.040 bleu score in the E2E NLG Challenge benchmark, and a bert score mean precision of 0.813, mean recall of 0.848, and mean f1 of 0.830
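
For readers who want to reproduce the fine-tuning setup described above, the following is a minimal sketch using the Hugging Face `peft` library. The reported hyperparameters (rank 64, alpha 128, dropout 0.15) come from the card; the target modules and loading details are illustrative assumptions, not the exact configuration used.

```{python}
# Minimal LoRA fine-tuning sketch (illustrative; the exact training setup may differ)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_id = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")

# LoRA hyperparameters reported in the card: rank 64, alpha 128, dropout 0.15
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.15,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # confirms only the LoRA adapters are trainable
```
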
@@ -26,11 +26,11 @@ on a train/test split. The bert scores specifically indicate that my model has a

 # Data

- I was able to find a training dataset of job postings on Kaggle (Arshkon, 2023), under a project labeled ‘LinkedIn Job Postings 2023 Data Analysis’.
+ I was able to find a training dataset of job postings on [Kaggle](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings), under a project labeled ‘LinkedIn Job Postings 2023 Data Analysis’.
 The dataset used has ~15,000 jobs from LinkedIn. It includes the company, job title, and a description that includes necessary skills.
 This dataset has a variety of different jobs and descriptions, which allowed my LLM to be trained for a multitude of job descriptions that users may input.
 I synthetically generated the other two datasets, due to the lack of available datasets.
- For both generated datasets, I used the Llama-3.2-1B-Instruct model, due to its ability to efficiently produce accurate natural language answers,
+ For both generated datasets, I used the [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) model, due to its ability to efficiently produce accurate natural language answers,
 as well as technical answers, which was necessary for the interview questions. I used the job postings dataset with few-shot prompting to create the user profile
 dataset. For each job posting in the dataset, I had the model create a 'great', 'mediocre', and 'bad' user profile. An example of the few shot prompting for this was:
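
The card's own few-shot example is not shown in this diff; as a rough illustration of the generation step described above, the sketch below prompts Llama-3.2-1B-Instruct for a 'great', 'mediocre', and 'bad' profile per job posting. The prompt text, helper function, and decoding settings are assumptions for illustration, not the exact prompts used to build the dataset.

```{python}
# Sketch: generate synthetic user profiles for each job posting via few-shot prompting
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct", device_map="auto")

# Illustrative few-shot examples (the card includes its own example prompt)
FEW_SHOT = (
    "Job: Data Analyst requiring SQL and Tableau. Quality: great. "
    "Profile: MS in Statistics, four years of SQL and Tableau experience at a retail company.\n"
    "Job: Data Analyst requiring SQL and Tableau. Quality: bad. "
    "Profile: No degree, no analytics experience, unfamiliar with spreadsheets.\n"
)

def make_profile(job_description, quality):
    prompt = f"{FEW_SHOT}Job: {job_description}. Quality: {quality}. Profile:"
    out = generator(prompt, max_new_tokens=120, do_sample=True, temperature=0.8)
    return out[0]["generated_text"][len(prompt):].strip()  # keep only the continuation

profiles = {q: make_profile("Data Scientist requiring Python and SQL", q)
            for q in ["great", "mediocre", "bad"]}
```
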
@@ -84,21 +84,21 @@ while collecting the validation loss, then choose model/combination of hyperpara

 # Evaluation

- | Model | HumanEval | SQuADv2 | E2E NLG Challenge | Testing Split of Training Dataset |
- |-----------------------------------------------------------|-----------|---------|-------------------|--------------------------------------------------------------------------------------------|
- | Base Model: Qwen/Qwen2.5-7B-Instruct | 0.652 | 9.81 | 6.68 | Bert Score Mean Precision: 0.829, Bert Score Mean Recall: 0.852, Bert Score Mean F1: 0.841 |
- | My Model: Qwen/Qwen2.5-7B-Instruct Trained and Finetuned | 0.598 | 21.57 | 5.04 | Bert Score Mean Precision: 0.813, Bert Score Mean Recall: 0.848, Bert Score Mean F1: 0.830 |
- | Similar Size Model: meta-llama/Meta-Llama-3-8B-Instruct | 0.280 | 20.33 | 2.26 | Bert Score Mean Precision: 0.814, Bert Score Mean Recall: 0.848, Bert Score Mean F1: 0.830 |
- | Similar Size Model: deepseek-ai/DeepSeek-R1-0528-Qwen3-8B | 0.634 | 5.81 | 3.63 | Bert Score Mean Precision: 0.803, Bert Score Mean Recall: 0.831, Bert Score Mean F1: 0.817 |
+ | Model | HumanEval | SQuADv2 | E2E NLG Challenge | Testing Split of Training Dataset |
+ |----------------------------------------------------------------------------------------------------------------------------|-----------|---------|-------------------|--------------------------------------------------------------------------------------------|
+ | Base Model: [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | 0.652 | 9.81 | 6.68 | Bert Score Mean Precision: 0.829, Bert Score Mean Recall: 0.852, Bert Score Mean F1: 0.841 |
+ | My Model: Qwen/Qwen2.5-7B-Instruct Trained and Finetuned | 0.598 | 21.57 | 5.04 | Bert Score Mean Precision: 0.813, Bert Score Mean Recall: 0.848, Bert Score Mean F1: 0.830 |
+ | Similar Size Model: [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | 0.280 | 20.33 | 2.26 | Bert Score Mean Precision: 0.814, Bert Score Mean Recall: 0.848, Bert Score Mean F1: 0.830 |
+ | Similar Size Model: [deepseek-ai/DeepSeek-R1-0528-Qwen3-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B) | 0.634 | 5.81 | 3.63 | Bert Score Mean Precision: 0.803, Bert Score Mean Recall: 0.831, Bert Score Mean F1: 0.817 |

- The benchmark evaluation tasks that I chose for this project are HumanEval, SQuADv2, and E2E NLG Challenge. The HumanEval benchmark (Chen et al., 2021) evaluates written code, which is essential as
+ The benchmark evaluation tasks that I chose for this project are [HumanEval](https://github.com/openai/human-eval), [SQuADv2](https://rajpurkar.github.io/SQuAD-explorer/), and [E2E NLG Challenge](https://github.com/tuetschek/e2e-dataset). The HumanEval benchmark ([Chen et al., 2021](https://arxiv.org/abs/2107.03374)) evaluates written code, which is essential as
 part of my model is producing and answering technical questions, many of which will include code. I specifically chose the HumanEval benchmark because it has a meaningful assessment similar to how humans
- assess code, which is important since the technical interview questions are meant to prepare a user for an interview facilitated by another human. The SQuAD benchmark (Rajpurkar et al., 2018) evaluates reading comprehension.
+ assess code, which is important since the technical interview questions are meant to prepare a user for an interview facilitated by another human. The SQuAD benchmark ([Rajpurkar et al., 2018](https://arxiv.org/abs/1806.03822)) evaluates reading comprehension.
 This is an essential assessment of my model because it needs to be able to understand and extract aspects of the user credentials and job descriptions to produce accurate interview questions. I specifically chose this benchmark
- due to its ability to test whether my model is able to retain general comprehension skills or if it overfits synthetic data. The E2E NLG Challenge benchmark (Novikova et al., 2017) tests general language capabilities.
- If my model performs poorly, I know that my synthetic data overfit the model and it cannot perform things like basic sentence composition and reasoning. I chose the comparison mode meta-llama/Meta-Llama-3-8B-Instruct
+ due to its ability to test whether my model is able to retain general comprehension skills or if it overfits synthetic data. The E2E NLG Challenge benchmark ([Novikova et al., 2017](https://arxiv.org/abs/1706.09254)) tests general language capabilities.
+ If my model performs poorly, I know that my synthetic data overfit the model and it cannot perform things like basic sentence composition and reasoning. I chose the comparison model [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
 because it is a well known model of similar size and structure to mine. It is 8B while mine is 7B, and is also Instruct like mine is. Additionally, it performs well when generating text, which is an essential baseline
- capability of my model. I chose the other comparison model deepseek-ai/DeepSeek-R1-0528-Qwen3-8B for similar reasons, the size is approximately the same and it is a version built of off the base model Qwen, just like mine is.
+ capability of my model. I chose the other comparison model [deepseek-ai/DeepSeek-R1-0528-Qwen3-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B) for similar reasons: the size is approximately the same, and it is a version built off of the base model Qwen, just like mine is.
 This will allow me to see how well my finetuning performed as compared to other models that use Qwen as a baseline. Overall, my model does not perform better than the baseline model, but the high bert scores for
 the testing split of training data still indicate that my model generates accurate text and performs well with my dataset. My model did perform better than the Llama model when it came to HumanEval and E2E NLG Challenge,
 and it also performed better than DeepSeek's Qwen3 model when it came to E2E NLG Challenge and the testing split. In general, my model has mixed results in its evaluation, but it performs closely to the comparison models.
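
The BERTScore numbers reported for the testing split of the training dataset can be computed roughly as follows. This is a minimal sketch assuming the `bert-score` package; the prediction and reference lists are placeholders, not the actual evaluation data.

```{python}
# Sketch: mean BERTScore precision/recall/F1 on a held-out test split
from bert_score import score

predictions = ["generated interview question and optimal answer ..."]  # placeholder model outputs
references  = ["reference interview question and optimal answer ..."]  # placeholder gold answers

P, R, F1 = score(predictions, references, lang="en")
print(f"Mean precision: {P.mean().item():.3f}")
print(f"Mean recall:    {R.mean().item():.3f}")
print(f"Mean F1:        {F1.mean().item():.3f}")
```
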
@@ -107,6 +107,7 @@ it also performed better against deepseek's Qwen3 model when it came to E2E NLG

 Load the model using the HuggingFace Transformers library as shown in the code chunk below.
 ```{python}
+ import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
 tokenizer = AutoTokenizer.from_pretrained('ocbyram/Interview_Prep_Help', token = "your_token_here")
 model = AutoModelForCausalLM.from_pretrained('ocbyram/Interview_Prep_Help', device_map = "auto", dtype = torch.bfloat16, token = "your_token_here")
@@ -137,7 +138,7 @@ pipe = pipeline(
 )

 formatted_prompt = f"Job Description: Data Scientist. User Profile: I got my bachelor's in computer science from the University of Delaware in 2020 and
- a master's in Data Science from the Univerity of Virginia in 2021. I have been working as a data scientist at Google for three years. My skills include Python,
+ a master's in Data Science from the University of Virginia in 2021. I have been working as a data scientist at Google for three years. My skills include Python,
 data visualization, SQL, Microsoft Office, and Tableau. Interview Question and Optimal Answer: "

 ```
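
Once the pipeline and prompt above are defined (with the multi-line prompt joined into a single Python string), generation might look like the sketch below; the decoding parameters are illustrative assumptions rather than recommended settings.

```{python}
# Sketch: generate one interview question and optimal answer from the formatted prompt
output = pipe(
    formatted_prompt,
    max_new_tokens=300,      # illustrative decoding settings
    do_sample=True,
    temperature=0.7,
    return_full_text=False,  # return only the generated continuation, not the prompt
)
print(output[0]["generated_text"])
```
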
@@ -160,4 +161,8 @@ This means that using the model for anything outside of the interview preparatio
 as they included multiple questions and answers instead of the one that I asked for. While this technically works as long as the questions/answers are coherent and relevant,
 it is still a limitation because I did not want the model to generate more than one question/answer. Generating multiple has a higher risk of inaccurate generated outputs.

+ # Acknowledgments

+ [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
+ [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)
+ [LinkedIn Job Postings 2023](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings)