---
library_name: transformers
license: gpl-3.0
language:
- en
base_model:
- Qwen/Qwen2.5-7B-Instruct
---

# Introduction

According to the [August 2025 jobs report](https://www.bls.gov/), overall unemployment has risen, and the unemployment rate for workers aged 16-24 has reached 10.5%. This age range includes many recent college graduates, who often carry student loan debt and are unable to find stable, long-term employment. While this could be attributed to any of the economic challenges facing the US today, there is speculation that it may be due in part to insufficient job-hunting and interviewing skills. Many resources seek to fill this gap, including interview-prep LLMs such as [Interview Copilot](https://interviewcopilot.io/). However, no existing LLM combines multiple features into an end-to-end, user-friendly application specifically designed to improve an applicant's chances of successfully completing the job-application cycle. Current LLMs struggle to provide accurate interview preparation for specific jobs and are not finetuned on the user's profile; they tend to hallucinate and to omit specific user details when developing answers to interview questions, resulting in generic responses. Given these limitations, my interview-prep career assistant LLM aims to provide a full user experience by generating practice interview questions based on the description of the job the user is applying for. It also provides an 'optimal' answer to each interview question based on the user's profile and resume. The model is finetuned from [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) using LoRA with rank 64, alpha 128, and dropout 0.15; that hyperparameter combination produced the lowest validation loss, 2.055938. The model was trained on a synthetic dataset that I developed from job-posting data. After finetuning, the LLM scored 21.578 on the SQuADv2 benchmark, 0.597 on the HumanEval benchmark, and a 5.040 BLEU score on the E2E NLG Challenge benchmark, with a BERTScore mean precision of 0.813, mean recall of 0.848, and mean F1 of 0.830 on the held-out test split. The BERTScores in particular indicate strong alignment between generated and expected responses.

# Data

I found a training dataset of job postings on [Kaggle](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings), under a project labeled 'LinkedIn Job Postings 2023 Data Analysis'.
The dataset contains ~15,000 jobs from LinkedIn, including the company, job title, and a description that lists necessary skills.
It covers a wide variety of jobs and descriptions, which allowed my LLM to be trained on the range of job descriptions that users may input.
I synthetically generated the other two datasets because no suitable datasets were available.
For both generated datasets, I used the [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) model for its ability to efficiently produce accurate natural-language answers as well as technical answers, which was necessary for the interview questions. I used the job postings dataset with few-shot prompting to create the user-profile
dataset: for each job posting in the dataset, I had the model create a 'great', a 'mediocre', and a 'bad' user profile. An example of the few-shot prompting for this was:

```
Job_Title: Software Engineer
Profile_Type: Great
Applicant_Profile:
Education: Master's in Computer Science
Experience: 5 years building backend systems
Skills: Python, SQL, Git, CI/CD
Certifications: AWS Developer
Training: Agile methodology
Additional Qualifications: Mentorship experience
```
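
The exact generation script is not part of this card; the sketch below shows how this few-shot profile generation could be run with Llama-3.2-1B-Instruct, assuming the job postings sit in a CSV with a `title` column (filenames, column names, and generation settings are illustrative assumptions, not the exact pipeline I used).

```python
# Minimal sketch of the profile-generation step. Filenames, column names, and
# generation settings are illustrative assumptions, not the exact script used.
import pandas as pd
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",
    device_map="auto",
)

FEW_SHOT_EXAMPLE = """Job_Title: Software Engineer
Profile_Type: Great
Applicant_Profile:
Education: Master's in Computer Science
Experience: 5 years building backend systems
Skills: Python, SQL, Git, CI/CD
Certifications: AWS Developer
Training: Agile methodology
Additional Qualifications: Mentorship experience
"""

jobs = pd.read_csv("linkedin_job_postings.csv")  # hypothetical filename
rows = []
for _, job in jobs.iterrows():
    for profile_type in ["Great", "Mediocre", "Bad"]:
        prompt = (
            FEW_SHOT_EXAMPLE
            + f"\nJob_Title: {job['title']}\nProfile_Type: {profile_type}\nApplicant_Profile:"
        )
        generated = generator(prompt, max_new_tokens=256, do_sample=True)[0]["generated_text"]
        rows.append({
            "job_title": job["title"],
            "profile_type": profile_type,
            "applicant_profile": generated[len(prompt):].strip(),  # keep only the new text
        })

pd.DataFrame(rows).to_csv("user_profiles.csv", index=False)
```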

Because of the long user profiles that were generated, the resulting CSV was over 2 GB, which is too large for Excel to handle, so I used a Python script to randomly select 5,000 rows; a minimal sketch of that step follows.
With this subset, I used the Llama-3.2-1B-Instruct model again to create the interview/answer data: for each job posting/user profile pair, the model generated an interview question based on the job description and then an optimal answer to that question based on the user profile. An example of a few-shot prompt I used for this step appears after the sampling sketch.
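
A minimal sketch of the sampling step (filenames and the random seed are illustrative):

```python
import pandas as pd

# The full profiles CSV (>2 GB) is too large for Excel, but pandas can load it
# and draw a random 5,000-row subset. Filenames and random_state are illustrative.
profiles = pd.read_csv("user_profiles.csv")
subset = profiles.sample(n=5000, random_state=42)
subset.to_csv("user_profiles_subset.csv", index=False)
```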

```
Job Title: Data Scientist
Job Description: Analyze data and build predictive models.
Applicant Profile: Experienced in Python, R, and ML models.
Interview Question: Tell me about a machine learning project you are proud of.
Optimal Answer: I developed a predictive model using Python and scikit-learn to forecast customer churn, achieving 85% accuracy by carefully preprocessing the data and tuning hyperparameters.
```

After creating this dataset, I uploaded it to my project notebook and reformatted it to make it easier to train on. I created an 'Instruct' column containing each row's job title,
description, and applicant profile, plus the prompt 'Generate a relevant interview question and
provide an optimal answer using the information from this applicant's profile. Interview Question and Optimal Answer:'. I then combined the interview question and optimal answer
into one column labeled 'Answer'. Finally, I established a training, validation, and testing split using scikit-learn's train_test_split function and pandas' .sample()
method for shuffling (a sketch follows). The proportions are: training, 3,200 examples (64% of the total); validation, 800 examples (16%); testing, 1,000 examples (20%);
with a random seed of 42.
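
A minimal sketch of that split, assuming the reformatted data is in a CSV with the 'Instruct' and 'Answer' columns (the filename is illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Shuffle with pandas .sample(), then hold out 20% for testing and 20% of the
# remainder for validation: 3,200 / 800 / 1,000 rows of the 5,000-row dataset.
df = pd.read_csv("interview_qa_formatted.csv")  # hypothetical filename
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
train_val, test = train_test_split(df, test_size=0.20, random_state=42)
train, val = train_test_split(train_val, test_size=0.20, random_state=42)
print(len(train), len(val), len(test))  # expected: 3200 800 1000
```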


# Methodology

The training method that I implemented for this task was finetuning, specifically the parameter-efficient finetuning (PEFT) method LoRA. In class we learned about several model interventions, ranging from few-shot prompting to full finetuning.
For this project, I chose PEFT. PEFT updates selected parts of the model to increase task performance while keeping catastrophic forgetting to a minimum, since many of its methods freeze parameters or layers to prevent ones irrelevant to the task from being updated. PEFT is a great alternative to full finetuning because it uses fewer resources
but can still produce effective results. Knowing that I would use PEFT was not enough to define my training approach; I also needed to decide which PEFT method to use, how to set the hyperparameters,
and how to choose the best model. As we learned in class, two basic PEFT methods are prompt tuning and LoRA. In past projects, I found that prompt tuning caused catastrophic forgetting with no accuracy gain on the task I was training:
the task, gsm8k_cot, had a flexible-match accuracy of only 0.02 both before and after prompt tuning, while accuracy on the SST-2 benchmark decreased from 0.72 to 0.60.
This was not something I was eager to repeat, since I wanted training to actually improve my task. In another assignment, I found that LoRA improved that same task
from 0.0 accuracy to 0.10 (a 10-point increase), while decreasing SST-2 from 0.72 to 0.63 after training. While there was still evidence of
catastrophic forgetting, a 10-point performance increase is hard to ignore, so I chose LoRA as the PEFT method for my training. LoRA injects
low-rank adapters into specific modules, which I hoped would train the model to perform my task well. I performed training with three sets of hyperparameters,
collected the validation loss for each, and chose the model/hyperparameter combination with the lowest loss. That combination was rank: 64, alpha: 128, and dropout: 0.15; a configuration sketch is shown below.
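
A minimal sketch of that LoRA setup with the PEFT library (the target modules listed are the common Qwen2.5 attention projections and are an assumption, not necessarily the exact set used in training):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Attach LoRA adapters with the selected hyperparameters (rank 64, alpha 128,
# dropout 0.15). target_modules is an assumption: the usual Qwen2.5 projections.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", device_map="auto", dtype=torch.bfloat16
)
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.15,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```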

# Evaluation

| Model | HumanEval | SQuADv2 | E2E NLG Challenge | Testing Split of Training Dataset (BERTScore) |
|---|---|---|---|---|
| Base Model: [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | 0.652 | 9.81 | 6.68 | Mean Precision: 0.829, Mean Recall: 0.852, Mean F1: 0.841 |
| My Model: Qwen2.5-7B-Instruct finetuned with LoRA | 0.598 | 21.57 | 5.04 | Mean Precision: 0.813, Mean Recall: 0.848, Mean F1: 0.830 |
| Similar-Size Model: [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | 0.280 | 20.33 | 2.26 | Mean Precision: 0.814, Mean Recall: 0.848, Mean F1: 0.830 |
| Similar-Size Model: [deepseek-ai/DeepSeek-R1-0528-Qwen3-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B) | 0.634 | 5.81 | 3.63 | Mean Precision: 0.803, Mean Recall: 0.831, Mean F1: 0.817 |

The benchmark evaluation tasks that I chose for this project are [HumanEval](https://github.com/openai/human-eval), [SQuADv2](https://rajpurkar.github.io/SQuAD-explorer/), and the [E2E NLG Challenge](https://github.com/tuetschek/e2e-dataset). The HumanEval benchmark ([Chen et al., 2021](https://arxiv.org/abs/2107.03374)) evaluates generated code, which matters because
part of my model's job is producing and answering technical questions, many of which include code. I chose HumanEval specifically because its functional-correctness assessment is close to how humans
assess code, which is important since the technical interview questions are meant to prepare a user for an interview conducted by another human. The SQuAD benchmark ([Rajpurkar et al., 2018](https://arxiv.org/abs/1806.03822)) evaluates reading comprehension.
This is an essential assessment of my model because it needs to understand and extract details from the user's credentials and the job description to produce accurate interview questions; it also tests
whether the model retains general comprehension skills or has overfit the synthetic data. The E2E NLG Challenge benchmark ([Novikova et al., 2017](https://arxiv.org/abs/1706.09254)) tests general language generation capabilities.
If my model performs poorly on it, I know the synthetic data caused overfitting and the model can no longer handle basic sentence composition and reasoning. I chose the comparison model [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
because it is a well-known model of similar size and structure to mine: it has 8B parameters while mine has 7B, it is likewise an Instruct model, and it generates text well, which is an essential baseline
capability of my model. I chose the other comparison model, [deepseek-ai/DeepSeek-R1-0528-Qwen3-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B), for similar reasons: it is approximately the same size and is built off of a Qwen base model, like mine,
which lets me see how well my finetuning performs compared to other Qwen-based models. Overall, my model does not beat the base model on the testing split, but its high BERTScores there
still indicate that it generates accurate text and performs well on my dataset. My model did perform better than the Llama model on HumanEval and the E2E NLG Challenge,
and better than DeepSeek's Qwen3-based model on the E2E NLG Challenge and the testing split. In general, my model has mixed evaluation results, but it performs close to the comparison models.
Additionally, the actual outputs of the model are coherent and relevant, indicating that while some benchmark scores are low, the model still performs its task well.
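
The exact evaluation harness is not reproduced here, but the test-split numbers above are BERTScore values; the sketch below shows how such scores can be computed with the bert-score package (the prediction and reference lists are placeholders):

```python
from bert_score import score

# Placeholder lists: model outputs on the test split and the dataset's
# reference question/answer pairs.
predictions = ["How do you handle missing data? I typically use imputation..."]
references = ["How do you handle missing data in your datasets? I use mean imputation..."]

# score() returns per-example precision, recall, and F1 tensors.
P, R, F1 = score(predictions, references, lang="en")
print(P.mean().item(), R.mean().item(), F1.mean().item())
```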

# Usage and Intended Use

Load the model with the Hugging Face Transformers library as shown in the code chunk below (the `pipeline` import is used in the prompt-format example further down).
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Load the finetuned model and its tokenizer; replace the token with your own.
tokenizer = AutoTokenizer.from_pretrained("ocbyram/Interview_Prep_Help", token="your_token_here")
model = AutoModelForCausalLM.from_pretrained(
    "ocbyram/Interview_Prep_Help",
    device_map="auto",
    dtype=torch.bfloat16,
    token="your_token_here",
)
```
The intended use case for this model is interview prep for any job that has an accurate description. The overall goal of this model is to help a user get close to real-world
interview practice by answering realistic and complex questions. The model is able to look through any job description and develop diverse simulation interview questions based
on said description. The model will also use the user's input profile with information such as education, experience, and skills to formulate an optimal answer to the interview question.
This answer will allow the user to see how their profile can be optimized to answer questions and give them the best chance at moving to the next round of the
job hiring process. Specifically, this model is intended for users who have little-to-no interview experience and need more intense preparation, or users that want to enhance their
answers to complex interview questions. For example, if I was applying for a data scientist position, but had little experience with data science, this model would find a way to 
use my other education and experience to supplement my answer to data science-specific interview questions.

# Prompt Format

The prompt is formatted by entering the job title and description that the user is applying for, then the user's profile (experience, skills, education,
special awards, etc.), and finally "Interview Question and Optimal Answer:" followed by a blank space. The user must include the 'Job Description:' and
'User Profile:' labels in the formatted prompt, and must use the pipeline shown below together with a formatted prompt to receive an output.
An example of a formatted prompt is included in the same code block.

```python
# Build a text-generation pipeline around the model and tokenizer loaded above.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=False,
)

# Example of a formatted prompt (note the 'Job Description:' and 'User Profile:'
# labels and the trailing "Interview Question and Optimal Answer:" cue).
formatted_prompt = """
Job Description: Data Scientist. Education must include a Bachelor's or Master's degree in Data Science, Computer Science, or Statistics, and
the candidate must have 1-2 years of experience in data analytics, machine learning, or AI model deployment.

User Profile: I got my bachelor's in computer science from the
University of Delaware in 2020 and a master's in Data Science from the University of Virginia in 2021. I have been working as a data scientist at Google for three
years. My skills include Python, data visualization, SQL, Microsoft Office, and Tableau.

Interview Question and Optimal Answer: """
```
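
To generate output, pass the formatted prompt through the pipeline and strip the echoed prompt (a minimal usage sketch):

```python
# The pipeline echoes the prompt, so keep only the newly generated text.
result = pipe(formatted_prompt)[0]["generated_text"]
print(result[len(formatted_prompt):].strip())
```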

# Expected Output Format

The expected output format for this model is a generated interview question followed by an optimal answer based on the user profile. There is an example of
the expected output format below.

```
How do you handle missing data in your datasets?

I use various imputation techniques such as mean imputation, median imputation, and KNN imputation to fill in
missing values. I also use techniques like forward and backward filling to handle missing data in time series data.
```

# Limitations

The main limitation of this model is that it does not perform well on benchmarks outside of the chosen task, indicating that the model suffered
catastrophic forgetting during training. The trained model's performance on HumanEval and the E2E NLG Challenge was lower than the base model's,
so using the model for anything outside of the interview-preparation use case is unlikely to work well. Additionally, some of the model's responses were not as expected:
they included multiple questions and answers instead of the single pair I asked for. While this technically works as long as the questions and answers are coherent and relevant,
it is still a limitation because I did not want the model to generate more than one question/answer pair, and generating multiple pairs carries a higher risk of inaccurate outputs.

# Acknowledgments

Base Model: [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)

Synthetic Data Generation Model: [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)

Original Dataset: [LinkedIn Job Postings 2023](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings)

Comparison Model 1: [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) 

Comparison Model 2: [deepseek-ai/DeepSeek-R1-0528-Qwen3-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B)

Benchmark 1: [HumanEval](https://github.com/openai/human-eval) and [HumanEval Article](https://arxiv.org/abs/2107.03374)

Benchmark 2: [SQuADv2](https://rajpurkar.github.io/SQuAD-explorer/) and [SQuAD Article](https://arxiv.org/abs/1806.03822)

Benchmark 3: [E2E NLG Challenge](https://github.com/tuetschek/e2e-dataset) and [E2E NLG Challenge Article](https://arxiv.org/abs/1706.09254)