---
library_name: transformers
license: gpl-3.0
language:
- en
base_model:
- Qwen/Qwen2.5-7B-Instruct
---
Introduction
According to the August 2025 jobs report, overall unemployment has risen, with the unemployment rate for workers aged 16-24 reaching 10.5% (Bureau of Labor Statistics, 2025). This age range includes many recent college graduates, who often carry student loan debt and are unable to find stable, long-term employment. While this could be attributed to any of the various economic challenges facing the US today, there is speculation that it may stem in part from insufficient job-hunting and interview skills. Many resources seek to fill this gap, including interview-prep LLMs such as InterviewsPilot (InterviewsPilot, 2025). However, no existing LLM combines multiple features into an end-to-end, user-friendly application specifically designed to improve an applicant's chances of successfully completing the job-application cycle. Current LLMs struggle to provide accurate interview preparation tailored to specific jobs and are not finetuned on the user's profile. They tend to hallucinate and struggle to incorporate specific user details when developing answers to interview questions, resulting in generic responses.

Given these limitations, my interview-prep career assistant LLM seeks to provide a full user experience by generating practice interview questions based on the description of the job the user is applying for. It also provides an 'optimal' answer to each interview question based on the user's profile and resume. The model is finetuned from Qwen2.5-7B-Instruct using LoRA with hyperparameters rank: 64, alpha: 128, and dropout: 0.15; this combination resulted in the lowest validation loss, 2.055938. The model was trained on a synthetic dataset that I developed using user job data.
After finetuning, the LLM scored 21.578 on the SQuADv2 benchmark, 0.597 on the HumanEval benchmark, and a BLEU score of 5.040 on the E2E NLG Challenge benchmark. On the held-out test split, it achieved a BERTScore mean precision of 0.813, mean recall of 0.848, and mean F1 of 0.830. The BERTScore results in particular indicate strong alignment between my model's generated and expected responses.
Data
I found a training dataset of job postings on Kaggle (Arshkon, 2023), under a project labeled 'LinkedIn Job Postings 2023 Data Analysis'. The dataset contains ~15,000 jobs from LinkedIn, including the company, job title, and a description that lists necessary skills. This variety of jobs and descriptions allowed my LLM to be trained on the multitude of job descriptions that users may input. I synthetically generated the other two datasets, due to the lack of available datasets. For both generated datasets, I used the Llama-3.2-1B-Instruct model for its ability to efficiently produce accurate natural-language answers as well as technical answers, which was necessary for the interview questions. Using the job postings dataset with few-shot prompting, I created the user profile dataset: for each job posting, I had the model create a 'great', a 'mediocre', and a 'bad' user profile. An example of the few-shot prompt is:
```
Job_Title: Software Engineer
Profile_Type: Great
Applicant_Profile:
Education: Master's in Computer Science
Experience: 5 years building backend systems
Skills: Python, SQL, Git, CI/CD
Certifications: AWS Developer
Training: Agile methodology
Additional Qualifications: Mentorship experience
```
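A few-shot prompt like the one above can be assembled programmatically before being sent to the generation model. The sketch below shows one way to do this; the function and variable names are illustrative, not the exact script used.

```python
# Illustrative sketch: build a few-shot prompt asking the generation model
# for an applicant profile of a given quality level for a new job posting.
FEW_SHOT_EXAMPLE = """Job_Title: Software Engineer
Profile_Type: Great
Applicant_Profile:
Education: Master's in Computer Science
Experience: 5 years building backend systems
Skills: Python, SQL, Git, CI/CD
Certifications: AWS Developer
Training: Agile methodology
Additional Qualifications: Mentorship experience"""

def make_profile_prompt(job_title: str, profile_type: str) -> str:
    """Append a new job posting and target quality level after the example."""
    return (
        f"{FEW_SHOT_EXAMPLE}\n\n"
        f"Job_Title: {job_title}\n"
        f"Profile_Type: {profile_type}\n"
        f"Applicant_Profile:"
    )

prompt = make_profile_prompt("Data Scientist", "Mediocre")
```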
Because the generated user profiles were long, the resulting CSV was over 2 GB, too large for Excel to handle. I used a Python script to randomly select 5,000 rows. With this new subset dataset, I used the Llama-3.2-1B-Instruct model again to create the interview question/answer data: for each job posting/user profile pair, I had the model create an interview question based on the job description, then an optimal answer to that question based on the user profile. An example of a few-shot prompt I used is below.
```
Job Title: Data Scientist
Job Description: Analyze data and build predictive models.
Applicant Profile: Experienced in Python, R, and ML models.
Interview Question: Tell me about a machine learning project you are proud of.
Optimal Answer: I developed a predictive model using Python and scikit-learn to forecast customer churn, achieving 85% accuracy by carefully preprocessing the data and tuning hyperparameters.
```
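The 5,000-row subsampling step mentioned earlier can be sketched with pandas as follows. This is a minimal sketch: the DataFrame here stands in for the >2 GB profiles CSV (in practice, it would come from `pd.read_csv`), and the file and column names are assumptions.

```python
import pandas as pd

# Stand-in for the >2 GB generated-profiles CSV; in practice this would be
# loaded with pd.read_csv on the real file.
df = pd.DataFrame({
    "job_title": [f"job_{i}" for i in range(20_000)],
    "profile": ["..."] * 20_000,
})

# Draw a reproducible random subset of 5,000 rows.
subset = df.sample(n=5000, random_state=42).reset_index(drop=True)
subset.to_csv("user_profiles_5k.csv", index=False)
```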
After creating this dataset, I uploaded it to my project notebook and reformatted it to make it easier to train on. I created an 'Instruct' column containing each row's job title, description, and applicant profile, followed by the prompt 'Generate a relevant interview question and provide an optimal answer using the information from this applicant's profile. Interview Question and Optimal Answer:'. I then combined the interview question and optimal answer into one column labeled 'Answer'. Finally, I established a training, validation, and testing split using scikit-learn's train_test_split function and pandas' .sample() method for shuffling. The proportions are as follows: Training: 3,200 examples (64% of total), Validation: 800 examples (16% of total), Testing: 1,000 examples (20% of total), with random seed 42.
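The split described above can be sketched as follows. The DataFrame is a stand-in for the 5,000-row instruction dataset, and the exact call order may differ from the original script, but the sketch reproduces the 3,200 / 800 / 1,000 proportions with seed 42.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the 5,000-row instruction dataset described above.
df = pd.DataFrame({
    "Instruct": [f"prompt {i}" for i in range(5000)],
    "Answer": [f"answer {i}" for i in range(5000)],
})

# Shuffle with pandas, then carve out the 1,000-example test split and the
# 800-example validation split, leaving 3,200 examples for training.
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
train_val, test = train_test_split(df, test_size=1000, random_state=42)
train, val = train_test_split(train_val, test_size=800, random_state=42)
```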
Methodology
The training method I implemented for this task was finetuning, specifically the parameter-efficient finetuning (PEFT) method LoRA. In class we learned about several model interventions, ranging from few-shot prompting to full finetuning. For this project, I chose PEFT. PEFT updates a small subset of the model to increase task performance while keeping catastrophic forgetting minimal, as many of its methods freeze parameters/layers to prevent those irrelevant to the task from being updated. PEFT is a great alternative to full finetuning because it uses fewer resources while still producing effective results. Knowing that I would use PEFT was not enough for my training approach; I also needed to decide which PEFT method to use, what to set the hyperparameters to, and how to choose the best model. As we learned in class, two basic PEFT methods are prompt tuning and LoRA. In past projects, I found that prompt tuning resulted in catastrophic forgetting as well as no accuracy increase on the trained task: gsm8k_cot had a flexible-match accuracy of only 0.02 both before and after prompt tuning, while accuracy on the SST-2 benchmark decreased from 0.72 to 0.60. This was not something I was eager to repeat, as I would prefer that training improve my task. In another assignment, I found that LoRA improved that same task from 0.0 accuracy to 0.10 (a 10-point increase), while decreasing SST-2 from 0.72 to 0.63 after training. While there was still evidence of catastrophic forgetting, a 10-point performance gain is hard to ignore, so I chose LoRA as the PEFT method to implement in my training. LoRA injects low-rank adapters into specific modules, which I hoped would train the model to perform my task well.
I performed training with three sets of hyperparameters, collected the validation loss for each, and then chose the model/hyperparameter combination with the lowest loss. This combination was rank: 64, alpha: 128, and dropout: 0.15.
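A LoRA configuration matching these hyperparameters might look like the following sketch using the Hugging Face peft library. The target modules listed are a common choice for attention projections in Qwen-style models, not necessarily the exact ones used in training.

```python
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=64,             # adapter rank
    lora_alpha=128,   # scaling factor (alpha / r = 2)
    lora_dropout=0.15,
    # Assumed target modules; attention projections are a common choice.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# model = get_peft_model(base_model, lora_config)
```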
Evaluation
The model was evaluated on three benchmark tasks: SQuADv2, which measures reading comprehension and question answering; HumanEval, which measures code generation (relevant because interview answers can be technical); and the E2E NLG Challenge, which measures natural-language generation quality via BLEU. In addition, BERTScore precision, recall, and F1 were computed on the held-out testing split of the training dataset. Results for the finetuned model are summarized below.

| Evaluation | Score |
|---|---|
| SQuADv2 | 21.578 |
| HumanEval | 0.597 |
| E2E NLG Challenge (BLEU) | 5.040 |
| Test split BERTScore (precision / recall / F1) | 0.813 / 0.848 / 0.830 |

The BERTScore results on the testing split indicate strong alignment between the model's generated answers and the expected responses.
Usage and Intended Use
Load the model using the HuggingFace Transformers library as shown in the code chunk below.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ocbyram/Interview_Prep_Help", token="your_token_here")
model = AutoModelForCausalLM.from_pretrained(
    "ocbyram/Interview_Prep_Help",
    device_map="auto",
    dtype=torch.bfloat16,
    token="your_token_here",
)
```
The intended use case for this model is interview prep for any job that has an accurate description. The overall goal is to give users close-to-real-world interview practice by posing realistic and complex questions. The model can look through any job description and develop diverse simulated interview questions based on it. It will also use the user's input profile, with information such as education, experience, and skills, to formulate an optimal answer to each interview question. This answer lets users see how their profile can be optimized to answer questions, giving them the best chance of moving to the next round of the hiring process. Specifically, this model is intended for users who have little to no interview experience and need more intensive preparation, or users who want to enhance their answers to complex interview questions. For example, if I were applying for a teaching role but had little teaching experience, this model would find a way to use my other education and experience to supplement my answers to teacher-specific interview questions.
Prompt Format
Prompts follow the instruction format used to build the training data's 'Instruct' column: the job title, job description, and applicant profile, followed by the instruction 'Generate a relevant interview question and provide an optimal answer using the information from this applicant's profile. Interview Question and Optimal Answer:'.
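A sketch of assembling a prompt in this format is shown below. The helper name and the exact field labels are illustrative (they follow the examples in the Data section), and the sample values are hypothetical.

```python
def format_prompt(job_title: str, job_description: str, applicant_profile: str) -> str:
    """Build a prompt matching the 'Instruct' format used during training."""
    return (
        f"Job Title: {job_title}\n"
        f"Job Description: {job_description}\n"
        f"Applicant Profile: {applicant_profile}\n"
        "Generate a relevant interview question and provide an optimal answer "
        "using the information from this applicant's profile. "
        "Interview Question and Optimal Answer:"
    )

prompt = format_prompt(
    "Data Scientist",
    "Analyze data and build predictive models.",
    "Experienced in Python, R, and ML models.",
)
```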
Expected Output Format
The model responds with an interview question tailored to the job description, followed by an optimal answer grounded in the applicant's profile, mirroring the combined 'Answer' column of the training data.
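A response of the expected shape (reusing the training data example from the Data section) looks like:

```
Interview Question: Tell me about a machine learning project you are proud of.
Optimal Answer: I developed a predictive model using Python and scikit-learn to forecast customer churn, achieving 85% accuracy by carefully preprocessing the data and tuning hyperparameters.
```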
Limitations
The model's main limitations show up in its benchmark performance: the BLEU score of 5.040 on the E2E NLG Challenge and the 21.578 on SQuADv2 suggest that general question answering and surface-level text overlap remain weak, even though the BERTScore results show strong semantic alignment on the task itself. As in my earlier LoRA experiments, some catastrophic forgetting of general capabilities is likely. The model also depends heavily on input quality: it was trained on synthetic data generated by Llama-3.2-1B-Instruct, so its questions and 'optimal' answers inherit any inaccuracies or generic phrasing from that generation process, and jobs without an accurate, detailed description may yield weaker interview questions. Finally, like other LLMs, it can hallucinate details not present in the user's profile, so generated answers should be reviewed before being used in a real interview.