Update README.md

README.md

I was able to find a training dataset of job postings on Kaggle (Arshkon, 2023), under a project labeled ‘LinkedIn Job Postings 2023 Data Analysis’. The dataset contains roughly 15,000 LinkedIn job postings, each with the company, the job title, and a description of the required skills. This variety of jobs and descriptions allowed my LLM to be trained on the wide range of job descriptions that users may input.
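
For illustration, a minimal sketch of loading and inspecting this dataset with pandas; the `postings.csv` file name and the column names are assumptions, not details from the project:

```
import pandas as pd

# Assumed file and column names for the Kaggle LinkedIn postings dataset
postings = pd.read_csv('postings.csv')

print(len(postings))  # roughly 15,000 job postings
print(postings[['company_name', 'title', 'description']].head())
```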

I synthetically generated the other two datasets, due to the lack of available datasets. For both generated datasets, I used the Llama-3.2-1B-Instruct model, due to its ability to efficiently produce accurate natural-language answers as well as technical answers, which was necessary for the interview questions. I used the job postings dataset with few-shot prompting to create the user profile dataset. For each job posting, I had the model create a 'great', a 'mediocre', and a 'bad' user profile. An example of the few-shot prompting was:

```
...
Applicant Profile: Experienced in Python, R, and ML models.
Interview Question: Tell me about a machine learning project you are proud of.
Optimal Answer: I developed a predictive model using Python and scikit-learn to forecast customer churn, achieving 85% accuracy by carefully preprocessing the data and tuning hyperparameters.
```
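
As a rough sketch of how this few-shot generation could be driven with the transformers library (the pipeline setup, helper name, prompt wording, and sampling settings below are illustrative assumptions, not the exact code used):

```
from transformers import pipeline

# Assumed setup: Llama-3.2-1B-Instruct via the text-generation pipeline
generator = pipeline('text-generation', model='meta-llama/Llama-3.2-1B-Instruct')

# One illustrative few-shot example in the same format as above
FEW_SHOT = (
    "Job Title: Data Scientist\n"
    "Description: Build and deploy machine learning models in Python and R.\n"
    "Applicant Profile: Experienced in Python, R, and ML models.\n\n"
)

def generate_profile(job_title, description, quality):
    # quality is one of 'great', 'mediocre', or 'bad'
    prompt = (
        FEW_SHOT
        + f"Job Title: {job_title}\n"
        + f"Description: {description}\n"
        + f"Write a {quality} applicant profile for this job.\n"
        + "Applicant Profile:"
    )
    output = generator(prompt, max_new_tokens=120, do_sample=True)
    # The pipeline returns the prompt plus the completion; keep only the completion
    return output[0]['generated_text'][len(prompt):].strip()
```

Looping a helper like this over every posting with the three quality labels would yield the tiered profile dataset described above.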

After creating this dataset, I uploaded it to my project notebook and reformatted it to make it easier to train on. I created an 'Instruct' column with each row's job title, description, applicant profile, and the prompt 'Generate a relevant interview question and provide an optimal answer using the information from this applicant's profile. Interview Question and Optimal Answer:'. I then combined the interview question and optimal answer into one column labeled 'Answer'.
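
A minimal sketch of that reformatting step with pandas, assuming hypothetical source column names ('title', 'description', 'profile', 'question', 'optimal_answer'):

```
import pandas as pd

PROMPT = ("Generate a relevant interview question and provide an optimal answer "
          "using the information from this applicant's profile. "
          "Interview Question and Optimal Answer:")

def reformat_for_training(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Fold the job title, description, and applicant profile into one instruction string
    df['Instruct'] = ('Job Title: ' + df['title']
                      + '\nDescription: ' + df['description']
                      + '\nApplicant Profile: ' + df['profile']
                      + '\n' + PROMPT)
    # Merge the generated question and optimal answer into a single target column
    df['Answer'] = ('Interview Question: ' + df['question']
                    + '\nOptimal Answer: ' + df['optimal_answer'])
    return df[['Instruct', 'Answer']]
```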

I established a training, validation, and testing split using scikit-learn's train_test_split function and pandas' .sample() method for shuffling. The proportions are as follows:

- Training: 3,200 examples (64% of total)
- Validation: 800 examples (16% of total)
- Testing: 1,000 examples (20% of total)
- Random seed: 42
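
The initial 20% test holdout is not shown in the excerpt below; a minimal sketch with scikit-learn's train_test_split that matches the counts above (1,000 of 5,000 examples held out), where the `career_df` name is an assumption, would be:

```
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing; seed 42 for reproducibility
train_career, test_career = train_test_split(career_df, test_size=0.2, random_state=42)
```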
The remaining examples were then shuffled, split 80/20 into training and validation sets (64%/16% of the total), converted to Hugging Face Datasets, and tokenized on the 'Instruct' column:

```
from datasets import Dataset

# Shuffle the remaining data with the same seed for reproducibility
career = train_career.sample(frac=1, random_state=42)

# 80/20 split of the remainder into training and validation sets
train_size = int(len(career) * 0.8)
train = career[:train_size]
val = career[train_size:]

# Convert the pandas DataFrames to Hugging Face Datasets
train = Dataset.from_pandas(train)
val = Dataset.from_pandas(val)

# Tokenize the 'Instruct' column (tokenizer is the model's tokenizer, loaded earlier)
train = train.map(lambda samples: tokenizer(samples['Instruct']), batched=True)
val = val.map(lambda samples: tokenizer(samples['Instruct']), batched=True)
```
## Methodology