as well as technical answers, which was necessary for the interview questions. I used the job postings dataset with few-shot prompting to create the user profile dataset. For each job posting in the dataset, I had the model create a 'great', a 'mediocre', and a 'bad' user profile. An example of the few-shot prompting for this was:

```python
Job_Title: Software Engineer
Profile_Type: Great
Applicant_Profile:
Education: Master's in Computer Science
Experience: 5 years building backend systems
Skills: Python, SQL, Git, CI/CD
Certifications: AWS Developer
Training: Agile methodology
Additional Qualifications: Mentorship experience
```
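
The generation script itself isn't included in the README. A minimal sketch of how this few-shot prompting could be driven with the `transformers` library is below; the model ID matches the one named above, but the helper name, prompt assembly, and decoding settings are illustrative assumptions:

```python
from transformers import pipeline

# Load the instruct model named in the README.
generator = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")

# One few-shot example in the profile format shown above.
FEW_SHOT = (
    "Job_Title: Software Engineer\n"
    "Profile_Type: Great\n"
    "Applicant_Profile:\n"
    "Education: Master's in Computer Science\n"
    "Experience: 5 years building backend systems\n"
    "Skills: Python, SQL, Git, CI/CD\n"
    "Certifications: AWS Developer\n"
    "Training: Agile methodology\n"
    "Additional Qualifications: Mentorship experience\n\n"
)

def make_profile(job_title: str, profile_type: str) -> str:
    # Append the target job and desired quality; the model completes the profile.
    prompt = FEW_SHOT + f"Job_Title: {job_title}\nProfile_Type: {profile_type}\nApplicant_Profile:\n"
    out = generator(prompt, max_new_tokens=200, do_sample=True)
    # The pipeline output contains the prompt plus the completion; keep the completion.
    return out[0]["generated_text"][len(prompt):]
```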

Due to the long user profiles that were generated, the CSV this created was over 2 GB, which is too large for Excel to handle. I used a Python script to randomly select 5000 rows.
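
That selection code isn't shown in the README; a minimal pandas sketch of the subsampling step (the file names and the seed are assumptions) could look like:

```python
import pandas as pd

# Hypothetical file names; the full generated CSV is roughly 2 GB.
full = pd.read_csv("user_profiles_full.csv")

# Randomly keep 5000 rows; fixing the seed makes the subset reproducible.
subset = full.sample(n=5000, random_state=42)
subset.to_csv("user_profiles_subset.csv", index=False)
```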

With my new subset dataset, I used the Llama-3.2-1B-Instruct model again to create the interview/answer data. For each job posting/user profile pair, I had the model create an interview question based on the job description, then an optimal answer to that question based on the user profile. An example of a few-shot prompt I used is below.

```python
Job Title: Data Scientist
Applicant Profile: Experienced in Python, R, and ML models.
Interview Question: Tell me about a machine learning project you are proud of.
Optimal Answer: I developed a predictive model using Python and scikit-learn to forecast customer churn, achieving 85% accuracy by carefully preprocessing the data and tuning hyperparameters.
```

After creating this dataset, I uploaded it to my project notebook. Then I reformatted the dataset to make it easier to train on: I created an 'Instruct' column containing each row's job title, description, and applicant profile, followed by the prompt 'Generate a relevant interview question and provide an optimal answer using the information from this applicant's profile. Interview Question and Optimal Answer:'. Finally, I combined the interview question and optimal answer into a single column labeled 'Answer'.
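
The reformatting code isn't reproduced in the README; a minimal pandas sketch (the raw column names here are assumptions) might be:

```python
PROMPT = ("Generate a relevant interview question and provide an optimal answer "
          "using the information from this applicant's profile. "
          "Interview Question and Optimal Answer:")

# Column names are assumptions; adjust them to the actual CSV headers.
career["Instruct"] = (
    "Job Title: " + career["Job_Title"] + "\n"
    "Job Description: " + career["Job_Description"] + "\n"
    "Applicant Profile: " + career["Applicant_Profile"] + "\n"
    + PROMPT
)

# Combine the two generated fields into a single training target.
career["Answer"] = (
    "Interview Question: " + career["Interview_Question"] + "\n"
    "Optimal Answer: " + career["Optimal_Answer"]
)
```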

I established a training, validation, and testing split of the data with the following lines:

```python
from sklearn.model_selection import train_test_split
from datasets import Dataset

# Hold out 1000 rows for testing.
train_career, test_career = train_test_split(career, test_size=1000, random_state=42)

# Shuffle the remaining rows before splitting them further.
career = train_career.sample(frac=1, random_state=42)

# 80/20 train/validation split of what's left.
train_size = int(len(career) * 0.8)
train = career[:train_size]
val = career[train_size:]
train = Dataset.from_pandas(train)
val = Dataset.from_pandas(val)

# Tokenize the 'Instruct' column for both splits ('tokenizer' is loaded earlier in the notebook).
train = train.map(lambda samples: tokenizer(samples['Instruct']), batched=True)
val = val.map(lambda samples: tokenizer(samples['Instruct']), batched=True)
```
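
With the 5000-row subset, this holds out 1000 rows for testing and splits the remaining 4000 rows into 3200 for training and 800 for validation (an 80/20 split), all with random seed 42.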
## Methodology