Transformers
Safetensors
English
ocbyram committed on
Commit 21e6212 · verified · 1 Parent(s): 74b3866

Update README.md

Files changed (1)
  1. README.md +8 -18
README.md CHANGED
@@ -29,7 +29,7 @@ on a train/test split. The bert scores specifically indicate that my model has a
29
  I was able to find a training dataset of job postings on Kaggle (Arshkon, 2023), under a project labeled ‘LinkedIn Job Postings 2023 Data Analysis’.
30
  The dataset used has ~15,000 jobs from LinkedIn. It includes the company, job title, and a description that includes necessary skills.
31
  This dataset has a variety of different jobs and descriptions, which allowed my LLM to be trained for a multitude of job descriptions that users may input.
32
- The other two datasets must be synthetically generated, due to the lack of available datasets.
33
  For both generated datasets, I used the Llama-3.2-1B-Instruct model, due to its ability to efficiently produce accurate natural language answers,
34
  as well as technical answers, which was necessary for the interview questions. I used the job postings dataset with few-shot prompting to create the user profile
35
  dataset. For each job posting in the dataset, I had the model create a 'great', 'mediocre', and 'bad' user profile. An example of the few-shot prompting for this was:
@@ -58,28 +58,18 @@ Applicant Profile: Experienced in Python, R, and ML models.
58
  Interview Question: Tell me about a machine learning project you are proud of.
59
  Optimal Answer: I developed a predictive model using Python and scikit-learn to forecast customer churn, achieving 85% accuracy by carefully preprocessing the data and tuning hyperparameters.
60
  ```
61
- After creating this dataset, I uploaded it to my project notebook. Then, I modified the dataset to reformat it and make it easier to train. I modified the dataset
62
- by creating an 'Instruct' column with each row's job title. description, applicant profile, and the prompt 'Generate a relevant interview question and
63
  provide an optimal answer using the information from this applicant's profile. Interview Question and Optimal Answer:'. Then I combined the interview question/optimal answer
64
  into one column labeled 'Answer'.
65
 
66
- I established a training, validation, and testing split of the data with the following lines:
67
 
68
- ```python
69
-
70
- train_career, test_career = train_test_split(career, test_size=1000, random_state=42)
71
-
72
- career = train_career.sample(frac = 1, random_state = 42)
73
 
74
- train_size = int(len(career) * 0.8)
75
- train = career[:train_size]
76
- val = career[train_size:]
77
- train = Dataset.from_pandas(train)
78
- val = Dataset.from_pandas(val)
79
-
80
- train = train.map(lambda samples: tokenizer(samples['Instruct']), batched = True)
81
- val = val.map(lambda samples: tokenizer(samples['Instruct']), batched = True)
82
- ```
83
 
84
  ## Methodology
85
 
 
29
  I was able to find a training dataset of job postings on Kaggle (Arshkon, 2023), under a project labeled ‘LinkedIn Job Postings 2023 Data Analysis’.
30
  The dataset used has ~15,000 jobs from LinkedIn. It includes the company, job title, and a description that includes necessary skills.
31
  This dataset has a variety of different jobs and descriptions, which allowed my LLM to be trained for a multitude of job descriptions that users may input.
32
+ I synthetically generated the other two datasets, as no suitable existing datasets were available.
33
  For both generated datasets, I used the Llama-3.2-1B-Instruct model, due to its ability to efficiently produce accurate natural language answers,
34
  as well as technical answers, which was necessary for the interview questions. I used the job postings dataset with few-shot prompting to create the user profile
35
  dataset. For each job posting in the dataset, I had the model create a 'great', 'mediocre', and 'bad' user profile. An example of the few-shot prompting for this was:
 
58
  Interview Question: Tell me about a machine learning project you are proud of.
59
  Optimal Answer: I developed a predictive model using Python and scikit-learn to forecast customer churn, achieving 85% accuracy by carefully preprocessing the data and tuning hyperparameters.
60
  ```
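A minimal sketch of how a few-shot prompt like the one above might be assembled before being sent to Llama-3.2-1B-Instruct (the helper name, instruction wording, and example posting/profile texts are illustrative assumptions, not the author's actual generation script):

```python
# Illustrative sketch only: builds a few-shot prompt for profile generation.
# The example posting/profile pair below is a placeholder, not a row from the
# real LinkedIn dataset; the resulting string would be handed to
# Llama-3.2-1B-Instruct (e.g. via a transformers text-generation pipeline).
FEW_SHOT_EXAMPLES = [
    {
        "posting": "Data Scientist at Acme Corp. Requires Python, R, and ML models.",
        "profiles": ("great: Experienced in Python, R, and ML models.\n"
                     "mediocre: Some exposure to Python scripting.\n"
                     "bad: No programming or data experience."),
    },
]

def build_profile_prompt(posting: str) -> str:
    """Few-shot examples first, then the new posting left open for the model."""
    parts = ["Create a 'great', 'mediocre', and 'bad' applicant profile "
             "for the job posting."]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Job Posting: {ex['posting']}\nApplicant Profiles:\n{ex['profiles']}")
    # The trailing open field is what the model is asked to complete.
    parts.append(f"Job Posting: {posting}\nApplicant Profiles:")
    return "\n\n".join(parts)

prompt = build_profile_prompt("ML Engineer at Initech. Requires PyTorch and SQL.")
```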
61
+ After creating this dataset, I uploaded it to my project notebook. Then I reformatted the dataset to make it easier to train: I created an 'Instruct' column with each row's job title,
62
+ description, applicant profile, and the prompt 'Generate a relevant interview question and
63
  provide an optimal answer using the information from this applicant's profile. Interview Question and Optimal Answer:'. Then I combined the interview question/optimal answer
64
  into one column labeled 'Answer'.
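The reformatting described above might look roughly like the following pandas sketch. The toy rows and the raw-field column names are assumptions for illustration; only 'Instruct', 'Answer', and the quoted prompt come from the text.

```python
import pandas as pd

# Toy rows standing in for the generated dataset (illustrative only).
career = pd.DataFrame({
    "Job Title": ["Data Scientist"],
    "Description": ["Requires Python, R, and ML models."],
    "Applicant Profile": ["Experienced in Python, R, and ML models."],
    "Interview Question": ["Tell me about a machine learning project you are proud of."],
    "Optimal Answer": ["I developed a predictive model using Python and scikit-learn."],
})

PROMPT = ("Generate a relevant interview question and provide an optimal answer "
          "using the information from this applicant's profile. "
          "Interview Question and Optimal Answer:")

# 'Instruct' concatenates job title, description, profile, and the fixed prompt.
career["Instruct"] = (
    "Job Title: " + career["Job Title"]
    + "\nDescription: " + career["Description"]
    + "\nApplicant Profile: " + career["Applicant Profile"]
    + "\n" + PROMPT
)

# 'Answer' merges each interview question and its optimal answer into one field.
career["Answer"] = (
    "Interview Question: " + career["Interview Question"]
    + "\nOptimal Answer: " + career["Optimal Answer"]
)
```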
65
 
66
+ I established a training, validation, and testing split using scikit-learn's `train_test_split` function and pandas' `.sample()` method for shuffling. The proportions are as follows:
67
 
68
+ - Training: 3,200 examples (64% of total)
69
+ - Validation: 800 examples (16% of total)
70
+ - Testing: 1,000 examples (20% of total)
71
+ - Random seed: 42
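These proportions correspond to the split code that previously appeared in this README; a self-contained sketch (with a dummy 5,000-row DataFrame standing in for the real dataset) is:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Dummy stand-in for the real 5,000-example dataset (illustrative only).
career = pd.DataFrame({"Instruct": [f"instruct {i}" for i in range(5000)],
                       "Answer": [f"answer {i}" for i in range(5000)]})

# Hold out 1,000 examples (20%) for testing.
train_career, test_career = train_test_split(career, test_size=1000, random_state=42)

# Shuffle the remaining 4,000 rows, then split 80/20 into train and validation.
shuffled = train_career.sample(frac=1, random_state=42)
train_size = int(len(shuffled) * 0.8)   # 3,200 examples
train = shuffled[:train_size]
val = shuffled[train_size:]             # 800 examples
```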
 
72
 
73
 
74
  ## Methodology
75