upload project copy for huggingface deployment
- .gitattributes +3 -0
- .gitignore +5 -0
- Flowchart of aws deployment.png +0 -0
- README.md +142 -20
- app.py +131 -0
- architecture_diagram.py +30 -0
- extract_skills.py +151 -0
- plot_aws_deployment.py +55 -0
- requirements-dev.txt +2 -0
- requirements.txt +0 -0
- scores.py +74 -0
- screenshot of skeegap page.png +0 -0
- skee_gap.log +0 -0
- skee_gap_architecture.png +0 -0
- skeegap mermaid chart.png +3 -0
- skeegap results.png +3 -0
- skills.csv +3 -0
- timeline.txt +113 -0
.gitattributes
CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+skeegap[[:space:]]mermaid[[:space:]]chart.png filter=lfs diff=lfs merge=lfs -text
+skeegap[[:space:]]results.png filter=lfs diff=lfs merge=lfs -text
+skills.csv filter=lfs diff=lfs merge=lfs -text
.gitignore
ADDED
@@ -0,0 +1,5 @@
skee_gap_env
__pycache__
static
backup
*.pdf
Flowchart of aws deployment.png
ADDED
README.md
CHANGED
@@ -1,20 +1,142 @@

# Project Skee Gap

## Resume-to-Job Skill Matcher

A resume and job description skill extraction and matching tool that leverages Natural Language Processing (NLP), Sentence Transformers, and fuzzy matching to help candidates and recruiters quickly identify how well a resume aligns with a given job description.
This project provides a Streamlit-based web application where users can:

1. Upload or paste a resume (.pdf, .txt, or .docx)
2. Upload or paste a job description
3. Extract and match skills against a skills knowledge base (skills.csv)
4. View expandable sections of Matched Skills and Missing Skills
5. Gain insights into resume-job fit in an interactive and user-friendly manner
## 🔑 Key Features

1. Skill Extraction from Resume: Uses spaCy and regex rules to extract relevant skills from the resume text.
2. Job Description Skill Extraction: Identifies required skills from the job description using NLP and similarity scoring.
3. Fuzzy Matching with Skills Dataset: Even if the exact skill wording differs, fuzzy string matching ensures relevant skills are captured (e.g., “PyTorch” vs. “Torch”); see the sketch after this list.
4. Sentence Transformer Embeddings: Improves semantic matching of skills by comparing embeddings rather than just keywords.
5. Expandable Matched/Missing Skills Panels: Avoids screen clutter by hiding detailed lists until the user chooses to expand them.
6. Interactive Streamlit App: Clean, user-friendly interface to upload resumes, paste job descriptions, and instantly view results.
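A minimal sketch of the fuzzy-matching step, using the same RapidFuzz scorer (token_sort_ratio) and default threshold (88) as extract_skills.py:

```python
from rapidfuzz import fuzz, process

SKILLS = ["python", "machine learning", "data analysis", "pytorch"]

def fuzzy_lookup(candidate: str, threshold: int = 88) -> str | None:
    """Return the closest known skill for a candidate phrase, or None."""
    # token_sort_ratio is order-insensitive, so word-order differences still match
    result = process.extractOne(candidate, SKILLS, scorer=fuzz.token_sort_ratio)
    if result:
        matched_skill, score, _ = result
        if score >= threshold:
            return matched_skill
    return None

print(fuzzy_lookup("learning machine"))  # -> "machine learning"
print(fuzzy_lookup("blockchain"))        # -> None (no close match in SKILLS)
```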
## 📂 Project Structure

```
resume-skill-matcher/
│
├── app.py               # Streamlit main app
├── extract_skills.py    # Skill extraction + fuzzy/semantic matching logic
├── scores.py            # Computes similarity and skill-match scores
├── extract_text.py      # Extracts text from the resume and job description
├── skills.csv           # Master skills dataset (expandable/customizable)
├── requirements.txt     # Project dependencies
├── README.md            # Documentation (this file)
└── sample_resumes/      # (Optional) Example resumes for testing
```
## 🏗️ Architecture Overview

```mermaid
flowchart TD
    A[User Uploads Resume] -->|PDF/DOCX Parsing| B[Resume Text Extraction]
    C[User Pastes Job Description] --> D[Job Description Text Extraction]

    B --> E[Skill Extraction via spaCy + Regex]
    D --> F[Skill Extraction via spaCy + Regex]

    E --> G[Fuzzy Matching with skills.csv]
    F --> G

    G --> H[Sentence Transformer Embeddings for Semantic Similarity]
    H --> I[Matched Skills / Missing Skills Classification]

    I --> J[Streamlit Expandable Panels]
```
## 🧩 Skills Dataset (skills.csv)

The skills dataset used in this project is from Skill2vec (2017):

```bibtex
@article{van2017skill2vec,
  title={Skill2vec: Machine Learning Approach for Determining the Relevant Skills from Job Description},
  author={Van-Duyet, Le and Quan, Vo Minh and An, Dang Quang},
  journal={arXiv preprint arXiv:1707.09751},
  year={2017}
}
```

The file skills.csv contains the reference set of skills used for matching. You can customize it for different industries.
Example structure:

```
skill
Python
SQL
Data Analysis
Machine Learning
Deep Learning
Project Management
Docker
AWS
```

You can expand this dataset to cover domain-specific skill sets (e.g., finance, healthcare, cybersecurity). At load time the app flattens and normalizes this file, as sketched below.
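A condensed sketch of that normalization, following load_skills_set in extract_skills.py (stop-word filtering omitted for brevity):

```python
import pandas as pd

def load_skills(csv_path: str = "skills.csv") -> list[str]:
    """Read skills.csv and return a sorted, deduplicated, lowercased skill list."""
    df = pd.read_csv(csv_path, header=None, dtype=str, low_memory=False)
    skills = set()
    for cell in df.values.flatten():   # cells may hold comma-separated skills
        if isinstance(cell, str):
            for skill in cell.split(","):
                skill = skill.strip().lower()
                if len(skill) > 2:     # drop empty and very short fragments
                    skills.add(skill)
    return sorted(skills)
```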
## 📊 Example Workflow

1. Upload resume.pdf
2. Paste the job description into the app
3. Click Analyze
4. View the results:
   - ✅ Matched Skills (expandable list)
   - ❌ Missing Skills (expandable list)
5. Use the insights to improve your resume or assess candidate fit.
## 🚀 Roadmap / Future Enhancements

- Add a resume scoring system (percentage match score)
- Generate recommendations for missing skills
- Extend support for multi-page resumes and multiple job postings
- Add export of results to PDF/Excel
- Enhance semantic similarity using large language models (LLMs)
- Integrate with LinkedIn or job boards for automatic skill extraction
## 📜 License

This project is licensed under the MIT License.

## 🙌 Acknowledgments

- spaCy for NLP pipelines
- Sentence Transformers for semantic similarity
- RapidFuzz for efficient fuzzy string matching
- Streamlit for building the interactive web app
- Skill2vec (Van-Duyet et al., 2017) for the skills dataset (full citation above)
app.py
ADDED
@@ -0,0 +1,131 @@
import streamlit as st
from utils.extract_text import extract_text_from_pdf, extract_text_from_docx, extract_text_from_txt
from extract_skills import process_resume_and_job_wrapper, load_nlp, load_embedder, load_skills_set
from scores import compute_similarity, compute_skill_match, interpret_similarity
import gc
import psutil
from concurrent.futures import ProcessPoolExecutor

st.set_page_config(page_title="Resume Analyzer", layout="wide")
st.title("📊 Resume vs Job Description Analyzer")

MIN_CHAR_COUNT = 300

# --- Layout: 2 columns ---
col1, col2 = st.columns(2)

# Resume Input (left)
with col1:
    st.subheader("Resume")
    resume_text = ""
    resume_file = st.file_uploader(
        "Upload Resume (PDF/DOCX/TXT)", type=["pdf", "docx", "txt"], key="resume")
    if resume_file:
        if resume_file.type == "application/pdf":
            resume_text = extract_text_from_pdf(resume_file)
        elif resume_file.type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
            resume_text = extract_text_from_docx(resume_file)
        elif resume_file.type == "text/plain":
            resume_text = extract_text_from_txt(resume_file)

    resume_text_paste = st.text_area(
        "Or paste resume text here", height=200, key="resume_paste")
    if resume_text_paste.strip():
        resume_text = resume_text_paste

    if resume_text:
        st.write(f"**Characters:** {len(resume_text)}")
        if len(resume_text) < MIN_CHAR_COUNT:
            st.warning(
                "⚠️ Resume text seems too short; extraction may have failed.")

# Job Input (right)
with col2:
    st.subheader("Job Description")
    job_text = ""
    job_file = st.file_uploader(
        "Upload Job Description (PDF/DOCX/TXT)", type=["pdf", "docx", "txt"], key="job")
    if job_file:
        if job_file.type == "application/pdf":
            job_text = extract_text_from_pdf(job_file)
        elif job_file.type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
            job_text = extract_text_from_docx(job_file)
        elif job_file.type == "text/plain":
            job_text = extract_text_from_txt(job_file)

    job_text_paste = st.text_area(
        "Or paste job description here", height=200, key="job_paste")
    if job_text_paste.strip():
        job_text = job_text_paste

    if job_text:
        st.write(f"**Characters:** {len(job_text)}")
        if len(job_text) < MIN_CHAR_COUNT:
            st.warning(
                "⚠️ Job description seems too short; extraction may have failed.")

# --- Centered Analyze Button ---
col1, col2, col3 = st.columns([1, 1, 1])
with col2:
    analyze_button = st.button("🔍 Compare")

# Initialize session state for analyzing flag
if 'analyzing' not in st.session_state:
    st.session_state.analyzing = False

# --- Action ---
if analyze_button:
    if not resume_text or not job_text:
        st.error("Please provide both the resume and the job description.")
    elif st.session_state.analyzing:
        st.warning("Another analysis is in progress. Please wait.")
    else:
        st.session_state.analyzing = True
        try:
            with st.spinner("🔄 Analyzing your resume ↔ job description... 🚀"):
                # Offload to a subprocess so the UI stays responsive
                with ProcessPoolExecutor(max_workers=1) as executor:
                    future = executor.submit(
                        process_resume_and_job_wrapper,
                        resume_text,
                        job_text,
                        load_nlp(),
                        load_embedder(),
                        load_skills_set()
                    )
                    results = future.result()

                similarity_score = compute_similarity(
                    results["resume_embedding"], results["job_embedding"])
                skill_match = compute_skill_match(
                    results["resume_skills"], results["job_skills"], job_text, top_n=30)

                # Results Header
                st.subheader("📈 Results")
                st.metric("Cosine Similarity Score", f"{similarity_score:.2f}")
                st.info(interpret_similarity(similarity_score))

                # --- Split Results into 2 Columns ---
                res_col1, res_col2 = st.columns(2)

                with res_col1:
                    with st.expander("✅ Matched Skills: click to expand"):
                        st.success(f"{len(skill_match['overlap'])} skills matched")
                        st.write(", ".join(
                            skill_match["overlap"]) if skill_match["overlap"] else "No matched skills found")

                with res_col2:
                    with st.expander("❌ Top Missing Skills: click to expand"):
                        st.error(f"{len(skill_match['missing'])} skills missing")
                        st.write(", ".join(
                            skill_match["missing"]) if skill_match["missing"] else "No missing skills found")

                # Memory monitoring
                st.write(
                    f"Memory usage after analysis: {psutil.Process().memory_info().rss / (1024 ** 2):.2f} MB")

        finally:
            st.session_state.analyzing = False
            gc.collect()  # Force garbage collection
architecture_diagram.py
ADDED
@@ -0,0 +1,30 @@
from graphviz import Digraph

# Create Digraph
dot = Digraph("ResumeAnalyzer", format="png")
dot.attr(rankdir="TB", size="8")

# Nodes
dot.node("input", "User Input\n- Upload Resume (PDF)\n- Paste/Upload Job Description", shape="box", style="filled", fillcolor="#cce5ff")

dot.node("preprocess", "Data Preprocessing\n- Extract & Clean Text\n- Tokenization\n- Skills/NER Extraction", shape="box", style="filled", fillcolor="#e2e3e5")

dot.node("embedding", "Embedding & Feature Extraction\n- Sentence Transformers\n- Resume Embedding\n- JD Embedding\n- Skills Dictionary", shape="box", style="filled", fillcolor="#d4edda")

dot.node("scoring", "Matching & Scoring Engine\n- Cosine Similarity\n- Skill Overlap\n- Match Score (0-100%)", shape="box", style="filled", fillcolor="#fff3cd")

dot.node("viz", "Visualization & Results\n- Score Gauge\n- Matched/Missing Skills\n- Wordcloud / Venn Diagram\n- Feedback Report", shape="box", style="filled", fillcolor="#f8d7da")

dot.node("webapp", "Web App Layer (Streamlit)\n- Interactive UI\n- Upload Widgets\n- Real-time Scoring\n- Public URL", shape="box", style="filled", fillcolor="#d1ecf1")

# Edges
dot.edges([("input", "preprocess"),
           ("preprocess", "embedding"),
           ("embedding", "scoring"),
           ("scoring", "viz"),
           ("viz", "webapp")])

# Render diagram and report the output path
file_path = "./skee_gap_architecture"
dot.render(file_path, format="png", cleanup=True)
print(file_path + ".png")
extract_skills.py
ADDED
@@ -0,0 +1,151 @@
from __future__ import annotations

import re
from typing import Any, Dict, List

import os
import numpy as np
import pandas as pd
from spacy.lang.en.stop_words import STOP_WORDS
from rapidfuzz import fuzz, process
import streamlit as st

from utils.logging_config import get_logger

logger = get_logger(__name__)


@st.cache_resource
def load_nlp():
    import spacy
    nlp = spacy.load("en_core_web_sm")
    # Optimize: disable unused pipes
    nlp.select_pipes(disable=["ner", "senter"])
    logger.debug("spaCy model loaded with optimized pipes")
    return nlp


@st.cache_resource
def load_embedder():
    from sentence_transformers import SentenceTransformer
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    logger.debug("SentenceTransformer model loaded")
    return embedder


@st.cache_data
def load_skills_set(csv_path: str = "skills.csv") -> List[str]:
    """Read skills from a CSV file and return a deduplicated, normalized list."""
    try:
        df = pd.read_csv(csv_path, header=None, dtype=str, low_memory=False)
    except Exception as exc:
        logger.exception("Failed to read skills CSV '%s': %s", csv_path, exc)
        return []

    skills_set = set()
    for row in df.values.flatten():
        if isinstance(row, str):
            for skill in row.split(","):
                clean_skill = skill.strip().lower()
                if clean_skill and len(clean_skill) > 2 and clean_skill not in STOP_WORDS:
                    skills_set.add(clean_skill)

    skills = sorted(skills_set)
    logger.debug("Loaded %d skills", len(skills))
    return skills


def clean_text(text: str) -> str:
    if not text:
        return ""
    text = text[:10000]  # Truncate for memory safety
    text = text.lower()
    text = re.sub(r"\S+@\S+", " ", text)              # emails
    text = re.sub(r"http\S+|www\.\S+", " ", text)     # URLs
    text = re.sub(r"\+?\d[\d\s\-]{7,}\d", " ", text)  # phone numbers
    text = re.sub(r"[^a-z0-9\s]", " ", text)          # special characters
    text = re.sub(r"\s+", " ", text).strip()
    return text


def extract_skills(skills_set: List[str], text: str, fuzzy_threshold: int = 88) -> Dict[str, Any]:
    if not text:
        return {"dict_skills": [], "fuzzy_skills": []}

    nlp = load_nlp()
    doc = nlp(clean_text(text))

    candidates = set([t.text.lower()
                      for t in doc if t.is_alpha and t.text.lower() not in STOP_WORDS])
    candidates.update([chunk.text.strip().lower()
                       for chunk in doc.noun_chunks if 2 <= len(chunk.text.strip()) <= 40])
    candidates = list(candidates)[:200]  # Limit for memory/efficiency

    dict_matches = set()
    fuzzy_matches = set()

    for cand in candidates:
        if cand in skills_set:
            dict_matches.add(cand)
            continue

        try:
            res = process.extractOne(
                cand, skills_set, scorer=fuzz.token_sort_ratio)
            if res:
                matched_skill, score, _ = res
                if score >= fuzzy_threshold:
                    fuzzy_matches.add(matched_skill)
        except Exception:
            logger.debug("Fuzzy match error for candidate: %s", cand)

    return {"dict_skills": sorted(dict_matches), "fuzzy_skills": sorted(fuzzy_matches)}


def get_embeddings(text: str, embedder):
    """Return embedding for text, with caching."""
    import hashlib
    cache_dir = os.path.join(os.path.dirname(__file__), ".cache", "embeddings")
    os.makedirs(cache_dir, exist_ok=True)

    cleaned = clean_text(text)
    key = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
    cache_path = os.path.join(cache_dir, f"{key}.npy")

    try:
        if os.path.exists(cache_path):
            logger.debug("Loading embedding from cache: %s", cache_path)
            return np.load(cache_path)
    except Exception:
        logger.debug("Failed to load embedding cache at %s", cache_path)

    emb = embedder.encode([cleaned])[0]

    try:
        np.save(cache_path, emb)
        logger.debug("Saved embedding to cache: %s", cache_path)
    except Exception:
        logger.debug("Failed to save embedding cache at %s", cache_path)

    return emb


def process_resume_and_job_wrapper(resume_text: str, job_text: str, nlp, embedder, skills_set):
    """Wrapper for picklability in multiprocessing."""
    resume_clean = clean_text(resume_text)
    job_clean = clean_text(job_text)

    resume_skills = extract_skills(skills_set, resume_clean)
    job_skills = extract_skills(skills_set, job_clean)

    resume_emb = get_embeddings(resume_clean, embedder)
    job_emb = get_embeddings(job_clean, embedder)

    return {
        "resume_clean": resume_clean,
        "job_clean": job_clean,
        "resume_skills": resume_skills,
        "job_skills": job_skills,
        "resume_embedding": resume_emb,
        "job_embedding": job_emb,
    }
plot_aws_deployment.py
ADDED
@@ -0,0 +1,55 @@
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

# Create figure
fig, ax = plt.subplots(figsize=(12, 8))

# Define components and their positions
components = {
    "S3\n(Storage)": (0.1, 0.8),
    "Lambda\n(Text Extraction)": (0.4, 0.8),
    "SageMaker\n(Embeddings & Similarity)": (0.7, 0.8),
    "DynamoDB\n(Cache)": (0.7, 0.5),
    "API Gateway": (0.4, 0.5),
    "Lambda\n(Backend)": (0.55, 0.5),
    "Streamlit\n(Frontend - EC2/Amplify)": (0.4, 0.2),
    "CloudWatch\n(Logs)": (0.85, 0.65),
    "CloudTrail\n(Auditing)": (0.85, 0.45)
}

# Draw boxes for components
for comp, (x, y) in components.items():
    ax.add_patch(mpatches.FancyBboxPatch(
        (x, y), 0.18, 0.1, boxstyle="round,pad=0.05",
        edgecolor="black", facecolor="lightblue"
    ))
    ax.text(x + 0.09, y + 0.05, comp, ha="center", va="center", fontsize=9)

# Define connections (start -> end)
connections = [
    ("S3\n(Storage)", "Lambda\n(Text Extraction)"),
    ("Lambda\n(Text Extraction)", "SageMaker\n(Embeddings & Similarity)"),
    ("SageMaker\n(Embeddings & Similarity)", "DynamoDB\n(Cache)"),
    ("API Gateway", "Lambda\n(Backend)"),
    ("Lambda\n(Backend)", "SageMaker\n(Embeddings & Similarity)"),
    ("Lambda\n(Backend)", "DynamoDB\n(Cache)"),
    ("Streamlit\n(Frontend - EC2/Amplify)", "API Gateway"),
    ("Lambda\n(Backend)", "CloudWatch\n(Logs)"),
    ("API Gateway", "CloudTrail\n(Auditing)")
]

# Draw arrows between box centers
for start, end in connections:
    x1, y1 = components[start]
    x2, y2 = components[end]
    ax.annotate("", xy=(x2 + 0.09, y2 + 0.05), xytext=(x1 + 0.09, y1 + 0.05),
                arrowprops=dict(arrowstyle="->", lw=1.2))

# Formatting
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.axis("off")
ax.set_title("AWS Architecture for Resume-JD Skill Matching System",
             fontsize=14, weight="bold")

plt.show()
requirements-dev.txt
ADDED
@@ -0,0 +1,2 @@
pytest
flake8
requirements.txt
CHANGED
Binary files a/requirements.txt and b/requirements.txt differ
scores.py
ADDED
@@ -0,0 +1,74 @@
from sklearn.metrics.pairwise import cosine_similarity
from utils.logging_config import get_logger
import numpy as np
from typing import Any, Dict
from collections import Counter

logger = get_logger(__name__)


def compute_similarity(resume_emb: Any, job_emb: Any) -> float:
    try:
        if resume_emb is None or job_emb is None:
            logger.warning("One or both embeddings are None")
            return 0.0
        score = float(cosine_similarity([resume_emb], [job_emb])[0][0])
        logger.debug("Computed cosine similarity: %s", score)
        return score
    except Exception as exc:
        logger.exception("Failed to compute similarity: %s", exc)
        return 0.0


def compute_skill_match(resume_skills: Dict[str, Any], job_skills: Dict[str, Any], job_text: str, top_n: int = 20) -> Dict[str, Any]:
    try:
        resume_set = set(resume_skills.get("dict_skills", []) +
                         resume_skills.get("fuzzy_skills", []))
        job_list = job_skills.get("dict_skills", []) + \
            job_skills.get("fuzzy_skills", [])
        job_set = set(job_list)

        overlap = resume_set & job_set
        missing = job_set - resume_set

        if len(job_set) == 0:
            skill_score = 0.0
        else:
            skill_score = len(overlap) / len(job_set)

        # --- Rank missing skills by frequency in job description ---
        # (single-token counting: multi-word skills get a frequency of 0)
        job_tokens = [t.lower() for t in job_text.split()]
        freq_counter = Counter(job_tokens)

        ranked_missing = sorted(
            missing,
            key=lambda skill: freq_counter.get(skill.lower(), 0),
            reverse=True
        )[:top_n]

        result = {
            "skill_score": round(skill_score, 2),
            "overlap": sorted(list(overlap)),
            "missing": ranked_missing,  # limited & ranked
        }
        logger.debug("Computed skill match with ranking: %s", result)
        return result
    except Exception as exc:
        logger.exception("Failed to compute skill match: %s", exc)
        return {"skill_score": 0.0, "overlap": [], "missing": []}


def interpret_similarity(score: float) -> str:
    try:
        if score >= 0.8:
            return "✅ Excellent match! You should definitely apply for this job."
        elif score >= 0.65:
            return "👍 Good match. You stand a strong chance — applying is recommended."
        elif score >= 0.5:
            return "⚠️ Partial match. Consider improving your resume by adding missing relevant skills."
        else:
            return "❌ Weak match. Your resume and the job description differ significantly. Tailoring your resume is highly recommended."
    except Exception as exc:
        logger.exception("Failed to interpret similarity score: %s", exc)
        return "Score interpretation unavailable."
screenshot of skeegap page.png
ADDED
skee_gap.log
ADDED
The diff for this file is too large to render.
skee_gap_architecture.png
ADDED
skeegap mermaid chart.png
ADDED
skeegap results.png
ADDED
skills.csv
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:aecb68c83997b763a7014ff908c346dabbb1a2f7486fc3db02087fbab9a269c9
size 53615877
timeline.txt
ADDED
@@ -0,0 +1,113 @@
Timeline (1–2 Weeks)

Week 1
- Day 1–2 → Set up the project; get resume/job input working.
- Day 3–4 → Implement the NLP pipeline (skill extraction, embeddings).
- Day 5 → Add similarity scoring + results visualization.

Week 2
- Day 6–7 → Build the Streamlit front end.
- Day 8 → Deploy to Streamlit Cloud.
- Day 9 → Test with your own resumes + real job postings.
- Day 10 → Polish visuals + record a short demo.
- Day 11–12 → Write an engaging LinkedIn launch post.


🔹 Step 2: NLP Preprocessing & Skill Extraction

Goal:
Prepare resume & job description text for comparison and scoring. Extract relevant skills, keywords, and entities to build features for similarity analysis.

1️⃣ Text Cleaning & Normalization

Objectives: Remove noise, standardize text.

Actions:
- Lowercase all text.
- Remove emails, URLs, and phone numbers.
- Remove special characters and extra whitespace.
- Optional: lemmatization (reduce words to their root form).

Python tools: re for regex; nltk or spaCy for tokenization & lemmatization. A sketch of this cleaning step follows.
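A minimal sketch, condensed from clean_text in extract_skills.py:

    import re

    def clean_text(text: str) -> str:
        """Lowercase, strip emails/URLs/phone numbers, normalize whitespace."""
        text = text.lower()
        text = re.sub(r"\S+@\S+", " ", text)              # emails
        text = re.sub(r"http\S+|www\.\S+", " ", text)     # URLs
        text = re.sub(r"\+?\d[\d\s\-]{7,}\d", " ", text)  # phone numbers
        text = re.sub(r"[^a-z0-9\s]", " ", text)          # special characters
        return re.sub(r"\s+", " ", text).strip()          # extra whitespace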
2️⃣ Tokenization & Stopword Removal

Objectives: Split text into words or phrases and remove unimportant/common words.

Actions:
- Tokenize text using spaCy or nltk.word_tokenize.
- Remove stopwords (the, a, and, of...).
- Optional: remove very short tokens (<2 characters).

A spaCy sketch of this step follows.
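A minimal sketch (assumes the en_core_web_sm model is installed):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def tokenize(text: str) -> list[str]:
        """Keep alphabetic, non-stopword tokens of 2+ characters."""
        doc = nlp(text)
        return [t.text.lower() for t in doc
                if t.is_alpha and not t.is_stop and len(t.text) >= 2]

    print(tokenize("Built dashboards in Python and SQL for the analytics team."))
    # e.g. ['built', 'dashboards', 'python', 'sql', 'analytics', 'team']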
3️⃣ Skill/Keyword Extraction

Two approaches:

A. Predefined Skills Dictionary
- Build or use a skills list (from datasets like skills-ner or StackOverflow/LinkedIn skill lists).
- Match resume & job description text against this list (case-insensitive substring match or fuzzy match).
- Output: a list of skills found in the resume and in the job description.

B. NLP-based Extraction
- Use NER (Named Entity Recognition) to detect skills/entities in text; the spaCy pretrained model (en_core_web_sm) can identify ORG, WORK_OF_ART, etc.
- Optional: train a custom NER model for skills (advanced).
- Use noun-chunk extraction to find phrases like "data analysis", "machine learning", "Python programming"; a sketch follows.
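A minimal noun-chunk sketch (doc.noun_chunks requires the dependency parser, which en_core_web_sm includes):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def noun_chunk_candidates(text: str) -> set[str]:
        """Collect noun phrases as candidate skills."""
        doc = nlp(text)
        return {chunk.text.strip().lower() for chunk in doc.noun_chunks
                if 2 <= len(chunk.text.strip()) <= 40}

    print(noun_chunk_candidates("Experience with machine learning and data analysis in Python."))
    # e.g. {'experience', 'machine learning', 'data analysis', 'python'}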
4️⃣ Feature Vectorization / Embeddings

Convert resume & job description text into numerical form for similarity analysis.

Options:
- TF-IDF vectors (basic, interpretable).
- Sentence embeddings using sentence-transformers (state of the art). Example model: all-MiniLM-L6-v2, which generates a 384-dimensional vector for each text; a sketch follows.
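A minimal embedding sketch with sentence-transformers:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    resume_emb = model.encode("Python developer with SQL and machine learning experience.")
    job_emb = model.encode("Looking for a machine learning engineer skilled in Python.")

    print(resume_emb.shape)  # (384,): one 384-dimensional vector per text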
5️⃣ Output from Step 2

After Step 2, you'll have:
- Cleaned text for the resume & job description.
- A list of skills/keywords extracted from each.
- Vector embeddings for similarity scoring.

Next Steps After Step 2

Step 3 → Similarity & Scoring Engine:
- Compute cosine similarity between the resume & job embeddings.
- Compare the skills lists → compute missing vs. matched skills.
- Output an overall match score. A sketch of this scoring step follows.
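A minimal scoring sketch, mirroring the logic in scores.py (plain numpy cosine instead of scikit-learn):

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine similarity between two embedding vectors."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def skill_overlap(resume_skills: set[str], job_skills: set[str]) -> dict:
        """Matched/missing skills plus the share of job skills covered."""
        matched = resume_skills & job_skills
        missing = job_skills - resume_skills
        score = len(matched) / len(job_skills) if job_skills else 0.0
        return {"matched": sorted(matched), "missing": sorted(missing),
                "score": round(score, 2)}

    print(skill_overlap({"python", "sql"}, {"python", "sql", "docker"}))
    # -> {'matched': ['python', 'sql'], 'missing': ['docker'], 'score': 0.67}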
|