asoyewole committed
Commit 15b4e3f · verified · 1 Parent(s): a5c48c1

upload project copy for huggingface deployment

.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ skeegap[[:space:]]mermaid[[:space:]]chart.png filter=lfs diff=lfs merge=lfs -text
+ skeegap[[:space:]]results.png filter=lfs diff=lfs merge=lfs -text
+ skills.csv filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,5 @@
+ skee_gap_env
+ __pycache__
+ static
+ backup
+ *.pdf
Flowchart of aws deployment.png ADDED
README.md CHANGED
@@ -1,20 +1,142 @@
- ---
- title: Skeegap
- emoji: 🚀
- colorFrom: red
- colorTo: red
- sdk: docker
- app_port: 8501
- tags:
- - streamlit
- pinned: false
- short_description: Resume analyser for comparing resumes with job descriptions
- license: mit
- ---
-
- # Welcome to Streamlit!
-
- Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
-
- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
- forums](https://discuss.streamlit.io).
+ # Project Skee Gap
+
+ ## Resume-to-Job Skill Matcher
+
+ A resume and job description skill extraction and matching tool that leverages Natural Language Processing (NLP), Sentence Transformers, and fuzzy matching to help candidates and recruiters quickly identify how well a resume aligns with a given job description.
+
+ This project provides a Streamlit-based web application where users can:
+
+ 1. Upload or paste a resume (.pdf, .txt, or .docx)
+ 2. Upload or paste a job description
+ 3. Extract and match skills against a skills knowledge base (skills.csv)
+ 4. View expandable sections of Matched Skills and Missing Skills
+ 5. Gain insights into resume-job fit in an interactive and user-friendly manner
+
+ ## 🔑 Key Features
+
+ 1. Skill Extraction from Resume:
+    Uses spaCy and regex rules to extract relevant skills from the resume text.
+ 2. Job Description Skill Extraction:
+    Identifies required skills from the job description using NLP and similarity scoring.
+ 3. Fuzzy Matching with Skills Dataset:
+    Even if the exact skill wording differs, fuzzy string matching ensures relevant skills are captured (e.g., “PyTorch” vs. “Torch”); see the sketch after this list.
+ 4. Sentence Transformer Embeddings:
+    Improves semantic matching of skills by comparing embeddings rather than just keywords.
+ 5. Expandable Matched/Missing Skills Panels:
+    Avoids screen clutter by hiding detailed lists until the user chooses to expand them.
+ 6. Interactive Streamlit App:
+    Clean, user-friendly interface to upload resumes, paste job descriptions, and instantly view results.
+
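+ A minimal sketch of the fuzzy-matching idea (illustrative only; the app's real logic, including the 88-score threshold, lives in extract_skills.py):
+
+ ```python
+ from rapidfuzz import fuzz, process
+
+ skills = ["pytorch", "machine learning", "sql"]
+
+ # Best dictionary match for a resume token that has no exact entry
+ match = process.extractOne("torch", skills, scorer=fuzz.token_sort_ratio)
+ if match:
+     skill, score, _ = match
+     print(skill, round(score))  # pytorch 83; kept only if the score clears the threshold
+ ```
+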
+ ## 📂 Project Structure
+
+ resume-skill-matcher/
+ ├── app.py                   # Streamlit main app
+ ├── extract_skills.py        # Skill extraction + fuzzy/semantic matching logic
+ ├── scores.py                # Computes similarity scores and skill overlap
+ ├── utils/
+ │   └── extract_text.py      # Extracts text from resume and job description
+ ├── skills.csv               # Master skills dataset (expandable/customizable)
+ ├── requirements.txt         # Project dependencies
+ ├── README.md                # Documentation (this file)
+ └── sample_resumes/          # (Optional) Example resumes for testing
+
+ ## 🏗️ Architecture Overview
+
+ flowchart TD
+     A[User Uploads Resume] -->|PDF/DOCX Parsing| B[Resume Text Extraction]
+     C[User Pastes Job Description] --> D[Job Description Text Extraction]
+
+     B --> E[Skill Extraction via spaCy + Regex]
+     D --> F[Skill Extraction via spaCy + Regex]
+
+     E --> G[Fuzzy Matching with skills.csv]
+     F --> G
+
+     G --> H[Sentence Transformer Embeddings for Semantic Similarity]
+     H --> I[Matched Skills / Missing Skills Classification]
+
+     I --> J[Streamlit Expandable Panels]
+
+ ## 🧩 Skills Dataset (skills.csv)
+
+ The skills dataset used in this project is from Skill2vec (2017):
+
+ @article{van2017skill2vec,
+   title={Skill2vec: Machine Learning Approach for Determining the Relevant Skills from Job Description},
+   author={Van-Duyet, Le and Quan, Vo Minh and An, Dang Quang},
+   journal={arXiv preprint arXiv:1707.09751},
+   year={2017}
+ }
+
+ The file skills.csv contains the reference set of skills used for matching. You can customize it for different industries.
+
+ Example structure:
+
+ skill
+ Python
+ SQL
+ Data Analysis
+ Machine Learning
+ Deep Learning
+ Project Management
+ Docker
+ AWS
+
+ You can expand this dataset to cover domain-specific skill sets (e.g., finance, healthcare, cybersecurity).
+
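+ For illustration, a customized skills.csv can be loaded the same way the app's load_skills_set (extract_skills.py) does — cells may hold single skills or comma-separated lists:
+
+ ```python
+ import pandas as pd
+
+ df = pd.read_csv("skills.csv", header=None, dtype=str, low_memory=False)
+ skills = sorted({s.strip().lower()
+                  for cell in df.values.flatten() if isinstance(cell, str)
+                  for s in cell.split(",") if len(s.strip()) > 2})
+ print(len(skills), "skills loaded")
+ ```
+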
+ ## 📊 Example Workflow
+
+ 1. Upload resume.pdf
+ 2. Paste the job description into the app
+ 3. Click Analyze
+ 4. View results:
+    - ✅ Matched Skills (expandable list)
+    - ❌ Missing Skills (expandable list)
+ 5. Use the insights to improve your resume or assess candidate fit.
+
+ ## 🚀 Roadmap / Future Enhancements
+
+ - Add a resume scoring system (percentage match score)
+ - Generate recommendations for missing skills
+ - Extend support for multi-page resumes and multiple job postings
+ - Add export to PDF/Excel for results
+ - Enhance semantic similarity using large language models (LLMs)
+ - Integrate with LinkedIn or job boards for automatic skill extraction
+
+ ## 📜 License
+
+ This project is licensed under the MIT License.
+
+ ## 🙌 Acknowledgments
+
+ - spaCy for NLP pipelines
+ - Sentence Transformers for semantic similarity
+ - RapidFuzz for efficient fuzzy string matching
+ - Streamlit for building the interactive web app
+
+ @article{van2017skill2vec,
+   title={Skill2vec: Machine Learning Approach for Determining the Relevant Skills from Job Description},
+   author={Van-Duyet, Le and Quan, Vo Minh and An, Dang Quang},
+   journal={arXiv preprint arXiv:1707.09751},
+   year={2017}
+ }
app.py ADDED
@@ -0,0 +1,131 @@
+ import streamlit as st
+ from utils.extract_text import extract_text_from_pdf, extract_text_from_docx, extract_text_from_txt
+ from extract_skills import process_resume_and_job_wrapper, load_nlp, load_embedder, load_skills_set
+ from scores import compute_similarity, compute_skill_match, interpret_similarity
+ import gc
+ import psutil
+ from concurrent.futures import ProcessPoolExecutor
+
+ st.set_page_config(page_title="Resume Analyzer", layout="wide")
+ st.title("📊 Resume vs Job Description Analyzer")
+
+ MIN_CHAR_COUNT = 300
+
+ # --- Layout: 2 columns ---
+ col1, col2 = st.columns(2)
+
+ # Resume Input (left)
+ with col1:
+     st.subheader("Resume")
+     resume_text = ""
+     resume_file = st.file_uploader(
+         "Upload Resume (PDF/DOCX/TXT)", type=["pdf", "docx", "txt"], key="resume")
+     if resume_file:
+         if resume_file.type == "application/pdf":
+             resume_text = extract_text_from_pdf(resume_file)
+         elif resume_file.type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
+             resume_text = extract_text_from_docx(resume_file)
+         elif resume_file.type == "text/plain":
+             resume_text = extract_text_from_txt(resume_file)
+
+     resume_text_paste = st.text_area(
+         "Or paste resume text here", height=200, key="resume_paste")
+     if resume_text_paste.strip():
+         resume_text = resume_text_paste
+
+     if resume_text:
+         st.write(f"**Characters:** {len(resume_text)}")
+         if len(resume_text) < MIN_CHAR_COUNT:
+             st.warning(
+                 "⚠️ Resume text seems too short, extraction may have failed.")
+
+ # Job Input (right)
+ with col2:
+     st.subheader("Job Description")
+     job_text = ""
+     job_file = st.file_uploader(
+         "Upload Job Description (PDF/DOCX/TXT)", type=["pdf", "docx", "txt"], key="job")
+     if job_file:
+         if job_file.type == "application/pdf":
+             job_text = extract_text_from_pdf(job_file)
+         elif job_file.type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
+             job_text = extract_text_from_docx(job_file)
+         elif job_file.type == "text/plain":
+             job_text = extract_text_from_txt(job_file)
+
+     job_text_paste = st.text_area(
+         "Or paste job description here", height=200, key="job_paste")
+     if job_text_paste.strip():
+         job_text = job_text_paste
+
+     if job_text:
+         st.write(f"**Characters:** {len(job_text)}")
+         if len(job_text) < MIN_CHAR_COUNT:
+             st.warning(
+                 "⚠️ Job description seems too short, extraction may have failed.")
+
+ # --- Centered Analyze Button (nested columns narrow the centering) ---
+ col1, col2, col3 = st.columns([1, 1, 1])
+ with col2:
+     col1, col2, col3 = st.columns([1, 1, 1])
+     with col2:
+         analyze_button = st.button("🔍 Compare")
+
+ # Initialize session state for analyzing flag
+ if 'analyzing' not in st.session_state:
+     st.session_state.analyzing = False
+
+ # --- Action ---
+ if analyze_button:
+     if not resume_text or not job_text:
+         st.error("Please provide both resume and job description.")
+     elif st.session_state.analyzing:
+         st.warning("Another analysis is in progress. Please wait.")
+     else:
+         st.session_state.analyzing = True
+         try:
+             with st.spinner("🔄 Analyzing your resume ↔ job description... 🚀"):
+                 # Offload to a subprocess so the UI stays responsive
+                 with ProcessPoolExecutor(max_workers=1) as executor:
+                     future = executor.submit(
+                         process_resume_and_job_wrapper,
+                         resume_text,
+                         job_text,
+                         load_nlp(),
+                         load_embedder(),
+                         load_skills_set()
+                     )
+                     results = future.result()
+
+             similarity_score = compute_similarity(
+                 results["resume_embedding"], results["job_embedding"])
+             skill_match = compute_skill_match(
+                 results["resume_skills"], results["job_skills"], job_text, top_n=30)
+
+             # Results Header
+             st.subheader("📈 Results")
+             st.metric("Cosine Similarity Score", f"{similarity_score:.2f}")
+             st.info(interpret_similarity(similarity_score))
+
+             # --- Split Results into 2 Columns ---
+             res_col1, res_col2 = st.columns(2)
+
+             with res_col1:
+                 with st.expander("✅ Matched Skills: click to expand"):
+                     st.success(f"{len(skill_match['overlap'])} skills matched")
+                     st.write(", ".join(
+                         skill_match["overlap"]) if skill_match["overlap"] else "No matched skills found")
+
+             with res_col2:
+                 with st.expander("❌ Top Missing Skills: click to expand"):
+                     st.error(f"{len(skill_match['missing'])} skills missing")
+                     st.write(", ".join(
+                         skill_match["missing"]) if skill_match["missing"] else "No missing skills found")
+
+             # Memory monitoring
+             st.write(
+                 f"Memory usage after analysis: {psutil.Process().memory_info().rss / (1024 ** 2):.2f} MB")
+
+         finally:
+             st.session_state.analyzing = False
+             gc.collect()  # Force garbage collection
architecture_diagram.py ADDED
@@ -0,0 +1,30 @@
+ from graphviz import Digraph
+
+ # Create Digraph
+ dot = Digraph("ResumeAnalyzer", format="png")
+ dot.attr(rankdir="TB", size="8")
+
+ # Nodes
+ dot.node("input", "User Input\n- Upload Resume (PDF)\n- Paste/Upload Job Description",
+          shape="box", style="filled", fillcolor="#cce5ff")
+ dot.node("preprocess", "Data Preprocessing\n- Extract & Clean Text\n- Tokenization\n- Skills/NER Extraction",
+          shape="box", style="filled", fillcolor="#e2e3e5")
+ dot.node("embedding", "Embedding & Feature Extraction\n- Sentence Transformers\n- Resume Embedding\n- JD Embedding\n- Skills Dictionary",
+          shape="box", style="filled", fillcolor="#d4edda")
+ dot.node("scoring", "Matching & Scoring Engine\n- Cosine Similarity\n- Skill Overlap\n- Match Score (0-100%)",
+          shape="box", style="filled", fillcolor="#fff3cd")
+ dot.node("viz", "Visualization & Results\n- Score Gauge\n- Matched/Missing Skills\n- Wordcloud / Venn Diagram\n- Feedback Report",
+          shape="box", style="filled", fillcolor="#f8d7da")
+ dot.node("webapp", "Web App Layer (Streamlit)\n- Interactive UI\n- Upload Widgets\n- Real-time Scoring\n- Public URL",
+          shape="box", style="filled", fillcolor="#d1ecf1")
+
+ # Edges
+ dot.edges([("input", "preprocess"),
+            ("preprocess", "embedding"),
+            ("embedding", "scoring"),
+            ("scoring", "viz"),
+            ("viz", "webapp")])
+
+ # Render diagram
+ file_path = "./skee_gap_architecture"
+ dot.render(file_path, format="png", cleanup=True)
+ print(file_path + ".png")  # path of the rendered diagram
extract_skills.py ADDED
@@ -0,0 +1,151 @@
+ from __future__ import annotations
+
+ import re
+ from typing import Any, Dict, List
+
+ import os
+ import numpy as np
+ import pandas as pd
+ from spacy.lang.en.stop_words import STOP_WORDS
+ from rapidfuzz import fuzz, process
+ import streamlit as st
+
+ from utils.logging_config import get_logger
+
+ logger = get_logger(__name__)
+
+
+ @st.cache_resource
+ def load_nlp():
+     import spacy
+     nlp = spacy.load("en_core_web_sm")
+     # Optimize: disable unused pipes
+     nlp.select_pipes(disable=["ner", "senter"])
+     logger.debug("spaCy model loaded with optimized pipes")
+     return nlp
+
+
+ @st.cache_resource
+ def load_embedder():
+     from sentence_transformers import SentenceTransformer
+     embedder = SentenceTransformer("all-MiniLM-L6-v2")
+     logger.debug("SentenceTransformer model loaded")
+     return embedder
+
+
+ @st.cache_data
+ def load_skills_set(csv_path: str = "skills.csv") -> List[str]:
+     """Read skills from a CSV file and return a deduplicated, normalized list."""
+     try:
+         df = pd.read_csv(csv_path, header=None, dtype=str, low_memory=False)
+     except Exception as exc:
+         logger.exception("Failed to read skills CSV '%s': %s", csv_path, exc)
+         return []
+
+     skills_set = set()
+     for row in df.values.flatten():
+         if isinstance(row, str):
+             for skill in row.split(","):
+                 clean_skill = skill.strip().lower()
+                 if clean_skill and len(clean_skill) > 2 and clean_skill not in STOP_WORDS:
+                     skills_set.add(clean_skill)
+
+     skills = sorted(skills_set)
+     logger.debug("Loaded %d skills", len(skills))
+     return skills
+
+
+ def clean_text(text: str) -> str:
+     if not text:
+         return ""
+     text = text[:10000]  # Truncate for memory safety
+     text = text.lower()
+     text = re.sub(r"\S+@\S+", " ", text)
+     text = re.sub(r"http\S+|www\.\S+", " ", text)
+     text = re.sub(r"\+?\d[\d\s\-]{7,}\d", " ", text)
+     text = re.sub(r"[^a-z0-9\s]", " ", text)
+     text = re.sub(r"\s+", " ", text).strip()
+     return text
+
+
+ def extract_skills(skills_set: List[str], text: str, fuzzy_threshold: int = 88) -> Dict[str, Any]:
+     if not text:
+         return {"dict_skills": [], "fuzzy_skills": []}
+
+     nlp = load_nlp()
+     doc = nlp(clean_text(text))
+
+     candidates = set([t.text.lower()
+                       for t in doc if t.is_alpha and t.text.lower() not in STOP_WORDS])
+     candidates.update([chunk.text.strip().lower()
+                        for chunk in doc.noun_chunks if 2 <= len(chunk.text.strip()) <= 40])
+     candidates = list(candidates)[:200]  # Limit for memory/efficiency
+
+     dict_matches = set()
+     fuzzy_matches = set()
+
+     for cand in candidates:
+         if cand in skills_set:
+             dict_matches.add(cand)
+             continue
+
+         try:
+             res = process.extractOne(
+                 cand, skills_set, scorer=fuzz.token_sort_ratio)
+             if res:
+                 matched_skill, score, _ = res
+                 if score >= fuzzy_threshold:
+                     fuzzy_matches.add(matched_skill)
+         except Exception:
+             logger.debug("Fuzzy match error for candidate: %s", cand)
+
+     return {"dict_skills": sorted(dict_matches), "fuzzy_skills": sorted(fuzzy_matches)}
+
+
+ def get_embeddings(text: str, embedder):
+     """Return embedding for text, with caching."""
+     import hashlib
+     cache_dir = os.path.join(os.path.dirname(__file__), ".cache", "embeddings")
+     os.makedirs(cache_dir, exist_ok=True)
+
+     cleaned = clean_text(text)
+     key = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
+     cache_path = os.path.join(cache_dir, f"{key}.npy")
+
+     try:
+         if os.path.exists(cache_path):
+             logger.debug("Loading embedding from cache: %s", cache_path)
+             return np.load(cache_path)
+     except Exception:
+         logger.debug("Failed to load embedding cache at %s", cache_path)
+
+     emb = embedder.encode([cleaned])[0]
+
+     try:
+         np.save(cache_path, emb)
+         logger.debug("Saved embedding to cache: %s", cache_path)
+     except Exception:
+         logger.debug("Failed to save embedding cache at %s", cache_path)
+
+     return emb
+
+
+ def process_resume_and_job_wrapper(resume_text: str, job_text: str, nlp, embedder, skills_set):
+     """Wrapper for picklability in multiprocessing."""
+     resume_clean = clean_text(resume_text)
+     job_clean = clean_text(job_text)
+
+     resume_skills = extract_skills(skills_set, resume_clean)
+     job_skills = extract_skills(skills_set, job_clean)
+
+     resume_emb = get_embeddings(resume_clean, embedder)
+     job_emb = get_embeddings(job_clean, embedder)
+
+     return {
+         "resume_clean": resume_clean,
+         "job_clean": job_clean,
+         "resume_skills": resume_skills,
+         "job_skills": job_skills,
+         "resume_embedding": resume_emb,
+         "job_embedding": job_emb,
+     }
plot_aws_deployment.py ADDED
@@ -0,0 +1,55 @@
+ import matplotlib.pyplot as plt
+ import matplotlib.patches as mpatches
+
+ # Create figure
+ fig, ax = plt.subplots(figsize=(12, 8))
+
+ # Define components and their positions
+ components = {
+     "S3\n(Storage)": (0.1, 0.8),
+     "Lambda\n(Text Extraction)": (0.4, 0.8),
+     "SageMaker\n(Embeddings & Similarity)": (0.7, 0.8),
+     "DynamoDB\n(Cache)": (0.7, 0.5),
+     "API Gateway": (0.4, 0.5),
+     "Lambda\n(Backend)": (0.55, 0.5),
+     "Streamlit\n(Frontend - EC2/Amplify)": (0.4, 0.2),
+     "CloudWatch\n(Logs)": (0.85, 0.65),
+     "CloudTrail\n(Auditing)": (0.85, 0.45)
+ }
+
+ # Draw boxes for components
+ for comp, (x, y) in components.items():
+     ax.add_patch(mpatches.FancyBboxPatch(
+         (x, y), 0.18, 0.1, boxstyle="round,pad=0.05",
+         edgecolor="black", facecolor="lightblue"
+     ))
+     ax.text(x + 0.09, y + 0.05, comp, ha="center", va="center", fontsize=9)
+
+ # Define connections (start -> end)
+ connections = [
+     ("S3\n(Storage)", "Lambda\n(Text Extraction)"),
+     ("Lambda\n(Text Extraction)", "SageMaker\n(Embeddings & Similarity)"),
+     ("SageMaker\n(Embeddings & Similarity)", "DynamoDB\n(Cache)"),
+     ("API Gateway", "Lambda\n(Backend)"),
+     ("Lambda\n(Backend)", "SageMaker\n(Embeddings & Similarity)"),
+     ("Lambda\n(Backend)", "DynamoDB\n(Cache)"),
+     ("Streamlit\n(Frontend - EC2/Amplify)", "API Gateway"),
+     ("Lambda\n(Backend)", "CloudWatch\n(Logs)"),
+     ("API Gateway", "CloudTrail\n(Auditing)")
+ ]
+
+ # Draw arrows
+ for start, end in connections:
+     x1, y1 = components[start]
+     x2, y2 = components[end]
+     ax.annotate("", xy=(x2 + 0.09, y2 + 0.05), xytext=(x1 + 0.09, y1 + 0.05),
+                 arrowprops=dict(arrowstyle="->", lw=1.2))
+
+ # Formatting
+ ax.set_xlim(0, 1)
+ ax.set_ylim(0, 1)
+ ax.axis("off")
+ ax.set_title("AWS Architecture for Resume-JD Skill Matching System",
+              fontsize=14, weight="bold")
+
+ plt.show()
requirements-dev.txt ADDED
@@ -0,0 +1,2 @@
+ pytest
+ flake8
requirements.txt CHANGED
Binary files a/requirements.txt and b/requirements.txt differ
 
scores.py ADDED
@@ -0,0 +1,74 @@
+ from sklearn.metrics.pairwise import cosine_similarity
+ from utils.logging_config import get_logger
+ import numpy as np
+ from typing import Any, Dict
+ from collections import Counter
+
+ logger = get_logger(__name__)
+
+
+ def compute_similarity(resume_emb: Any, job_emb: Any) -> float:
+     try:
+         if resume_emb is None or job_emb is None:
+             logger.warning("One or both embeddings are None")
+             return 0.0
+         score = float(cosine_similarity([resume_emb], [job_emb])[0][0])
+         logger.debug("Computed cosine similarity: %s", score)
+         return score
+     except Exception as exc:
+         logger.exception("Failed to compute similarity: %s", exc)
+         return 0.0
+
+
+ def compute_skill_match(resume_skills: Dict[str, Any], job_skills: Dict[str, Any], job_text: str, top_n: int = 20) -> Dict[str, Any]:
+     try:
+         resume_set = set(resume_skills.get("dict_skills", []) +
+                          resume_skills.get("fuzzy_skills", []))
+         job_list = job_skills.get("dict_skills", []) + \
+             job_skills.get("fuzzy_skills", [])
+         job_set = set(job_list)
+
+         overlap = resume_set & job_set
+         missing = job_set - resume_set
+
+         if len(job_set) == 0:
+             skill_score = 0.0
+         else:
+             skill_score = len(overlap) / len(job_set)
+
+         # --- Rank missing skills by frequency in job description ---
+         job_tokens = [t.lower() for t in job_text.split()]
+         freq_counter = Counter(job_tokens)
+
+         # Score missing skills by frequency in job description
+         ranked_missing = sorted(
+             missing,
+             key=lambda skill: freq_counter.get(skill.lower(), 0),
+             reverse=True
+         )[:top_n]
+
+         result = {
+             "skill_score": round(skill_score, 2),
+             "overlap": sorted(list(overlap)),
+             "missing": ranked_missing,  # now limited & ranked
+         }
+         logger.debug("Computed skill match with ranking: %s", result)
+         return result
+     except Exception as exc:
+         logger.exception("Failed to compute skill match: %s", exc)
+         return {"skill_score": 0.0, "overlap": [], "missing": []}
+
+
+ def interpret_similarity(score: float) -> str:
+     try:
+         if score >= 0.8:
+             return "✅ Excellent match! You should definitely apply for this job."
+         elif score >= 0.65:
+             return "👍 Good match. You stand a strong chance — applying is recommended."
+         elif score >= 0.5:
+             return "⚠️ Partial match. Consider improving your resume by adding missing relevant skills."
+         else:
+             return "❌ Weak match. Your resume and the job description differ significantly. Tailoring your resume is highly recommended."
+     except Exception as exc:
+         logger.exception("Failed to interpret similarity score: %s", exc)
+         return "Score interpretation unavailable."
screenshot of skeegap page.png ADDED
skee_gap.log ADDED
The diff for this file is too large to render. See raw diff
 
skee_gap_architecture.png ADDED
skeegap mermaid chart.png ADDED

Git LFS Details

  • SHA256: 59c4194163b04e73c3935d0a0d87cd03726bbf1d698325cb31a2f371f307a1d7
  • Pointer size: 131 Bytes
  • Size of remote file: 288 kB
skeegap results.png ADDED

Git LFS Details

  • SHA256: 4f011c20618011efdc7fec8d60897aad9840991e2a28597191a69d3e20015e8b
  • Pointer size: 131 Bytes
  • Size of remote file: 272 kB
skills.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:aecb68c83997b763a7014ff908c346dabbb1a2f7486fc3db02087fbab9a269c9
+ size 53615877
timeline.txt ADDED
@@ -0,0 +1,113 @@
+ Timeline (1–2 Weeks)
+
+ Week 1
+ Day 1–2 → Set up the project; get resume/job input working.
+ Day 3–4 → Implement the NLP pipeline (skills extraction, embeddings).
+ Day 5   → Add similarity scoring + results visualization.
+
+ Week 2
+ Day 6–7 → Build the Streamlit front end.
+ Day 8   → Deploy to Streamlit Cloud.
+ Day 9   → Test with your own resumes + real job postings.
+ Day 10  → Polish visuals + record a short demo.
+ Day 11–12 → Write an engaging LinkedIn launch post.
+
+ 🔹 Step 2: NLP Preprocessing & Skill Extraction
+
+ Goal: Prepare resume & job description text for comparison and scoring. Extract relevant skills, keywords, and entities to build features for similarity analysis.
+
+ 1️⃣ Text Cleaning & Normalization
+ Objective: Remove noise, standardize text.
+ Actions:
+ - Lowercase all text.
+ - Remove emails, URLs, phone numbers.
+ - Remove special characters and extra whitespace.
+ - Optional: lemmatization (convert words to their root form).
+ Python tools: re for regex; nltk or spaCy for tokenization & lemmatization (see the sketch below).
+
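+ A minimal cleaning sketch (this mirrors clean_text in extract_skills.py):
+
+     import re
+
+     def clean_text(text: str) -> str:
+         text = text.lower()
+         text = re.sub(r"\S+@\S+", " ", text)               # emails
+         text = re.sub(r"http\S+|www\.\S+", " ", text)      # URLs
+         text = re.sub(r"\+?\d[\d\s\-]{7,}\d", " ", text)   # phone numbers
+         text = re.sub(r"[^a-z0-9\s]", " ", text)           # special characters
+         return re.sub(r"\s+", " ", text).strip()           # extra whitespace
+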
+ 2️⃣ Tokenization & Stopword Removal
+ Objective: Split text into words or phrases and remove unimportant/common words.
+ Actions:
+ - Tokenize text using spaCy or nltk.word_tokenize.
+ - Remove stopwords (the, a, and, of, ...).
+ - Optional: remove very short tokens (<2 characters).
+
+ 3️⃣ Skill/Keyword Extraction
+ Two approaches:
+
+ A. Predefined Skills Dictionary
+ - Build or use a skills list (from datasets like skills-ner or StackOverflow/LinkedIn skill lists).
+ - Match resume & job description text against this list (case-insensitive substring match or fuzzy match).
+ - Output: a list of skills found in the resume and in the job description.
+
+ B. NLP-based Extraction
+ - Use NER (Named Entity Recognition) to detect skills/entities in text; the spaCy pretrained model (en_core_web_sm) can identify ORG, WORK_OF_ART, etc.
+ - Optional: train a custom NER model for skills (advanced).
+ - Use noun-chunk extraction to find phrases like “data analysis”, “machine learning”, “Python programming” (combined sketch below).
+
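+ An illustrative sketch combining the dictionary lookup (A) with noun-chunk candidates (B), assuming en_core_web_sm is installed:
+
+     import spacy
+
+     nlp = spacy.load("en_core_web_sm")
+     skills = {"python", "data analysis", "machine learning"}
+
+     doc = nlp("Built machine learning pipelines in Python for data analysis.")
+     candidates = {chunk.text.lower().strip() for chunk in doc.noun_chunks}
+     candidates |= {tok.text.lower() for tok in doc if tok.is_alpha}
+     print(candidates & skills)  # exact dictionary hits; fuzzy matching catches near-misses
+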
+ 4️⃣ Feature Vectorization / Embeddings
+ Convert resume & job description text into numerical form for similarity analysis.
+ Options:
+ - TF-IDF vectors (basic, interpretable).
+ - Sentence embeddings using sentence-transformers (state of the art). Example model: all-MiniLM-L6-v2, which generates a 384-dimensional vector for each text (sketched below).
+
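+ A short embedding sketch (sentence-transformers), previewing the Step 3 similarity computation:
+
+     from sentence_transformers import SentenceTransformer
+     from sklearn.metrics.pairwise import cosine_similarity
+
+     model = SentenceTransformer("all-MiniLM-L6-v2")
+     resume_vec, jd_vec = model.encode(["resume text ...", "job description text ..."])
+     print(resume_vec.shape)                                  # (384,)
+     print(cosine_similarity([resume_vec], [jd_vec])[0][0])   # similarity score
+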
+ 5️⃣ Output from Step 2
+ After Step 2, you'll have:
+ - Cleaned text for resume & job description.
+ - A list of skills/keywords extracted from each.
+ - Vector embeddings for similarity scoring.
+
+ Next Steps After Step 2
+ Step 3 → Similarity & Scoring Engine:
+ - Compute cosine similarity between resume & job embeddings.
+ - Compare skills lists → compute missing vs. matched skills (sketched below).
+ - Output an overall match score.
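+
+ A compact sketch of the matched/missing computation (compute_skill_match in scores.py additionally ranks missing skills by their frequency in the job description):
+
+     resume_skills = {"python", "sql", "docker"}
+     job_skills = {"python", "sql", "aws", "airflow"}
+
+     matched = resume_skills & job_skills           # {'python', 'sql'}
+     missing = job_skills - resume_skills           # {'aws', 'airflow'}
+     match_score = len(matched) / len(job_skills)   # 0.5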