upload project copy for huggingface deployment
- .gitattributes +3 -0
- .gitignore +5 -0
- Flowchart of aws deployment.png +0 -0
- README.md +142 -20
- app.py +131 -0
- architecture_diagram.py +30 -0
- extract_skills.py +151 -0
- plot_aws_deployment.py +55 -0
- requirements-dev.txt +2 -0
- requirements.txt +0 -0
- scores.py +74 -0
- screenshot of skeegap page.png +0 -0
- skee_gap.log +0 -0
- skee_gap_architecture.png +0 -0
- skeegap mermaid chart.png +3 -0
- skeegap results.png +3 -0
- skills.csv +3 -0
- timeline.txt +113 -0
.gitattributes
CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+skeegap[[:space:]]mermaid[[:space:]]chart.png filter=lfs diff=lfs merge=lfs -text
+skeegap[[:space:]]results.png filter=lfs diff=lfs merge=lfs -text
+skills.csv filter=lfs diff=lfs merge=lfs -text
.gitignore
ADDED
@@ -0,0 +1,5 @@
skee_gap_env
__pycache__
static
backup
*.pdf
Flowchart of aws deployment.png
ADDED
README.md
CHANGED
@@ -1,20 +1,142 @@

# Project Skee Gap

## Resume-to-Job Skill Matcher

A resume and job description skill extraction and matching tool that leverages Natural Language Processing (NLP), Sentence Transformers, and fuzzy matching to help candidates and recruiters quickly identify how well a resume aligns with a given job description.
This project provides a Streamlit-based web application where users can:

1. Upload or paste a resume (.pdf, .txt, or .docx)
2. Upload or paste a job description
3. Extract and match skills against a skills knowledge base (skills.csv)
4. View expandable sections of Matched Skills and Missing Skills
5. Gain insights into resume-job fit in an interactive and user-friendly manner
## 🔑 Key Features

1. Skill Extraction from Resume: Uses spaCy and regex rules to extract relevant skills from the resume text.
2. Job Description Skill Extraction: Identifies required skills from the job description using NLP and similarity scoring.
3. Fuzzy Matching with Skills Dataset: Even if the exact skill wording differs, fuzzy string matching ensures relevant skills are captured (e.g., “PyTorch” vs. “Torch”); see the sketch after this list.
4. Sentence Transformer Embeddings: Improves semantic matching of skills by comparing embeddings rather than just keywords.
5. Expandable Matched/Missing Skills Panels: Avoids screen clutter by hiding detailed lists until the user chooses to expand them.
6. Interactive Streamlit App: Clean, user-friendly interface to upload resumes, paste job descriptions, and instantly view results.
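A minimal sketch of the fuzzy-matching step, using the same RapidFuzz scorer (token_sort_ratio) and default threshold (88) as extract_skills.py:

```python
from rapidfuzz import fuzz, process

SKILLS = ["python", "machine learning", "data analysis", "pytorch"]

def fuzzy_lookup(candidate: str, threshold: int = 88) -> str | None:
    """Return the closest known skill for a candidate phrase, or None."""
    # token_sort_ratio is order-insensitive, so word-order differences still match
    result = process.extractOne(candidate, SKILLS, scorer=fuzz.token_sort_ratio)
    if result:
        matched_skill, score, _ = result
        if score >= threshold:
            return matched_skill
    return None

print(fuzzy_lookup("learning machine"))  # -> "machine learning"
print(fuzzy_lookup("blockchain"))        # -> None (no close match in SKILLS)
```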
## 📂 Project Structure

```
resume-skill-matcher/
│
├── app.py               # Streamlit main app
├── extract_skills.py    # Skill extraction + fuzzy/semantic matching logic
├── scores.py            # Computes similarity and skill-match scores
├── extract_text.py      # Extracts text from the resume and job description
├── skills.csv           # Master skills dataset (expandable/customizable)
├── requirements.txt     # Project dependencies
├── README.md            # Documentation (this file)
└── sample_resumes/      # (Optional) Example resumes for testing
```
## 🏗️ Architecture Overview

```mermaid
flowchart TD
    A[User Uploads Resume] -->|PDF/DOCX Parsing| B[Resume Text Extraction]
    C[User Pastes Job Description] --> D[Job Description Text Extraction]

    B --> E[Skill Extraction via spaCy + Regex]
    D --> F[Skill Extraction via spaCy + Regex]

    E --> G[Fuzzy Matching with skills.csv]
    F --> G

    G --> H[Sentence Transformer Embeddings for Semantic Similarity]
    H --> I[Matched Skills / Missing Skills Classification]

    I --> J[Streamlit Expandable Panels]
```
## 🧩 Skills Dataset (skills.csv)

The skills dataset used in this project is from Skill2vec (2017):

```bibtex
@article{van2017skill2vec,
  title={Skill2vec: Machine Learning Approach for Determining the Relevant Skills from Job Description},
  author={Van-Duyet, Le and Quan, Vo Minh and An, Dang Quang},
  journal={arXiv preprint arXiv:1707.09751},
  year={2017}
}
```

The file skills.csv contains the reference set of skills used for matching. You can customize it for different industries.
Example structure:

```
skill
Python
SQL
Data Analysis
Machine Learning
Deep Learning
Project Management
Docker
AWS
```

You can expand this dataset to cover domain-specific skill sets (e.g., finance, healthcare, cybersecurity). At load time the app flattens and normalizes this file, as sketched below.
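A condensed sketch of that normalization, following load_skills_set in extract_skills.py (stop-word filtering omitted for brevity):

```python
import pandas as pd

def load_skills(csv_path: str = "skills.csv") -> list[str]:
    """Read skills.csv and return a sorted, deduplicated, lowercased skill list."""
    df = pd.read_csv(csv_path, header=None, dtype=str, low_memory=False)
    skills = set()
    for cell in df.values.flatten():   # cells may hold comma-separated skills
        if isinstance(cell, str):
            for skill in cell.split(","):
                skill = skill.strip().lower()
                if len(skill) > 2:     # drop empty and very short fragments
                    skills.add(skill)
    return sorted(skills)
```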
## 📊 Example Workflow

1. Upload resume.pdf
2. Paste the job description into the app
3. Click Analyze
4. View the results:
   - ✅ Matched Skills (expandable list)
   - ❌ Missing Skills (expandable list)
5. Use the insights to improve your resume or assess candidate fit.
## 🚀 Roadmap / Future Enhancements

- Add a resume scoring system (percentage match score)
- Generate recommendations for missing skills
- Extend support for multi-page resumes and multiple job postings
- Add export of results to PDF/Excel
- Enhance semantic similarity using large language models (LLMs)
- Integrate with LinkedIn or job boards for automatic skill extraction
## 📜 License

This project is licensed under the MIT License.

## 🙌 Acknowledgments

- spaCy for NLP pipelines
- Sentence Transformers for semantic similarity
- RapidFuzz for efficient fuzzy string matching
- Streamlit for building the interactive web app
- Skill2vec (Van-Duyet et al., 2017) for the skills dataset (full citation above)
app.py
ADDED
@@ -0,0 +1,131 @@
import streamlit as st
from utils.extract_text import extract_text_from_pdf, extract_text_from_docx, extract_text_from_txt
from extract_skills import process_resume_and_job_wrapper, load_nlp, load_embedder, load_skills_set
from scores import compute_similarity, compute_skill_match, interpret_similarity
import gc
import psutil
from concurrent.futures import ProcessPoolExecutor

st.set_page_config(page_title="Resume Analyzer", layout="wide")
st.title("📊 Resume vs Job Description Analyzer")

MIN_CHAR_COUNT = 300

# --- Layout: 2 columns ---
col1, col2 = st.columns(2)

# Resume Input (left)
with col1:
    st.subheader("Resume")
    resume_text = ""
    resume_file = st.file_uploader(
        "Upload Resume (PDF/DOCX/TXT)", type=["pdf", "docx", "txt"], key="resume")
    if resume_file:
        if resume_file.type == "application/pdf":
            resume_text = extract_text_from_pdf(resume_file)
        elif resume_file.type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
            resume_text = extract_text_from_docx(resume_file)
        elif resume_file.type == "text/plain":
            resume_text = extract_text_from_txt(resume_file)

    resume_text_paste = st.text_area(
        "Or paste resume text here", height=200, key="resume_paste")
    if resume_text_paste.strip():
        resume_text = resume_text_paste

    if resume_text:
        st.write(f"**Characters:** {len(resume_text)}")
        if len(resume_text) < MIN_CHAR_COUNT:
            st.warning(
                "⚠️ Resume text seems too short; extraction may have failed.")

# Job Input (right)
with col2:
    st.subheader("Job Description")
    job_text = ""
    job_file = st.file_uploader(
        "Upload Job Description (PDF/DOCX/TXT)", type=["pdf", "docx", "txt"], key="job")
    if job_file:
        if job_file.type == "application/pdf":
            job_text = extract_text_from_pdf(job_file)
        elif job_file.type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
            job_text = extract_text_from_docx(job_file)
        elif job_file.type == "text/plain":
            job_text = extract_text_from_txt(job_file)

    job_text_paste = st.text_area(
        "Or paste job description here", height=200, key="job_paste")
    if job_text_paste.strip():
        job_text = job_text_paste

    if job_text:
        st.write(f"**Characters:** {len(job_text)}")
        if len(job_text) < MIN_CHAR_COUNT:
            st.warning(
                "⚠️ Job description seems too short; extraction may have failed.")

# --- Centered Analyze Button ---
col1, col2, col3 = st.columns([1, 1, 1])
with col2:
    analyze_button = st.button("🔍 Compare")

# Initialize session state for analyzing flag
if 'analyzing' not in st.session_state:
    st.session_state.analyzing = False

# --- Action ---
if analyze_button:
    if not resume_text or not job_text:
        st.error("Please provide both the resume and the job description.")
    elif st.session_state.analyzing:
        st.warning("Another analysis is in progress. Please wait.")
    else:
        st.session_state.analyzing = True
        try:
            with st.spinner("🔄 Analyzing your resume ↔ job description... 🚀"):
                # Offload to a subprocess so the UI stays responsive
                with ProcessPoolExecutor(max_workers=1) as executor:
                    future = executor.submit(
                        process_resume_and_job_wrapper,
                        resume_text,
                        job_text,
                        load_nlp(),
                        load_embedder(),
                        load_skills_set()
                    )
                    results = future.result()

                similarity_score = compute_similarity(
                    results["resume_embedding"], results["job_embedding"])
                skill_match = compute_skill_match(
                    results["resume_skills"], results["job_skills"], job_text, top_n=30)

                # Results Header
                st.subheader("📈 Results")
                st.metric("Cosine Similarity Score", f"{similarity_score:.2f}")
                st.info(interpret_similarity(similarity_score))

                # --- Split Results into 2 Columns ---
                res_col1, res_col2 = st.columns(2)

                with res_col1:
                    with st.expander("✅ Matched Skills: click to expand"):
                        st.success(f"{len(skill_match['overlap'])} skills matched")
                        st.write(", ".join(
                            skill_match["overlap"]) if skill_match["overlap"] else "No matched skills found")

                with res_col2:
                    with st.expander("❌ Top Missing Skills: click to expand"):
                        st.error(f"{len(skill_match['missing'])} skills missing")
                        st.write(", ".join(
                            skill_match["missing"]) if skill_match["missing"] else "No missing skills found")

                # Memory monitoring
                st.write(
                    f"Memory usage after analysis: {psutil.Process().memory_info().rss / (1024 ** 2):.2f} MB")

        finally:
            st.session_state.analyzing = False
            gc.collect()  # Force garbage collection
architecture_diagram.py
ADDED
@@ -0,0 +1,30 @@
from graphviz import Digraph

# Create Digraph
dot = Digraph("ResumeAnalyzer", format="png")
dot.attr(rankdir="TB", size="8")

# Nodes
dot.node("input", "User Input\n- Upload Resume (PDF)\n- Paste/Upload Job Description", shape="box", style="filled", fillcolor="#cce5ff")

dot.node("preprocess", "Data Preprocessing\n- Extract & Clean Text\n- Tokenization\n- Skills/NER Extraction", shape="box", style="filled", fillcolor="#e2e3e5")

dot.node("embedding", "Embedding & Feature Extraction\n- Sentence Transformers\n- Resume Embedding\n- JD Embedding\n- Skills Dictionary", shape="box", style="filled", fillcolor="#d4edda")

dot.node("scoring", "Matching & Scoring Engine\n- Cosine Similarity\n- Skill Overlap\n- Match Score (0-100%)", shape="box", style="filled", fillcolor="#fff3cd")

dot.node("viz", "Visualization & Results\n- Score Gauge\n- Matched/Missing Skills\n- Wordcloud / Venn Diagram\n- Feedback Report", shape="box", style="filled", fillcolor="#f8d7da")

dot.node("webapp", "Web App Layer (Streamlit)\n- Interactive UI\n- Upload Widgets\n- Real-time Scoring\n- Public URL", shape="box", style="filled", fillcolor="#d1ecf1")

# Edges
dot.edges([("input", "preprocess"),
           ("preprocess", "embedding"),
           ("embedding", "scoring"),
           ("scoring", "viz"),
           ("viz", "webapp")])

# Render diagram and report the output path
file_path = "./skee_gap_architecture"
dot.render(file_path, format="png", cleanup=True)
print(file_path + ".png")
extract_skills.py
ADDED
@@ -0,0 +1,151 @@
from __future__ import annotations

import re
from typing import Any, Dict, List

import os
import numpy as np
import pandas as pd
from spacy.lang.en.stop_words import STOP_WORDS
from rapidfuzz import fuzz, process
import streamlit as st

from utils.logging_config import get_logger

logger = get_logger(__name__)


@st.cache_resource
def load_nlp():
    import spacy
    nlp = spacy.load("en_core_web_sm")
    # Optimize: disable unused pipes
    nlp.select_pipes(disable=["ner", "senter"])
    logger.debug("spaCy model loaded with optimized pipes")
    return nlp


@st.cache_resource
def load_embedder():
    from sentence_transformers import SentenceTransformer
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    logger.debug("SentenceTransformer model loaded")
    return embedder


@st.cache_data
def load_skills_set(csv_path: str = "skills.csv") -> List[str]:
    """Read skills from a CSV file and return a deduplicated, normalized list."""
    try:
        df = pd.read_csv(csv_path, header=None, dtype=str, low_memory=False)
    except Exception as exc:
        logger.exception("Failed to read skills CSV '%s': %s", csv_path, exc)
        return []

    skills_set = set()
    for row in df.values.flatten():
        if isinstance(row, str):
            for skill in row.split(","):
                clean_skill = skill.strip().lower()
                if clean_skill and len(clean_skill) > 2 and clean_skill not in STOP_WORDS:
                    skills_set.add(clean_skill)

    skills = sorted(skills_set)
    logger.debug("Loaded %d skills", len(skills))
    return skills


def clean_text(text: str) -> str:
    if not text:
        return ""
    text = text[:10000]  # Truncate for memory safety
    text = text.lower()
    text = re.sub(r"\S+@\S+", " ", text)              # emails
    text = re.sub(r"http\S+|www\.\S+", " ", text)     # URLs
    text = re.sub(r"\+?\d[\d\s\-]{7,}\d", " ", text)  # phone numbers
    text = re.sub(r"[^a-z0-9\s]", " ", text)          # special characters
    text = re.sub(r"\s+", " ", text).strip()
    return text


def extract_skills(skills_set: List[str], text: str, fuzzy_threshold: int = 88) -> Dict[str, Any]:
    if not text:
        return {"dict_skills": [], "fuzzy_skills": []}

    nlp = load_nlp()
    doc = nlp(clean_text(text))

    candidates = set([t.text.lower()
                      for t in doc if t.is_alpha and t.text.lower() not in STOP_WORDS])
    candidates.update([chunk.text.strip().lower()
                       for chunk in doc.noun_chunks if 2 <= len(chunk.text.strip()) <= 40])
    candidates = list(candidates)[:200]  # Limit for memory/efficiency

    dict_matches = set()
    fuzzy_matches = set()

    for cand in candidates:
        if cand in skills_set:
            dict_matches.add(cand)
            continue

        try:
            res = process.extractOne(
                cand, skills_set, scorer=fuzz.token_sort_ratio)
            if res:
                matched_skill, score, _ = res
                if score >= fuzzy_threshold:
                    fuzzy_matches.add(matched_skill)
        except Exception:
            logger.debug("Fuzzy match error for candidate: %s", cand)

    return {"dict_skills": sorted(dict_matches), "fuzzy_skills": sorted(fuzzy_matches)}


def get_embeddings(text: str, embedder):
    """Return embedding for text, with caching."""
    import hashlib
    cache_dir = os.path.join(os.path.dirname(__file__), ".cache", "embeddings")
    os.makedirs(cache_dir, exist_ok=True)

    cleaned = clean_text(text)
    key = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
    cache_path = os.path.join(cache_dir, f"{key}.npy")

    try:
        if os.path.exists(cache_path):
            logger.debug("Loading embedding from cache: %s", cache_path)
            return np.load(cache_path)
    except Exception:
        logger.debug("Failed to load embedding cache at %s", cache_path)

    emb = embedder.encode([cleaned])[0]

    try:
        np.save(cache_path, emb)
        logger.debug("Saved embedding to cache: %s", cache_path)
    except Exception:
        logger.debug("Failed to save embedding cache at %s", cache_path)

    return emb


def process_resume_and_job_wrapper(resume_text: str, job_text: str, nlp, embedder, skills_set):
    """Wrapper for picklability in multiprocessing."""
    resume_clean = clean_text(resume_text)
    job_clean = clean_text(job_text)

    resume_skills = extract_skills(skills_set, resume_clean)
    job_skills = extract_skills(skills_set, job_clean)

    resume_emb = get_embeddings(resume_clean, embedder)
    job_emb = get_embeddings(job_clean, embedder)

    return {
        "resume_clean": resume_clean,
        "job_clean": job_clean,
        "resume_skills": resume_skills,
        "job_skills": job_skills,
        "resume_embedding": resume_emb,
        "job_embedding": job_emb,
    }
plot_aws_deployment.py
ADDED
@@ -0,0 +1,55 @@
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

# Create figure
fig, ax = plt.subplots(figsize=(12, 8))

# Define components and their positions
components = {
    "S3\n(Storage)": (0.1, 0.8),
    "Lambda\n(Text Extraction)": (0.4, 0.8),
    "SageMaker\n(Embeddings & Similarity)": (0.7, 0.8),
    "DynamoDB\n(Cache)": (0.7, 0.5),
    "API Gateway": (0.4, 0.5),
    "Lambda\n(Backend)": (0.55, 0.5),
    "Streamlit\n(Frontend - EC2/Amplify)": (0.4, 0.2),
    "CloudWatch\n(Logs)": (0.85, 0.65),
    "CloudTrail\n(Auditing)": (0.85, 0.45)
}

# Draw boxes for components
for comp, (x, y) in components.items():
    ax.add_patch(mpatches.FancyBboxPatch(
        (x, y), 0.18, 0.1, boxstyle="round,pad=0.05",
        edgecolor="black", facecolor="lightblue"
    ))
    ax.text(x + 0.09, y + 0.05, comp, ha="center", va="center", fontsize=9)

# Define connections (start -> end)
connections = [
    ("S3\n(Storage)", "Lambda\n(Text Extraction)"),
    ("Lambda\n(Text Extraction)", "SageMaker\n(Embeddings & Similarity)"),
    ("SageMaker\n(Embeddings & Similarity)", "DynamoDB\n(Cache)"),
    ("API Gateway", "Lambda\n(Backend)"),
    ("Lambda\n(Backend)", "SageMaker\n(Embeddings & Similarity)"),
    ("Lambda\n(Backend)", "DynamoDB\n(Cache)"),
    ("Streamlit\n(Frontend - EC2/Amplify)", "API Gateway"),
    ("Lambda\n(Backend)", "CloudWatch\n(Logs)"),
    ("API Gateway", "CloudTrail\n(Auditing)")
]

# Draw arrows between box centers
for start, end in connections:
    x1, y1 = components[start]
    x2, y2 = components[end]
    ax.annotate("", xy=(x2 + 0.09, y2 + 0.05), xytext=(x1 + 0.09, y1 + 0.05),
                arrowprops=dict(arrowstyle="->", lw=1.2))

# Formatting
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.axis("off")
ax.set_title("AWS Architecture for Resume-JD Skill Matching System",
             fontsize=14, weight="bold")

plt.show()
requirements-dev.txt
ADDED
@@ -0,0 +1,2 @@
pytest
flake8
requirements.txt
CHANGED
Binary files a/requirements.txt and b/requirements.txt differ
scores.py
ADDED
@@ -0,0 +1,74 @@
from sklearn.metrics.pairwise import cosine_similarity
from utils.logging_config import get_logger
import numpy as np
from typing import Any, Dict
from collections import Counter

logger = get_logger(__name__)


def compute_similarity(resume_emb: Any, job_emb: Any) -> float:
    try:
        if resume_emb is None or job_emb is None:
            logger.warning("One or both embeddings are None")
            return 0.0
        score = float(cosine_similarity([resume_emb], [job_emb])[0][0])
        logger.debug("Computed cosine similarity: %s", score)
        return score
    except Exception as exc:
        logger.exception("Failed to compute similarity: %s", exc)
        return 0.0


def compute_skill_match(resume_skills: Dict[str, Any], job_skills: Dict[str, Any], job_text: str, top_n: int = 20) -> Dict[str, Any]:
    try:
        resume_set = set(resume_skills.get("dict_skills", []) +
                         resume_skills.get("fuzzy_skills", []))
        job_list = job_skills.get("dict_skills", []) + \
            job_skills.get("fuzzy_skills", [])
        job_set = set(job_list)

        overlap = resume_set & job_set
        missing = job_set - resume_set

        if len(job_set) == 0:
            skill_score = 0.0
        else:
            skill_score = len(overlap) / len(job_set)

        # --- Rank missing skills by frequency in job description ---
        # (single-token counting: multi-word skills get a frequency of 0)
        job_tokens = [t.lower() for t in job_text.split()]
        freq_counter = Counter(job_tokens)

        ranked_missing = sorted(
            missing,
            key=lambda skill: freq_counter.get(skill.lower(), 0),
            reverse=True
        )[:top_n]

        result = {
            "skill_score": round(skill_score, 2),
            "overlap": sorted(list(overlap)),
            "missing": ranked_missing,  # limited & ranked
        }
        logger.debug("Computed skill match with ranking: %s", result)
        return result
    except Exception as exc:
        logger.exception("Failed to compute skill match: %s", exc)
        return {"skill_score": 0.0, "overlap": [], "missing": []}


def interpret_similarity(score: float) -> str:
    try:
        if score >= 0.8:
            return "✅ Excellent match! You should definitely apply for this job."
        elif score >= 0.65:
            return "👍 Good match. You stand a strong chance — applying is recommended."
        elif score >= 0.5:
            return "⚠️ Partial match. Consider improving your resume by adding missing relevant skills."
        else:
            return "❌ Weak match. Your resume and the job description differ significantly. Tailoring your resume is highly recommended."
    except Exception as exc:
        logger.exception("Failed to interpret similarity score: %s", exc)
        return "Score interpretation unavailable."
screenshot of skeegap page.png
ADDED
skee_gap.log
ADDED
The diff for this file is too large to render.
skee_gap_architecture.png
ADDED
skeegap mermaid chart.png
ADDED
skeegap results.png
ADDED
skills.csv
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:aecb68c83997b763a7014ff908c346dabbb1a2f7486fc3db02087fbab9a269c9
size 53615877
timeline.txt
ADDED
@@ -0,0 +1,113 @@
Timeline (1–2 Weeks)

Week 1
- Day 1–2 → Set up the project; get resume/job input working.
- Day 3–4 → Implement the NLP pipeline (skill extraction, embeddings).
- Day 5 → Add similarity scoring + results visualization.

Week 2
- Day 6–7 → Build the Streamlit front end.
- Day 8 → Deploy to Streamlit Cloud.
- Day 9 → Test with your own resumes + real job postings.
- Day 10 → Polish visuals + record a short demo.
- Day 11–12 → Write an engaging LinkedIn launch post.


🔹 Step 2: NLP Preprocessing & Skill Extraction

Goal:
Prepare resume & job description text for comparison and scoring. Extract relevant skills, keywords, and entities to build features for similarity analysis.

1️⃣ Text Cleaning & Normalization

Objectives: Remove noise, standardize text.

Actions:
- Lowercase all text.
- Remove emails, URLs, and phone numbers.
- Remove special characters and extra whitespace.
- Optional: lemmatization (reduce words to their root form).

Python tools: re for regex; nltk or spaCy for tokenization & lemmatization. A sketch of this cleaning step follows.
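A minimal sketch, condensed from clean_text in extract_skills.py:

    import re

    def clean_text(text: str) -> str:
        """Lowercase, strip emails/URLs/phone numbers, normalize whitespace."""
        text = text.lower()
        text = re.sub(r"\S+@\S+", " ", text)              # emails
        text = re.sub(r"http\S+|www\.\S+", " ", text)     # URLs
        text = re.sub(r"\+?\d[\d\s\-]{7,}\d", " ", text)  # phone numbers
        text = re.sub(r"[^a-z0-9\s]", " ", text)          # special characters
        return re.sub(r"\s+", " ", text).strip()          # extra whitespace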
2️⃣ Tokenization & Stopword Removal

Objectives: Split text into words or phrases and remove unimportant/common words.

Actions:
- Tokenize text using spaCy or nltk.word_tokenize.
- Remove stopwords (the, a, and, of...).
- Optional: remove very short tokens (<2 characters).

A spaCy sketch of this step follows.
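A minimal sketch (assumes the en_core_web_sm model is installed):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def tokenize(text: str) -> list[str]:
        """Keep alphabetic, non-stopword tokens of 2+ characters."""
        doc = nlp(text)
        return [t.text.lower() for t in doc
                if t.is_alpha and not t.is_stop and len(t.text) >= 2]

    print(tokenize("Built dashboards in Python and SQL for the analytics team."))
    # e.g. ['built', 'dashboards', 'python', 'sql', 'analytics', 'team']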
3️⃣ Skill/Keyword Extraction

Two approaches:

A. Predefined Skills Dictionary
- Build or use a skills list (from datasets like skills-ner or StackOverflow/LinkedIn skill lists).
- Match resume & job description text against this list (case-insensitive substring match or fuzzy match).
- Output: a list of skills found in the resume and in the job description.

B. NLP-based Extraction
- Use NER (Named Entity Recognition) to detect skills/entities in text; the spaCy pretrained model (en_core_web_sm) can identify ORG, WORK_OF_ART, etc.
- Optional: train a custom NER model for skills (advanced).
- Use noun-chunk extraction to find phrases like "data analysis", "machine learning", "Python programming"; a sketch follows.
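A minimal noun-chunk sketch (doc.noun_chunks requires the dependency parser, which en_core_web_sm includes):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def noun_chunk_candidates(text: str) -> set[str]:
        """Collect noun phrases as candidate skills."""
        doc = nlp(text)
        return {chunk.text.strip().lower() for chunk in doc.noun_chunks
                if 2 <= len(chunk.text.strip()) <= 40}

    print(noun_chunk_candidates("Experience with machine learning and data analysis in Python."))
    # e.g. {'experience', 'machine learning', 'data analysis', 'python'}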
4️⃣ Feature Vectorization / Embeddings

Convert resume & job description text into numerical form for similarity analysis.

Options:
- TF-IDF vectors (basic, interpretable).
- Sentence embeddings using sentence-transformers (state of the art). Example model: all-MiniLM-L6-v2, which generates a 384-dimensional vector for each text; a sketch follows.
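A minimal embedding sketch with sentence-transformers:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    resume_emb = model.encode("Python developer with SQL and machine learning experience.")
    job_emb = model.encode("Looking for a machine learning engineer skilled in Python.")

    print(resume_emb.shape)  # (384,): one 384-dimensional vector per text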
5️⃣ Output from Step 2

After Step 2, you'll have:
- Cleaned text for the resume & job description.
- A list of skills/keywords extracted from each.
- Vector embeddings for similarity scoring.

Next Steps After Step 2

Step 3 → Similarity & Scoring Engine:
- Compute cosine similarity between the resume & job embeddings.
- Compare the skills lists → compute missing vs. matched skills.
- Output an overall match score. A sketch of this scoring step follows.
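A minimal scoring sketch, mirroring the logic in scores.py (plain numpy cosine instead of scikit-learn):

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine similarity between two embedding vectors."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def skill_overlap(resume_skills: set[str], job_skills: set[str]) -> dict:
        """Matched/missing skills plus the share of job skills covered."""
        matched = resume_skills & job_skills
        missing = job_skills - resume_skills
        score = len(matched) / len(job_skills) if job_skills else 0.0
        return {"matched": sorted(matched), "missing": sorted(missing),
                "score": round(score, 2)}

    print(skill_overlap({"python", "sql"}, {"python", "sql", "docker"}))
    # -> {'matched': ['python', 'sql'], 'missing': ['docker'], 'score': 0.67}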
|