
Commit f0a5f6e

Merge pull request avinashkranjan#2053 from mkswagger/master
[GSSOC'23] Added resume parser using python and pdfminer
2 parents dfe0df2 + 7fff95c commit f0a5f6e

24 files changed: +298 -0 lines changed

‎Resume_parser/README.md

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
# Resume Parser

The Resume Parser is a Python script that extracts relevant information such as educational background, work experience, and skills from a resume in PDF format.

## How It Works

The Resume Parser follows these steps to extract information from a resume:

1. The PDF file is opened and processed with the `pdfminer` library, which extracts the text content from each page of the PDF.

2. The extracted text is stored as a single string.

3. Regular expressions are used to find the educational background and work experience sections in the resume text. These patterns can be customized in the `extract_education` and `extract_experience` functions of `resumeparser.py`.

4. If a `skills_list.csv` file is provided, the script reads it and builds a list of skills to search for in the resume. Each skill should be placed on a separate line in the CSV file (a sample file is sketched after this list).

5. The script searches for each skill in the resume text using case-insensitive matching. Every skill that is found is added to the list of extracted skills.

6. The extracted educational background, work experience, and skills are printed to the console.
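No skills file is included in this commit; a minimal `skills_list.csv`, with purely illustrative skill names, could look like this, one skill per line:

```csv
Python
SQL
Machine Learning
Project Management
```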
<img width="205" alt="image" src="https://github.com/mkswagger/Amazing-Python-Scripts/assets/34826479/3600c2f8-2fea-436d-8679-327e2ecdea81">
## Usage

1. Clone the repository to your local machine and change into the project directory:

   ```shell
   git clone https://github.com/mkswagger/Amazing-Python-Scripts.git
   cd Amazing-Python-Scripts/Resume_parser
   ```

2. Place the resume PDF file you want to parse in the project directory.

3. Modify the `file_name` variable in `resumeparser.py` to match the name of your resume file.

4. Optionally, if you have a wide range of skills to extract, create a CSV file named `skills_list.csv` in the project directory, with each skill on a separate line.
5. Run the script:

   ```shell
   python resumeparser.py
   ```

The script will extract the educational background, work experience, and skills from the resume and display them in the console.
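The exact matches depend on the resume and the skills file, but the output follows the headings printed by the script, roughly like this (the entries below are hypothetical):

```text
Educational Background:
Bachelor of Technology in Computer Science

Work Experience:
January 2021
May 2023

Skills:
Python
SQL
```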
Feel free to customize the regular expressions and add additional extraction logic based on your specific requirements.

‎Resume_parser/requirements.txt

Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
altair==5.0.1
asgiref==3.6.0
async-generator==1.10
attrs==23.1.0
beautifulsoup4==4.12.2
bitarray==2.7.3
blinker==1.6.2
blis==0.7.9
breadability==0.1.20
bs4==0.0.1
cachetools==5.3.1
catalogue==2.0.8
certifi==2022.12.7
cffi==1.15.1
chardet==5.1.0
charset-normalizer==3.1.0
click==8.1.3
colorama==0.4.6
confection==0.0.4
cssselect==1.2.0
cymem==2.0.7
decorator==5.1.1
Django==4.1.7
doc2text==0.2.4
docopt==0.6.2
docx==0.2.4
et-xmlfile==1.1.0
exceptiongroup==1.1.1
feedfinder2==0.0.4
feedparser==6.0.10
filelock==3.12.2
fsspec==2023.6.0
future==0.18.3
gensim==4.3.1
gitdb==4.0.10
GitPython==3.1.31
greenlet==2.0.2
h11==0.14.0
huggingface-hub==0.15.1
idna==3.4
importlib-metadata==6.7.0
jieba3k==0.35.1
Jinja2==3.1.2
joblib==1.2.0
jsonschema==4.17.3
langcodes==3.3.0
lxml==4.9.2
markdown-it-py==3.0.0
MarkupSafe==2.1.2
mdurl==0.1.2
mime==0.1.0
murmurhash==1.0.9
newspaper3k==0.2.8
nltk==3.8.1
numpy==1.24.2
openpyxl==3.1.2
outcome==1.2.0
packaging==23.0
pandas==1.5.3
parse==1.19.0
pathy==0.10.1
pdfminer==20191125
pdfreader==0.1.12
Pillow==9.4.0
playwright==1.34.0
preshed==3.0.8
protobuf==4.23.3
pyarrow==12.0.1
pycountry==22.3.5
pycparser==2.21
pycryptodome==3.17
pydantic==1.10.6
pydeck==0.8.1b0
pyee==9.0.4
Pygments==2.15.1
Pympler==1.0.1
PyPDF2==3.0.1
pyrsistent==0.19.3
PySocks==1.7.1
pytesseract==0.3.10
python-dateutil==2.8.2
python-docx==0.8.11
pytz==2022.7.1
pytz-deprecation-shim==0.1.0.post0
PyYAML==6.0
regex==2023.6.3
requests==2.28.2
requests-file==1.5.1
rich==13.4.2
safetensors==0.3.1
scikit-learn==1.2.2
scipy==1.10.1
selenium==4.10.0
sgmllib3k==1.0.0
six==1.16.0
sklearn==0.0.post1
smart-open==6.3.0
smmap==5.0.0
sniffio==1.3.0
sortedcontainers==2.4.0
soupsieve==2.4.1
spacy==3.5.1
spacy-legacy==3.0.12
spacy-loggers==1.0.4
sqlparse==0.4.3
srsly==2.4.6
streamlit==1.23.1
sumy==0.11.0
tenacity==8.2.2
thinc==8.1.9
threadpoolctl==3.1.0
tika==2.6.0
tinysegmenter==0.3
tldextract==3.4.4
tokenizers==0.13.3
toml==0.10.2
toolz==0.12.0
tornado==6.3.2
tqdm==4.65.0
transformers==4.30.2
trio==0.22.0
trio-websocket==0.10.3
typer==0.7.0
typing_extensions==4.5.0
tzdata==2022.7
tzlocal==4.3
urllib3==1.26.15
validators==0.20.0
wasabi==1.1.1
watchdog==3.0.0
wsproto==1.2.0
zipp==3.15.0
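Note: `resumeparser.py` itself only imports `re`, `csv`, and `io` from the standard library plus `pdfminer`, so the list above looks like a full environment freeze rather than the script's actual dependencies. Assuming you only want to run this script, a minimal install would be roughly:

```shell
pip install pdfminer==20191125
```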

‎Resume_parser/resumeparser.py

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
import re
import csv
import io

from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams


def extract_education(resume_text):
    # Match degree keywords (Bachelor, Master, PhD, Diploma), optionally followed by an institution name.
    education_pattern = r"((?:Bachelor|Master|Ph\.?D|Diploma)[^.,]*\b(?:\.\b)?(?:[^.,\n]*\b(?:University|College|School|Institute)\b[^.,\n]*)?)"
    education_matches = re.findall(education_pattern, resume_text, re.IGNORECASE)
    return education_matches


def extract_experience(resume_text):
    # Match year ranges and month/year phrases that typically mark work experience entries.
    experience_pattern = r"(?:(?:[A-Z][a-z]+\s+){1,3})?(?:(?:\d{4}\s?-\s?\d{4}|\d{4})\s)?(?:(?:Present|Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)[A-Za-z\s]+\d{4})"
    experience_matches = re.findall(experience_pattern, resume_text, re.IGNORECASE)
    return experience_matches


def extract_skills(resume_text, skills_list):
    # Case-insensitive whole-word search for each skill from the CSV file.
    skills_found = []
    for skill in skills_list:
        escaped_skill = re.escape(skill)
        if re.search(r'\b{}\b'.format(escaped_skill), resume_text, re.IGNORECASE):
            skills_found.append(skill)
    return skills_found


file_name = "resumes/Resume_12.pdf"
skills_file = "skills_list.csv"  # Path to the CSV file containing skills

# Extract the text content of every page with pdfminer
i_f = open(file_name, 'rb')
res_mgr = PDFResourceManager()
ret_data = io.StringIO()
txt_converter = TextConverter(res_mgr, ret_data, laparams=LAParams())
interpreter = PDFPageInterpreter(res_mgr, txt_converter)
for page in PDFPage.get_pages(i_f):
    interpreter.process_page(page)
resume_text = ret_data.getvalue()

# Extract educational background and work experience
education = extract_education(resume_text)
experience = extract_experience(resume_text)

# Read the skills to search for from the CSV file
skills_list = []
with open(skills_file, 'r') as csv_file:
    reader = csv.reader(csv_file)
    for row in reader:
        skills_list.extend(row)

# Extract skills
skills = extract_skills(resume_text, skills_list)

# Print the extracted information
print("Educational Background:")
for edu in education:
    print(edu)

print("\nWork Experience:")
for exp in experience:
    print(exp)

print("\nSkills:")
for skill in skills:
    print(skill)

# Close the file and converter
i_f.close()
txt_converter.close()
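As a side note, the pdfminer extraction above runs at module level against a single hard-coded `file_name`, while the commit ships several PDFs under `resumes/`. A minimal sketch of how the same pipeline could be wrapped in a reusable helper and run over every PDF in that folder is shown below; `pdf_to_text` is a hypothetical helper name, not part of this commit:

```python
import glob
import io

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def pdf_to_text(path):
    """Return the plain text of a PDF using the same pdfminer pipeline as resumeparser.py."""
    res_mgr = PDFResourceManager()
    ret_data = io.StringIO()
    converter = TextConverter(res_mgr, ret_data, laparams=LAParams())
    interpreter = PDFPageInterpreter(res_mgr, converter)
    with open(path, "rb") as pdf_file:
        for page in PDFPage.get_pages(pdf_file):
            interpreter.process_page(page)
    converter.close()
    return ret_data.getvalue()


if __name__ == "__main__":
    # Parse every resume in the resumes/ folder instead of a single hard-coded file.
    for pdf_path in sorted(glob.glob("resumes/*.pdf")):
        text = pdf_to_text(pdf_path)
        print(f"{pdf_path}: extracted {len(text)} characters")
```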

‎Resume_parser/resumes/Resume_12.pdf

42.1 KB
Binary file not shown.

‎Resume_parser/resumes/resume_1.pdf

48.2 KB
Binary file not shown.

‎Resume_parser/resumes/resume_10.pdf

909 KB
Binary file not shown.

‎Resume_parser/resumes/resume_11.pdf

341 KB
Binary file not shown.

‎Resume_parser/resumes/resume_13.pdf

87.4 KB
Binary file not shown.

‎Resume_parser/resumes/resume_14.pdf

5.78 MB
Binary file not shown.

‎Resume_parser/resumes/resume_15.pdf

105 KB
Binary file not shown.

0 commit comments

