
Commit f0a5f6e

Merge pull request avinashkranjan#2053 from mkswagger/master
[GSSOC'23] Added resume parser using python and pdfminer
2 parents dfe0df2 + 7fff95c commit f0a5f6e

24 files changed: +298 -0 lines changed

‎Resume_parser/README.md

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
# Resume Parser

The Resume Parser is a Python script that extracts relevant information such as educational background, work experience, and skills from a resume in PDF format.

## How It Works

The Resume Parser follows these steps to extract information from a resume:

1. The PDF file is opened and processed with the `pdfminer` library, which extracts the text content from each page of the PDF.

2. The extracted text is stored as a single string.

3. Regular expressions are used to find the educational background and work experience sections in the resume text. These patterns can be customized in the `extract_education` and `extract_experience` functions of `resumeparser.py`.

4. If a `skills_list.csv` file is provided, the script reads it and builds a list of skills to search for in the resume. Each skill should be placed on a separate line in the CSV file (a sample file is sketched after this list).

5. The script searches for each skill in the resume text using case-insensitive matching. Every skill that is found is added to the list of extracted skills.

6. The extracted educational background, work experience, and skills are printed to the console.
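No skills file is included in this commit; a minimal `skills_list.csv`, with purely illustrative skill names, could look like this, one skill per line:

```csv
Python
SQL
Machine Learning
Project Management
```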
<img width="205" alt="image" src="https://github.com/mkswagger/Amazing-Python-Scripts/assets/34826479/3600c2f8-2fea-436d-8679-327e2ecdea81">
## Usage

1. Clone the repository to your local machine and change into the project directory:

   ```shell
   git clone https://github.com/mkswagger/Amazing-Python-Scripts.git
   cd Amazing-Python-Scripts/Resume_parser
   ```

2. Place the resume PDF file you want to parse in the project directory.

3. Modify the `file_name` variable in `resumeparser.py` to match the name of your resume file.

4. Optionally, if you have a wide range of skills to extract, create a CSV file named `skills_list.csv` in the project directory, with each skill on a separate line.
5. Run the script:

   ```shell
   python resumeparser.py
   ```

The script will extract the educational background, work experience, and skills from the resume and display them in the console.
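The exact matches depend on the resume and the skills file, but the output follows the headings printed by the script, roughly like this (the entries below are hypothetical):

```text
Educational Background:
Bachelor of Technology in Computer Science

Work Experience:
January 2021
May 2023

Skills:
Python
SQL
```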
Feel free to customize the regular expressions and add additional extraction logic based on your specific requirements.

‎Resume_parser/requirements.txt

Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
altair==5.0.1
asgiref==3.6.0
async-generator==1.10
attrs==23.1.0
beautifulsoup4==4.12.2
bitarray==2.7.3
blinker==1.6.2
blis==0.7.9
breadability==0.1.20
bs4==0.0.1
cachetools==5.3.1
catalogue==2.0.8
certifi==2022.12.7
cffi==1.15.1
chardet==5.1.0
charset-normalizer==3.1.0
click==8.1.3
colorama==0.4.6
confection==0.0.4
cssselect==1.2.0
cymem==2.0.7
decorator==5.1.1
Django==4.1.7
doc2text==0.2.4
docopt==0.6.2
docx==0.2.4
et-xmlfile==1.1.0
exceptiongroup==1.1.1
feedfinder2==0.0.4
feedparser==6.0.10
filelock==3.12.2
fsspec==2023.6.0
future==0.18.3
gensim==4.3.1
gitdb==4.0.10
GitPython==3.1.31
greenlet==2.0.2
h11==0.14.0
huggingface-hub==0.15.1
idna==3.4
importlib-metadata==6.7.0
jieba3k==0.35.1
Jinja2==3.1.2
joblib==1.2.0
jsonschema==4.17.3
langcodes==3.3.0
lxml==4.9.2
markdown-it-py==3.0.0
MarkupSafe==2.1.2
mdurl==0.1.2
mime==0.1.0
murmurhash==1.0.9
newspaper3k==0.2.8
nltk==3.8.1
numpy==1.24.2
openpyxl==3.1.2
outcome==1.2.0
packaging==23.0
pandas==1.5.3
parse==1.19.0
pathy==0.10.1
pdfminer==20191125
pdfreader==0.1.12
Pillow==9.4.0
playwright==1.34.0
preshed==3.0.8
protobuf==4.23.3
pyarrow==12.0.1
pycountry==22.3.5
pycparser==2.21
pycryptodome==3.17
pydantic==1.10.6
pydeck==0.8.1b0
pyee==9.0.4
Pygments==2.15.1
Pympler==1.0.1
PyPDF2==3.0.1
pyrsistent==0.19.3
PySocks==1.7.1
pytesseract==0.3.10
python-dateutil==2.8.2
python-docx==0.8.11
pytz==2022.7.1
pytz-deprecation-shim==0.1.0.post0
PyYAML==6.0
regex==2023.6.3
requests==2.28.2
requests-file==1.5.1
rich==13.4.2
safetensors==0.3.1
scikit-learn==1.2.2
scipy==1.10.1
selenium==4.10.0
sgmllib3k==1.0.0
six==1.16.0
sklearn==0.0.post1
smart-open==6.3.0
smmap==5.0.0
sniffio==1.3.0
sortedcontainers==2.4.0
soupsieve==2.4.1
spacy==3.5.1
spacy-legacy==3.0.12
spacy-loggers==1.0.4
sqlparse==0.4.3
srsly==2.4.6
streamlit==1.23.1
sumy==0.11.0
tenacity==8.2.2
thinc==8.1.9
threadpoolctl==3.1.0
tika==2.6.0
tinysegmenter==0.3
tldextract==3.4.4
tokenizers==0.13.3
toml==0.10.2
toolz==0.12.0
tornado==6.3.2
tqdm==4.65.0
transformers==4.30.2
trio==0.22.0
trio-websocket==0.10.3
typer==0.7.0
typing_extensions==4.5.0
tzdata==2022.7
tzlocal==4.3
urllib3==1.26.15
validators==0.20.0
wasabi==1.1.1
watchdog==3.0.0
wsproto==1.2.0
zipp==3.15.0
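Note: `resumeparser.py` itself only imports `re`, `csv`, and `io` from the standard library plus `pdfminer`, so the list above looks like a full environment freeze rather than the script's actual dependencies. Assuming you only want to run this script, a minimal install would be roughly:

```shell
pip install pdfminer==20191125
```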

‎Resume_parser/resumeparser.py

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
import re
import csv
import io

from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams


def extract_education(resume_text):
    # Match degree keywords (Bachelor, Master, PhD, Diploma), optionally followed by an institution name.
    education_pattern = r"((?:Bachelor|Master|Ph\.?D|Diploma)[^.,]*\b(?:\.\b)?(?:[^.,\n]*\b(?:University|College|School|Institute)\b[^.,\n]*)?)"
    education_matches = re.findall(education_pattern, resume_text, re.IGNORECASE)
    return education_matches


def extract_experience(resume_text):
    # Match year ranges and month/year phrases that typically mark work experience entries.
    experience_pattern = r"(?:(?:[A-Z][a-z]+\s+){1,3})?(?:(?:\d{4}\s?-\s?\d{4}|\d{4})\s)?(?:(?:Present|Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)[A-Za-z\s]+\d{4})"
    experience_matches = re.findall(experience_pattern, resume_text, re.IGNORECASE)
    return experience_matches


def extract_skills(resume_text, skills_list):
    # Case-insensitive whole-word search for each skill from the CSV file.
    skills_found = []
    for skill in skills_list:
        escaped_skill = re.escape(skill)
        if re.search(r'\b{}\b'.format(escaped_skill), resume_text, re.IGNORECASE):
            skills_found.append(skill)
    return skills_found


file_name = "resumes/Resume_12.pdf"
skills_file = "skills_list.csv"  # Path to the CSV file containing skills

# Extract the text content of every page with pdfminer
i_f = open(file_name, 'rb')
res_mgr = PDFResourceManager()
ret_data = io.StringIO()
txt_converter = TextConverter(res_mgr, ret_data, laparams=LAParams())
interpreter = PDFPageInterpreter(res_mgr, txt_converter)
for page in PDFPage.get_pages(i_f):
    interpreter.process_page(page)
resume_text = ret_data.getvalue()

# Extract educational background and work experience
education = extract_education(resume_text)
experience = extract_experience(resume_text)

# Read the skills to search for from the CSV file
skills_list = []
with open(skills_file, 'r') as csv_file:
    reader = csv.reader(csv_file)
    for row in reader:
        skills_list.extend(row)

# Extract skills
skills = extract_skills(resume_text, skills_list)

# Print the extracted information
print("Educational Background:")
for edu in education:
    print(edu)

print("\nWork Experience:")
for exp in experience:
    print(exp)

print("\nSkills:")
for skill in skills:
    print(skill)

# Close the file and converter
i_f.close()
txt_converter.close()
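As a side note, the pdfminer extraction above runs at module level against a single hard-coded `file_name`, while the commit ships several PDFs under `resumes/`. A minimal sketch of how the same pipeline could be wrapped in a reusable helper and run over every PDF in that folder is shown below; `pdf_to_text` is a hypothetical helper name, not part of this commit:

```python
import glob
import io

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def pdf_to_text(path):
    """Return the plain text of a PDF using the same pdfminer pipeline as resumeparser.py."""
    res_mgr = PDFResourceManager()
    ret_data = io.StringIO()
    converter = TextConverter(res_mgr, ret_data, laparams=LAParams())
    interpreter = PDFPageInterpreter(res_mgr, converter)
    with open(path, "rb") as pdf_file:
        for page in PDFPage.get_pages(pdf_file):
            interpreter.process_page(page)
    converter.close()
    return ret_data.getvalue()


if __name__ == "__main__":
    # Parse every resume in the resumes/ folder instead of a single hard-coded file.
    for pdf_path in sorted(glob.glob("resumes/*.pdf")):
        text = pdf_to_text(pdf_path)
        print(f"{pdf_path}: extracted {len(text)} characters")
```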

‎Resume_parser/resumes/Resume_12.pdf

42.1 KB
Binary file not shown.

‎Resume_parser/resumes/resume_1.pdf

48.2 KB
Binary file not shown.

‎Resume_parser/resumes/resume_10.pdf

909 KB
Binary file not shown.

‎Resume_parser/resumes/resume_11.pdf

341 KB
Binary file not shown.

‎Resume_parser/resumes/resume_13.pdf

87.4 KB
Binary file not shown.

‎Resume_parser/resumes/resume_14.pdf

5.78 MB
Binary file not shown.

‎Resume_parser/resumes/resume_15.pdf

105 KB
Binary file not shown.

0 commit comments

