Commit 0bbdacd

Added Text Extract with additional
1 parent eb8b816 commit 0bbdacd

6 files changed

+114
-0
lines changed


‎Text_Extract_Images/README.md

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
# Text_Extract

[![forthebadge made-with-python](http://ForTheBadge.com/images/badges/made-with-python.svg)](https://www.python.org/)

Text extraction from images, OCR with Tesseract, and basic image manipulation are simple but important scripting tasks.

This script uses `pytesseract` to extract text from images. Because `pytesseract` only recognizes the text and returns it for printing,
this script additionally writes the extracted text to a `txt` and/or `csv` file.
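
The script in this commit writes one `.txt` file per image; the `csv` output mentioned above does not appear in it. A minimal sketch of how a CSV summary could be written with the standard `csv` module is shown below. The function name `write_text_csv`, the column names, and the `(image_name, text)` input are assumptions for illustration and are not part of the original script.

```python
import csv


def write_text_csv(rows, csv_path):
    """Write (image_name, text) pairs to a CSV file with a header row.

    Hypothetical helper: `rows` is assumed to hold text that was already
    extracted, for example by pytesseract.image_to_string.
    """
    # newline="" lets the csv module control line endings itself
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["image", "text"])
        for image_name, text in rows:
            # Multi-line OCR text is quoted automatically by csv.writer
            writer.writerow([image_name, text.strip()])
```

Keeping all results in a single CSV file makes it easy to open them in a spreadsheet, while the per-image `.txt` files stay useful for reading one result at a time.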

## Setup instructions

- Set up a Python 3.x virtual environment.
- Activate the environment.
- Install the dependencies using `pip3 install -r requirements.txt`.
- You are all set, and the [script](text_extract.py) is ready to run.
- Carefully follow the instructions.

## Further Readings

Newcomers often struggle with Tesseract the first time they use it. This is a direct link to the
[installer](https://github.com/UB-Mannheim/tesseract/wiki).

Instructions for setting up OCR can be found [here](http://bit.ly/2MClAwD).

Adding Tesseract to the __PATH__ environment variable simplifies running the code.
[This](http://bit.ly/35d3c3Q) and [this](http://bit.ly/3ba0zmZ) link explain how to do that.
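
Once Tesseract is on the PATH, its location can be looked up instead of typed in. The following is a minimal sketch of that idea using `shutil.which` from the standard library. The helper name `resolve_tesseract_cmd` is hypothetical; the script in this commit always asks for the path.

```python
import shutil


def resolve_tesseract_cmd(user_path=""):
    """Return the Tesseract command to use, or None if none can be found.

    Hypothetical helper: an explicitly given path wins; otherwise the
    PATH environment variable is searched for a `tesseract` executable.
    """
    user_path = user_path.strip()
    if user_path:
        return user_path
    # shutil.which returns the full path of the executable, or None
    return shutil.which("tesseract")
```

The result could be assigned to `pt.pytesseract.tesseract_cmd`, as the script does with the entered path. A `None` result means Tesseract was neither given nor found on the PATH.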

## Usage

Make sure that Tesseract is installed in the proper directory, then run the code according to the comments and guidelines.

```
Sample -
Enter the Folder name containing Images: <Name of Folder>
Enter your desired output location: <Name of Folder>
Enter the Path to Tesseract: <Path to Tesseract>
```
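
For unattended runs, the same three values could be passed on the command line instead of being typed at the prompts. The sketch below uses `argparse`; the function `parse_args`, the argument names, and the `--tesseract` option are assumptions for illustration and do not exist in the script from this commit.

```python
import argparse


def parse_args(argv=None):
    """Parse the values that the script otherwise asks for interactively."""
    parser = argparse.ArgumentParser(
        description="Extract text from all images in a folder."
    )
    parser.add_argument("image_folder", help="folder containing the images")
    parser.add_argument("output_folder", help="folder for the .txt files")
    # "tesseract" relies on the executable being reachable through PATH
    parser.add_argument(
        "--tesseract",
        default="tesseract",
        help="path to the Tesseract executable",
    )
    return parser.parse_args(argv)
```

Calling `parse_args(["scans", "texts"])` would yield `image_folder="scans"`, `output_folder="texts"`, and the default `tesseract` command.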

## Output

Output

![Output](img/Output.PNG)

Image containing Text

![Image containing text](img/Sample.PNG)

After Extraction

![Text file after extraction](img/TextFile.PNG)


## Author(s)

Made by [Vybhav Chaturvedi](https://www.linkedin.com/in/vybhav-chaturvedi-0ba82614a/)

‎Text_Extract_Images/img/Output.PNG

3.43 KB

‎Text_Extract_Images/img/Sample.PNG

14.3 KB

‎Text_Extract_Images/img/TextFile.PNG

13.3 KB

‎Text_Extract_Images/requirements.txt

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
pytesseract==0.3.6
Pillow==8.0.1

‎Text_Extract_Images/text_extract.py

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
from PIL import Image
import pytesseract as pt
import os
from pathlib import Path


# Base directory against which the entered folder names are resolved.
# os.path.join adds the separator, so no trailing backslash is needed.
current_location = os.getcwd()


def extract():
    """
    Function for extracting text from images.
    Additionally, it saves the extracted text as a txt file.
    """

    # Enter the name of the folder which contains the image files
    image_location = input("Enter the Folder name containing Images: ")
    image_path = os.path.join(current_location, image_location)

    # Enter the name of the folder which will contain the respective txt files
    destination = input("Enter your desired output location: ")
    destination_path = os.path.join(current_location, destination)
    # Create the output folder if it does not exist yet
    os.makedirs(destination_path, exist_ok=True)

    # Path to Tesseract
    tesseract_path = input("Enter the Path to Tesseract: ")
    print('\nNOTE: '
          'It is preferable to setup the PATH variable to Tesseract, see README. \n')

    # Example: r'C:\Program Files\Tesseract-OCR\tesseract'
    pt.pytesseract.tesseract_cmd = tesseract_path

    # Iterating over the images in the folder
    for imageName in os.listdir(image_path):

        # Join the path and image name to obtain the absolute path
        inputPath = os.path.join(image_path, imageName)
        img = Image.open(inputPath)

        # OCR
        text = pt.image_to_string(img, lang="eng")

        # Removing extensions
        img_file = Path(inputPath).stem
        print(img_file)

        # The output text file
        text_file = img_file + ".txt"
        output_path = os.path.join(destination_path, text_file)

        # Saving the text for every image in a separate .txt file
        with open(output_path, "w", encoding="utf-8") as file:
            file.write(text)


if __name__ == '__main__':
    extract()
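
The loop in `extract()` passes every entry returned by `os.listdir` to `Image.open`, so a stray non-image file or a subfolder in the input folder would make the run fail. A minimal sketch of one way to restrict the loop to image files is shown below. The helper `list_image_files` and the `IMAGE_SUFFIXES` set are assumptions for illustration, not part of this commit.

```python
from pathlib import Path

# Assumed set of extensions; adjust it to the formats actually in use
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def list_image_files(folder):
    """Return the image files directly inside `folder`, sorted by path."""
    return sorted(
        path
        for path in Path(folder).iterdir()
        # Skip subfolders and compare suffixes case-insensitively
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )
```

The returned `Path` objects can be passed straight to `Image.open`, and `path.stem` gives the file name without its extension, as the script already does.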
