Commit 0bbdacd

Added Text Extract with additional

1 parent eb8b816 commit 0bbdacd

File tree

6 files changed

+114

-0

lines changed

Text_Extract_Images

`‎Text_Extract_Images/README.md`

Lines changed: 56 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,56 @@`
	`1`	`+# Text_Extract`
	`2`	`+`
	`3`	`+[![forthebadge made-with-python](http://ForTheBadge.com/images/badges/made-with-python.svg)](https://www.python.org/)`
	`4`	`+`
	`5`	`+Text extraction from images, OCR with Tesseract, and basic image manipulation are important yet very basic scripting tasks.`
	`6`	`+`
	`7`	+This script uses ```pytesseract``` to extract text from images. Because pytesseract only returns the recognized
	`8`	+text, this script additionally writes the text extracted from each image to a `txt` file.
	`9`	`+`
	`10`	`+## Setup instructions`
	`11`	`+`
	`12`	+- Set up a `python 3.x` virtual environment.
	`13`	+- `Activate` the environment.
	`14`	+- Install the dependencies using ```pip3 install -r requirements.txt```
	`15`	`+- You are all set and the [script](text_extract.py) is ready to run.`
	`16`	`+- Carefully follow the instructions.`
	`17`	`+`
	`18`	`+## Further Readings`
	`19`	`+`
	`20`	`+Newcomers often struggle to install Tesseract for the first time; this is a direct link to the`
	`21`	`+[installer](https://github.com/UB-Mannheim/tesseract/wiki)`
	`22`	`+`
	`23`	`+Setting up OCR can be found [here](http://bit.ly/2MClAwD)`
	`24`	`+`
	`25`	`+Adding Tesseract to the __PATH__ environment variable can simplify running the code.`
	`26`	`+[This](http://bit.ly/35d3c3Q) and [this](http://bit.ly/3ba0zmZ) link explain how to do that.`
	`27`	`+`
	`28`	`+## Usage`
	`29`	`+`
	`30`	`+Make sure that Tesseract is installed in the expected directory, then run the code according to the comments and guidelines.`
	`31`	`+`
	`32`	+```
	`33`	`+Sample -`
	`34`	`+Enter the Folder name containing Images: <Name of Folder>`
	`35`	`+Enter your desired output location: <Name of Folder>`
	`36`	+```
	`37`	`+`
	`38`	`+## Output`
	`39`	`+`
	`40`	`+Output`
	`41`	`+`
	`42`	`+![Output](img/Output.PNG)`
	`43`	`+`
	`44`	`+Image containing Text`
	`45`	`+`
	`46`	`+![Image containing Text](img/Sample.PNG)`
	`47`	`+`
	`48`	`+After Extraction`
	`49`	`+`
	`50`	`+![After Extraction](img/TextFile.PNG)`
	`51`	`+`
	`52`	`+`
	`53`	`+## Author(s)`
	`54`	`+`
	`55`	`+Made by [Vybhav Chaturvedi](https://www.linkedin.com/in/vybhav-chaturvedi-0ba82614a/)`
	`56`	`+`
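
The README's note about the PATH variable can be illustrated with a short sketch. It is not part of this commit; the helper name `resolve_tesseract` and its `fallback` parameter are assumptions made for illustration. It uses the standard library's `shutil.which` to check whether a `tesseract` executable is already reachable through PATH, and only falls back to a user-supplied location when it is not.

```python
import shutil


def resolve_tesseract(command="tesseract", fallback=None):
    """Return the full path of `command` if it is on PATH, else `fallback`.

    `resolve_tesseract` is a hypothetical helper, not part of text_extract.py.
    """
    found = shutil.which(command)  # None when the executable is not on PATH
    return found if found is not None else fallback


if __name__ == "__main__":
    path = resolve_tesseract()
    if path is None:
        # Mirrors the prompt used in text_extract.py when PATH is not set up.
        path = input("Enter the Path to Tesseract: ")
    print("Using Tesseract at:", path)
```

The resolved value could then be assigned to `pt.pytesseract.tesseract_cmd`, as the script in this commit does with the value typed by the user.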

`‎Text_Extract_Images/img/Output.PNG`

3.43 KB

`‎Text_Extract_Images/img/Sample.PNG`

14.3 KB

`‎Text_Extract_Images/img/TextFile.PNG`

13.3 KB

`‎Text_Extract_Images/requirements.txt`

Lines changed: 2 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,2 @@`
	`1`	`+pytesseract==0.3.6`
	`2`	`+Pillow==8.0.1`

`‎Text_Extract_Images/text_extract.py`

Lines changed: 56 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,56 @@`
	`1`	`+from PIL import Image`
	`2`	`+import pytesseract as pt`
	`3`	`+import os`
	`4`	`+from pathlib import Path`
	`5`	`+`
	`6`	`+`
	`7`	`+current_location = os.getcwd()`
	`8`	`+`
	`9`	`+`
	`10`	`+def extract():`
	`11`	`+    """`
	`12`	`+    Function for extracting text from images.`
	`13`	`+    Additionally, it saves the extracted text as a txt file.`
	`14`	`+    """`
	`15`	`+`
	`16`	`+    # Enter the name of folder which contains img files`
	`17`	`+    image_location = input("Enter the Folder name containing Images: ")`
	`18`	`+    image_path = os.path.join(current_location, image_location)`
	`19`	`+`
	`20`	`+    # Enter the name of folder which would contain respective txt files`
	`21`	`+    destination = input("Enter your desired output location: ")`
	`22`	`+    destination_path = os.path.join(current_location, destination)`
	`23`	`+`
	`24`	`+    # Path to Tesseract`
	`25`	`+    tesseract_path = input("Enter the Path to Tesseract: ")`
	`26`	`+    print('\nNOTE: '`
	`27`	`+          'It is preferable to setup the PATH variable to Tesseract, see README. \n')`
	`28`	`+`
	`29`	`+    # = r'C:\Program Files\Tesseract-OCR\tesseract'`
	`30`	`+    pt.pytesseract.tesseract_cmd = tesseract_path`
	`31`	`+`
	`32`	`+    # iterating over the images in the folder`
	`33`	`+    for imageName in os.listdir(image_path):`
	`34`	`+`
	`35`	`+        # Join the path and image name to obtain absolute path`
	`36`	`+        inputPath = os.path.join(image_path, imageName)`
	`37`	`+        img = Image.open(inputPath)`
	`38`	`+`
	`39`	`+        # OCR`
	`40`	`+        text = pt.image_to_string(img, lang="eng")`
	`41`	`+`
	`42`	`+        # Removing extensions`
	`43`	`+        img_file = Path(inputPath).stem`
	`44`	`+        print(img_file)`
	`45`	`+`
	`46`	`+        # The output text file`
	`47`	`+        text_file = img_file + ".txt"`
	`48`	`+        output_path = os.path.join(destination_path, text_file)`
	`49`	`+`
	`50`	`+        # saving the text for every image in a separate .txt file`
	`51`	`+        with open(output_path, "w", encoding="utf-8") as file:`
	`52`	`+            file.write(text)`
	`53`	`+`
	`54`	`+`
	`55`	`+if __name__ == '__main__':`
	`56`	`+    extract()`
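
The loop in `text_extract.py` opens every entry in the input folder and assumes the output folder already exists. The following is a possible variant, not the author's code: it skips non-image files, creates the destination folder, and takes the OCR step as a parameter so the file handling can be exercised without Tesseract installed. The names `extract_folder`, `tesseract_ocr`, and `IMAGE_SUFFIXES` are hypothetical, and the extension list is an assumption.

```python
from pathlib import Path

# Assumed set of extensions treated as images; adjust as needed.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def extract_folder(image_dir, output_dir, ocr):
    """Run `ocr` on each image in image_dir; write one .txt file per image.

    `ocr` is any callable that takes an image path and returns a string.
    Returns the list of text files written.
    """
    image_dir = Path(image_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)  # create destination if missing

    written = []
    for image_path in sorted(image_dir.iterdir()):
        # Skip sub-folders and files that do not look like images.
        if not image_path.is_file() or image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        text = ocr(image_path)
        out_path = output_dir / (image_path.stem + ".txt")
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written


def tesseract_ocr(image_path):
    """OCR backend using the same calls as text_extract.py."""
    # Imported lazily so the file handling above works without these packages.
    from PIL import Image
    import pytesseract as pt

    with Image.open(image_path) as img:
        return pt.image_to_string(img, lang="eng")
```

A call such as `extract_folder("images", "output", tesseract_ocr)` would then cover the same ground as `extract()`, with the folder names passed as arguments instead of being read through `input()`.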

0 commit comments
