Commit b0249e4

Merge pull request #61 from DevMahmoud10/main

add: links extractor automation script

2 parents 5725072 + cc6964b, commit b0249e4

+90 -0 lines changed

Lines changed: 13 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,13 @@`
	`1`	`+# Links Extractor`
	`2`	`+`
	`3`	`+## Objective`
	`4`	+This script automates extracting URLs from the content of any ```.txt``` file using a regular expression, then writes the extracted URLs to a ```.txt``` output file, one per line.
	`5`	`+## Sample`
	`6`	+- Sample input available in ```sample/sample_text_file.txt```
	`7`	+- Sample output available in ```sample/sample_text_file_links.txt```
	`8`	`+## Requirements`
	`9`	+```pip install -r requirements.txt```
	`10`	`+## How to run the script?`
	`11`	+```
	`12`	`+python links_extractor.py file_name.txt`
	`13`	+```
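
The command above passes the input path as the only positional argument, and the script reads it from `sys.argv[1]`, so running it without an argument raises an `IndexError`. A minimal sketch of a friendlier entry point, assuming a hypothetical `parse_args` helper built on the standard `argparse` module (it is not part of this commit):

```python
import argparse


def parse_args(argv=None):
    """Hypothetical helper: parse the single positional input path."""
    parser = argparse.ArgumentParser(
        prog="links_extractor.py",
        description="Extract URLs from a .txt file.",
    )
    # One required positional argument, mirroring the README command.
    parser.add_argument("file_path", help="path to the input .txt file")
    return parser.parse_args(argv)


# Passing a list stands in for the command line in this sketch.
args = parse_args(["file_name.txt"])
print(args.file_path)
```

With this shape, a missing argument would produce a usage message and a non-zero exit code instead of a traceback.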

Lines changed: 57 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,57 @@`
	`1`	`+import re`
	`2`	`+import sys`
	`3`	`+`
	`4`	`+`
	`5`	`+def get_urls(file_path):`
	`6`	`+    """[start method to fire extracting urls process]`
	`7`	`+`
	`8`	`+    Arguments:`
	`9`	`+        file_path {[str]} -- [target text file path]`
	`10`	`+    """`
	`11`	`+    text = read_text_file(file_path)`
	`12`	`+    urls = extract_urls(text)`
	`13`	`+    export_urls(urls, file_path)`
	`14`	`+`
	`15`	`+`
	`16`	`+def read_text_file(file_path):`
	`17`	`+    """[summary]`
	`18`	`+`
	`19`	`+    Arguments:`
	`20`	`+        file_path {[str]} -- [target text file path]`
	`21`	`+`
	`22`	`+    Returns:`
	`23`	`+        [str] -- [file content to work on]`
	`24`	`+    """`
	`25`	`+    with open(file_path) as f:`
	`26`	`+        text = f.read()`
	`27`	`+    return text`
	`28`	`+`
	`29`	`+`
	`30`	`+def extract_urls(text):`
	`31`	`+    """[summary]`
	`32`	`+`
	`33`	`+    Arguments:`
	`34`	`+        text {[str]} -- [file content to work on]`
	`35`	`+`
	`36`	`+    Returns:`
	`37`	`+        [list] -- [extracted urls]`
	`38`	`+    """`
	`39`	`+    url_regex_pattern = r"(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+"`
	`40`	`+    urls = re.findall(url_regex_pattern, text)`
	`41`	`+    return urls`
	`42`	`+`
	`43`	`+`
	`44`	`+def export_urls(urls, file_path):`
	`45`	`+    """[summary]`
	`46`	`+`
	`47`	`+    Arguments:`
	`48`	`+        urls {[list]} -- [extracted urls]`
	`49`	`+        file_path {[str]} -- [result text file path]`
	`50`	`+    """`
	`51`	`+    with open(file_path.replace(".txt", "_links.txt"), "w") as f:`
	`52`	`+        f.write("\n".join(urls))`
	`53`	`+`
	`54`	`+`
	`55`	`+if __name__ == "__main__":`
	`56`	`+    file_path = sys.argv[1]`
	`57`	`+    get_urls(file_path)`
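
Because the scheme prefix in the pattern used by `extract_urls` is optional, the pattern also accepts bare dotted tokens such as version numbers. A small sketch of that behaviour, using the same pattern on a made-up input string (the string and the `URL_PATTERN` name are illustrative assumptions, not part of the commit):

```python
import re

# Same pattern as in extract_urls; the (https?|ftp):// prefix is optional.
URL_PATTERN = r"(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+"

# Illustrative input: one real URL, one bare domain, one version number.
text = "Website: http://iamlp.com\nSuggested by WMG, see example.com or version 3.14"

urls = re.findall(URL_PATTERN, text)
print(urls)  # ['http://iamlp.com', 'example.com', '3.14']
```

If only fully qualified URLs are wanted, making the scheme prefix mandatory would be one way to narrow the matches.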

Lines changed: 1 addition & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1 @@`
	`1`	`+regex==2020.9.27`

Lines changed: 13 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,13 @@`
	`1`	`+New album 'Heart To Mouth" is out now: https://lp.lnk.to/HeartToMouthID`
	`2`	`+`
	`3`	`+`
	`4`	`+Lost On You: http://smarturl.it/LostOnYouAlbum`
	`5`	`+`
	`6`	`+----------------------------------`
	`7`	`+`
	`8`	`+Website: http://iamlp.com`
	`9`	`+Facebook: http://facebook.com/iamLP`
	`10`	`+Twitter: http://twitter.com/iamlp`
	`11`	`+Soundcloud: https://soundcloud.com/iamlpmusic`
	`12`	`+Suggested by WMG`
	`13`	`+LP - Muddy Waters [Live Session]`

Lines changed: 6 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,6 @@`
	`1`	`+https://lp.lnk.to/HeartToMouthID`
	`2`	`+http://smarturl.it/LostOnYouAlbum`
	`3`	`+http://iamlp.com`
	`4`	`+http://facebook.com/iamLP`
	`5`	`+http://twitter.com/iamlp`
	`6`	`+https://soundcloud.com/iamlpmusic`
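
The output file name follows from `file_path.replace(".txt", "_links.txt")` in `export_urls`. `str.replace` rewrites every occurrence of `.txt`, and leaves the path unchanged when there is none, in which case the output would be written over the input file. A hedged alternative, assuming a hypothetical `links_path` helper based on `pathlib` (not part of this commit):

```python
from pathlib import Path


def links_path(file_path):
    """Hypothetical helper: sibling file with "_links" appended to the stem."""
    path = Path(file_path)
    # Only the final suffix is dropped; the output always ends in .txt.
    return path.with_name(f"{path.stem}_links.txt")


# Same result as the commit's approach for the sample input.
print(links_path("sample/sample_text_file.txt").as_posix())

# Without a .txt suffix, str.replace returns the input path unchanged.
print("notes.md".replace(".txt", "_links.txt"))
print(links_path("notes.md").name)
```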
