Commit d9ddb17

committed

markdown

1 parent c711705 commit d9ddb17

File tree

8 files changed

+3148

-0

lines changed

`‎markdown/20221011_1_Python_function.md`

Lines changed: 573 additions & 0 deletions

Large diffs are not rendered by default.

`‎markdown/20221012_1_Python_Crawling_with_selenium,_BeautifulSoup.md`

Lines changed: 465 additions & 0 deletions

Large diffs are not rendered by default.

`‎markdown/20221013_1_Python_File_IO_with_codecs_and_Encoding 2.md`

Lines changed: 423 additions & 0 deletions

Large diffs are not rendered by default.

`‎markdown/20221013_2_Python_crawler_with_traversal.md`

Lines changed: 150 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,150 @@`
	`1`	`+`
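	`2`	`+## Python crawler with traversal`
	`3`	`+`
	`4`	`+- A crawler that traverses pages sharing the same layout and collects data from each one`
	`5`	`+`
	`6`	`+- Build a single-page crawler first, then wrap the finished crawler in a loop`
	`7`	`+`
	`8`	`+> The most important step is working out where the loop should start!`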
	`9`	`+`
	`10`	+```python
	`11`	`+# crawling library imports`
	`12`	`+from bs4 import BeautifulSoup`
	`13`	`+from selenium import webdriver`
	`14`	`+import requests`
	`15`	`+`
	`16`	`+# time is imported to pause execution between steps`
	`17`	`+import time`
	`18`	`+`
	`19`	`+# By is needed to locate elements by XPATH since the selenium update after 2022-07`
	`20`	`+from selenium.webdriver.common.by import By`
	`21`	`+`
	`22`	`+# file io`
	`23`	`+import codecs`
	`24`	+```
	`25`	`+`
	`26`	`+<br>`
	`27`	`+`
	`28`	`+- Procedure`
	`29`	`+`
	`30`	`+1. open page N`
	`31`	`+`
	`32`	`+2. fetch the page source`
	`33`	`+`
	`34`	`+3. parse it`
	`35`	`+`
	`36`	`+4. extract the data`
	`37`	`+`
	`38`	`+5. save it to a txt file`
	`39`	`+`
	`40`	`+6. go back to step 1`
	`41`	`+`
	`42`	`+> Move to the next page by clicking the next-page button, located via its XPATH`
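
The procedure above can be sketched without a browser. This is only a minimal sketch, not the crawler from this commit: `ProdNameParser`, `FAKE_PAGES`, and `crawl_all_pages` are hypothetical names, the HTML snippets are made up, and it assumes that clicking "next" on the last page leaves the same page on screen. It swaps selenium and BeautifulSoup for the standard library's `html.parser`, and it leaves out step 5 (saving).

```python
from html.parser import HTMLParser


class ProdNameParser(HTMLParser):
    """Collects the text of every <span class="prod_name"> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "prod_name") in attrs:
            self._inside = True
            self.names.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.names[-1] += data


# Hypothetical stand-in for the site (assumption): two pages of results.
FAKE_PAGES = [
    '<span class="prod_name">Book A</span><span class="prod_name">Book B</span>',
    '<span class="prod_name">Book C</span><span class="prod_name">Book D</span>',
]


def crawl_all_pages(pages):
    seen_first_titles = set()
    titles = []
    page_index = 0
    while True:
        # Steps 1-2: open page N and take its source. Past the last page,
        # the same page is served again (assumption about the site).
        source = pages[min(page_index, len(pages) - 1)]
        # Step 3: parse.
        parser = ProdNameParser()
        parser.feed(source)
        # Stop condition: the first title on this page was already collected.
        first_title = parser.names[0]
        if first_title in seen_first_titles:
            break
        seen_first_titles.add(first_title)
        # Step 4: extract.
        titles.extend(parser.names)
        # Step 6: move to the next page.
        page_index += 1
    return titles


print(crawl_all_pages(FAKE_PAGES))
```

A set is used for the seen titles because only membership matters for the stop condition.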
	`43`	`+`
	`44`	`+<br>`
	`45`	`+`
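	`46`	`+- List-style pages: press \[F12\], open the \[Network\] menu, then click the list's next-page button`
	`47`	`+\> Even when the URL does not change, the resulting request can be inspected in the \[Headers\] and \[Payload\] tabs`
	`48`	`+\> The XPATH can be obtained this way!`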
	`49`	`+`
	`50`	+```python
	`51`	`+chrome_driver = webdriver.Chrome('chromedriver')`
	`52`	`+`
	`53`	`+# approach first page`
	`54`	`+chrome_driver.get("https://product.kyobobook.co.kr/bestseller/online?period=001")`
	`55`	`+`
	`56`	`+# list of each page's first title > needed as the loop's stop condition`
	`57`	`+check_name_list = list()`
	`58`	`+`
	`59`	`+rank_list = list()`
	`60`	`+title_list = list()`
	`61`	`+price_list = list()`
	`62`	`+author_list = list()`
	`63`	`+`
	`64`	`+time.sleep(6)`
	`65`	`+`
	`66`	`+# loop over pages`
	`67`	`+while True:`
	`68`	`+`
	`69`	`+    # scroll down to the bottom (keeps ads from covering the page)`
	`70`	`+    chrome_driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")`
	`71`	`+`
	`72`	`+    # source code crawling`
	`73`	`+    source = chrome_driver.page_source`
	`74`	`+`
	`75`	`+    # parsing`
	`76`	`+    html_parsed_source = BeautifulSoup(source, "html.parser")`
	`77`	`+`
	`78`	`+    # extract data(span_prod_name) & saving in list`
	`79`	`+`
	`80`	`+    ####### Title`
	`81`	`+    span_prod_name = html_parsed_source.find_all("span", class_="prod_name")`
	`82`	`+`
	`83`	`+    # stop condition for the while loop -> this page's first title is already in check_name_list`
	`84`	`+    if span_prod_name[0].text in check_name_list:`
	`85`	`+        chrome_driver.close()`
	`86`	`+        break`
	`87`	`+    check_name_list.append(span_prod_name[0].text)`
	`88`	`+`
	`89`	`+    for title in span_prod_name:`
	`90`	`+        title_list.append(title.text)`
	`91`	`+`
	`92`	`+    ######## Rank`
	`93`	`+    div_prod_rank = html_parsed_source.find_all("div", class_="prod_rank")`
	`94`	`+`
	`95`	`+    for rank in div_prod_rank:`
	`96`	`+        rank_list.append(rank.text)`
	`97`	`+`
	`98`	`+    ######## Price`
	`99`	`+    span_val = html_parsed_source.find_all("span", class_="val")`
	`100`	`+`
	`101`	`+    for price in span_val:`
	`102`	`+        if price.text == "0":`
	`103`	`+            pass  # skip entries whose text is "0"`
	`104`	`+        else:`
	`105`	`+            price_list.append(price.text)`
	`106`	`+`
	`107`	`+    ######### Author`
	`108`	`+    span_prod_author = html_parsed_source.find_all("span", class_="prod_author")`
	`109`	`+`
	`110`	`+    for author in span_prod_author:`
	`111`	`+        author_list.append(author.text.split(" ·")[0])`
	`112`	`+`
	`113`	`+    # move on by clicking the next-page button, located via its XPATH`
	`114`	`+    chrome_driver.find_element(By.XPATH, '//*[@id="tabRoot"]/div[4]/div[2]/button[2]').click()`
	`115`	`+`
	`116`	`+    time.sleep(6)`
	`117`	`+`
	`118`	`+# check that every list holds the same number of extracted items`
	`119`	`+book_list = [ title_list, rank_list, price_list, author_list ]`
	`120`	`+`
	`121`	`+for book in book_list:`
	`122`	`+    print(len(book))`
	`123`	+```
	`124`	`+`
	`125`	+```text
	`126`	`+916`
	`127`	`+916`
	`128`	`+916`
	`129`	`+916`
	`130`	+```
	`131`	`+`
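
The printed counts above are compared by eye. As a hedged alternative, the same check can be made to fail loudly and, when it passes, yield one record per book. `combine_columns` is a hypothetical helper, not part of this commit, and the sample values are made up.

```python
def combine_columns(rank_list, title_list, author_list, price_list):
    """Return one (rank, title, author, price) tuple per book."""
    columns = [rank_list, title_list, author_list, price_list]
    lengths = {len(column) for column in columns}
    if len(lengths) != 1:
        # A mismatch means at least one selector matched extra or missing elements.
        raise ValueError(f"column lengths differ: {sorted(lengths)}")
    return list(zip(rank_list, title_list, author_list, price_list))


# Made-up sample values, not data from the crawl.
records = combine_columns(
    ["1", "2"],
    ["Book A", "Book B"],
    ["Author A", "Author B"],
    ["16,200", "15,120"],
)
print(records[0])
```

Checking before zipping matters because `zip` silently truncates to the shortest list, which would pair ranks with the wrong titles without any error.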
	`132`	`+<br>`
	`133`	`+`
	`134`	+```python
	`135`	`+# write out as csv`
	`136`	`+w_csv = codecs.open("C:/Users/EthanJ/develop/PLAYDATA/Python_basic/crawling/crawler_with_traversal.csv", 'w', "utf-8-sig")`
	`137`	`+`
	`138`	`+for i in range(len(title_list)):`
	`139`	`+    this_line = "%s,%s,%s,%s\n" %(rank_list[i].replace(',', ','), title_list[i].replace(',', ','),`
	`140`	`+                                  author_list[i].replace(',', ','), price_list[i].replace(',', ','))`
	`141`	`+    w_csv.write(this_line)`
	`142`	`+`
	`143`	`+w_csv.close()`
	`144`	+```
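
As written, the `replace(',', ',')` calls above leave each field unchanged, so a title or price that contains a comma would still split into extra columns. One possible alternative, sketched here under assumptions, is the standard library's `csv` module, which quotes such fields. `rows_to_csv_text` and the sample row are hypothetical, and the sketch writes to an in-memory buffer rather than the file path used above.

```python
import csv
import io


def rows_to_csv_text(rows):
    """Serialize rows to CSV text; fields containing commas are quoted."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample row with commas inside the title and the price.
sample_rows = [("1", "Title, with comma", "Author A", "16,200")]
csv_text = rows_to_csv_text(sample_rows)
print(csv_text)

# Reading the text back recovers the original four fields.
print(list(csv.reader(io.StringIO(csv_text))))
```

To write a real file instead, the same writer can wrap `open(path, "w", newline="", encoding="utf-8-sig")`, which keeps the BOM that the block above relies on.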
	`145`	`+`
	`146`	`+![1013_2_1](https://github.com/insung-ethan-j/Basic_Python_with_Data/blob/13744218c87d839e99c05dcc2036ba96fbaa6f12/img_source/1013_2_1.PNG?raw=true)`
	`147`	`+`
	`148`	`+`
	`149`	`+![1013_2_2](https://github.com/insung-ethan-j/Basic_Python_with_Data/blob/13744218c87d839e99c05dcc2036ba96fbaa6f12/img_source/1013_2_2.PNG?raw=true)`
	`150`	`+`

0 commit comments
