Commit d9ddb17

markdown

1 parent c711705 commit d9ddb17

8 files changed: +3148 −0 lines changed
‎markdown/20221011_1_Python_function.md

Lines changed: 573 additions & 0 deletions
Large diffs are not rendered by default.

‎markdown/20221012_1_Python_Crawling_with_selenium,_BeautifulSoup.md

Lines changed: 465 additions & 0 deletions
Large diffs are not rendered by default.

‎markdown/20221013_1_Python_File_IO_with_codecs_and_Encoding 2.md

Lines changed: 423 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 150 additions & 0 deletions
## Python crawler with traversal

- A crawler that traverses pages of the same layout and collects data from each one

- Build a single-page crawler first, then wrap the finished crawler in a loop

> The most important part is working out where the repetition should start!
```python
# crawling library imports
from bs4 import BeautifulSoup
from selenium import webdriver
import requests

# time import, used to pause between steps
import time

# import needed for XPATH lookups since the selenium update of 2022-07
from selenium.webdriver.common.by import By

# file I/O
import codecs
```
<br>

- Sequence

1. approach the N-th page

2. source code crawling

3. parsing

4. data extraction

5. saving to a txt file

6. go back to step 1.

> Move to the next page by clicking the next-page button through its XPATH (a minimal loop skeleton follows this list)
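A minimal sketch of that loop, with a placeholder URL, class name, and XPATH (not the actual Kyobo page and selectors used further below):

```python
# minimal traversal-loop sketch (placeholder URL, class name, and XPATH)
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome('chromedriver')
driver.get("https://example.com/list?page=1")          # 1. approach the first page

collected = []          # all extracted items
seen_first_items = []   # first item of each visited page, used as the stop condition

while True:
    source = driver.page_source                        # 2. source code crawling
    parsed = BeautifulSoup(source, "html.parser")      # 3. parsing

    items = parsed.find_all("span", class_="item")     # 4. data extraction
    if not items or items[0].text in seen_first_items:
        driver.close()                                  # same page seen again -> stop
        break
    seen_first_items.append(items[0].text)
    collected.extend(item.text for item in items)

    # 6. move on by clicking the next-page button via its XPATH
    driver.find_element(By.XPATH, '//*[@id="next-button"]').click()
    time.sleep(3)

# 5. save the collected items to a txt file
with open("collected.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(collected))
```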
<br>

- List-style page: \[F12\] + \[Network menu click\] > click the next page of the list
\> even if the URL does not change, the network changes can be checked in the \[Headers\] and \[Payload\] tabs
\> so the XPATH can still be obtained!
```python
chrome_driver = webdriver.Chrome('chromedriver')

# approach the first page
chrome_driver.get("https://product.kyobobook.co.kr/bestseller/online?period=001")

# list of each page's first title > needed as the stop condition for the loop
check_name_list = list()

rank_list = list()
title_list = list()
price_list = list()
author_list = list()

time.sleep(6)

# traversal loop
while True:

    # scroll all the way down (prevents ads from covering the page)
    chrome_driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # source code crawling
    source = chrome_driver.page_source

    # parsing
    html_parsed_source = BeautifulSoup(source, "html.parser")

    # extract data & save it into the lists

    ####### Title
    span_prod_name = html_parsed_source.find_all("span", class_="prod_name")

    # stop condition for the while loop -> the same first title is already in the list
    if span_prod_name[0].text in check_name_list:
        chrome_driver.close()
        break
    check_name_list.append(span_prod_name[0].text)

    for title in span_prod_name:
        title_list.append(title.text)

    ######## Rank
    div_prod_rank = html_parsed_source.find_all("div", class_="prod_rank")

    for rank in div_prod_rank:
        rank_list.append(rank.text)

    ######## Price
    span_val = html_parsed_source.find_all("span", class_="val")

    for price in span_val:
        if price.text != "0":
            price_list.append(price.text)

    ######### Author
    span_prod_author = html_parsed_source.find_all("span", class_="prod_author")

    for author in span_prod_author:
        author_list.append(author.text.split(" ·")[0])

    # move to the next page by clicking its button via XPATH
    chrome_driver.find_element(By.XPATH, '//*[@id="tabRoot"]/div[4]/div[2]/button[2]').click()

    time.sleep(6)

# check that the extracted data item counts match
book_list = [title_list, rank_list, price_list, author_list]

for book in book_list:
    print(len(book))
```
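The fixed `time.sleep(6)` pauses always wait the full six seconds. As an optional refinement (not part of the original notebook), selenium's explicit waits can block only until the next-page button is actually clickable; a minimal sketch, reusing `chrome_driver` and `By` from the code above:

```python
# optional: explicit wait instead of a fixed sleep (sketch, not in the original code)
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

next_button_xpath = '//*[@id="tabRoot"]/div[4]/div[2]/button[2]'

# wait up to 10 seconds for the next-page button to become clickable, then click it
WebDriverWait(chrome_driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, next_button_xpath))
).click()
```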
```text
916
916
916
916
```
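Since the four counts match, the lists line up by index and can be combined into per-book records with `zip()` (a small illustration, continuing from the lists built above):

```python
# combine the parallel lists into per-book tuples: (rank, title, author, price)
books = list(zip(rank_list, title_list, author_list, price_list))

print(len(books))   # 916 records
print(books[0])     # the current #1 bestseller as a single tuple
```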
<br>

```python
# write the results to a CSV file
w_csv = codecs.open("C:/Users/EthanJ/develop/PLAYDATA/Python_basic/crawling/crawler_with_traversal.csv", 'w', "utf-8-sig")

for i in range(len(title_list)):
    this_line = "%s,%s,%s,%s\n" % (rank_list[i].replace(',', ','), title_list[i].replace(',', ','),
                                   author_list[i].replace(',', ','), price_list[i].replace(',', ','))
    w_csv.write(this_line)

w_csv.close()
```
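Building the lines by hand leaves any comma inside a field (for example prices such as 12,600) to collide with the column separator. As an alternative sketch (not the original notebook's approach), Python's built-in `csv` module quotes such fields automatically; the output path here is just an example:

```python
# alternative: let csv.writer handle quoting of fields that contain commas
import csv

with open("crawler_with_traversal.csv", "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["rank", "title", "author", "price"])  # header row
    for row in zip(rank_list, title_list, author_list, price_list):
        writer.writerow(row)
```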
![1013_2_1](https://github.com/insung-ethan-j/Basic_Python_with_Data/blob/13744218c87d839e99c05dcc2036ba96fbaa6f12/img_source/1013_2_1.PNG?raw=true)

![1013_2_2](https://github.com/insung-ethan-j/Basic_Python_with_Data/blob/13744218c87d839e99c05dcc2036ba96fbaa6f12/img_source/1013_2_2.PNG?raw=true)
