Python Web Crawling: Scraping Web Pages

댕댕냥
WebProgramming
2020. 2. 4.

Python Web Crawling - Basic

A web crawler is a computer program that browses web pages in an automated way.
The work a web crawler does is called web crawling.

Beautiful Soup

Basic setup
Import the package, then either load an HTML file or fetch the source straight from the web with the urllib or requests module.
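As a minimal sketch of this setup, the snippet below parses an HTML string directly. The markup and the variable names are made up for illustration; the same constructor also accepts an open file object or the response returned by urlopen.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
sample_html = "<html><head><title>Sample</title></head><body><p>Hello</p></body></html>"

# "html.parser" is the parser bundled with the Python standard library.
soup = BeautifulSoup(sample_html, "html.parser")

title_text = soup.title.string    # text inside <title>
first_p_text = soup.p.get_text()  # text inside the first <p>
print(title_text, first_p_text)
```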

Main functions

find() and find_all()

Both functions take the name of the tag to look for, its attributes, and other filters as arguments.
find_all(name, attrs, recursive, string, limit, **kwargs)
find_all(): returns every tag that matches the given conditions.

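from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('url address')  # placeholder: replace with a real URL
soup = BeautifulSoup(html, 'html.parser')
all_divs = soup.find_all("div")
print(all_divs)
------------------
# find_all('tag name', {'attribute name': 'value', ...})
ex_id_divs = soup.find_all('div', {'id': 'ex_id'})
print(ex_id_divs)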
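To try find_all() without a network connection, here is a small sketch that runs against an inline HTML string. The markup is hypothetical; it only exists to show the attribute filter and the limit argument from the signature above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
sample_html = """
<div id="ex_id">first</div>
<div class="box">second</div>
<div class="box">third</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

all_divs = soup.find_all("div")                 # every <div>
boxes = soup.find_all("div", {"class": "box"})  # filter by attribute
limited = soup.find_all("div", limit=2)         # stop after two matches
missing = soup.find_all("span")                 # no match gives an empty list

print(len(all_divs), len(boxes), len(limited), len(missing))
```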

find(name, attrs, recursive, string, **kwargs)
find(): returns a single tag that matches the given conditions. If several tags match, it returns the first one.

html = urlopen('url address')  # placeholder: replace with a real URL
soup = BeautifulSoup(html, 'html.parser')
first_div = soup.find("div")
print(first_div)
-----------------
# find('tag name', {'attribute name': 'value', ...})
ex_id_divs = soup.find('div', {'id': 'ex_id'})
print(ex_id_divs)
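One detail worth keeping in mind is that find() returns None when nothing matches, so chaining a call onto the result can raise an AttributeError. The sketch below, on hypothetical markup, shows both the first-match behavior and a None check.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
sample_html = '<div class="box">one</div><div class="box">two</div>'
soup = BeautifulSoup(sample_html, "html.parser")

first_box = soup.find("div", {"class": "box"})  # first of the two matches
missing = soup.find("table")                    # no <table> in the markup

first_text = first_box.get_text()
# Guard against None before calling methods on the result.
missing_text = missing.get_text() if missing is not None else None
print(first_text, missing_text)
```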

Example 1) Fetching an entire web document


$ pip install requests
$ pip install beautifulsoup4

requests: with Requests, a few lines of code are enough to fetch the HTML source of a web page.
beautifulsoup4: a Python library for parsing HTML, commonly used for web crawling.


from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.naver.com")  
bsObject = BeautifulSoup(html, "html.parser") 
print(bsObject) # prints the entire web document

print(bsObject.head.title) # prints <title>NAVER</title>

for meta in bsObject.head.find_all('meta'):
    print(meta.get('content')) # prints the content of every meta tag

print(bsObject.head.find("meta", {"name":"description"})) # prints the requested tag

for link in bsObject.find_all('a'):
    print(link.text.strip(), link.get('href')) # prints the text wrapped by each a tag and its href attribute

urlopen fetches the web page from the given address, and the result is then converted into a BeautifulSoup object.
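The same fetch can also be written with the Requests package installed above. This is only a sketch: the helper name fetch_soup and the 10-second timeout are assumptions for illustration, not something this post prescribes.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page with Requests and return it as a BeautifulSoup object.

    fetch_soup and the default timeout are hypothetical choices for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return BeautifulSoup(response.text, "html.parser")
```

Calling fetch_soup("http://www.naver.com") would then give an object that can be used like bsObject in the example above.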

A BeautifulSoup object holds the parsed web document: the document is broken down tag by tag into a tree of tags.
A tag that contains another tag is the parent, and the contained tag is its child.
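The parent/child relationship can be inspected directly. The sketch below uses a hypothetical list markup and reads the parent and children of one tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration (no whitespace between
# tags, so the only children of <ul> are the two <li> tags).
sample_html = "<html><body><ul><li>a</li><li>b</li></ul></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")

ul = soup.find("ul")
parent_name = ul.parent.name                          # tag that contains <ul>
child_names = [child.name for child in ul.children]   # tags contained in <ul>
child_texts = [li.get_text() for li in ul.find_all("li")]
print(parent_name, child_names, child_texts)
```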

Example 2) Printing the title, author, and price of Kyobo Book Centre bestsellers


from urllib.request import urlopen
from bs4 import BeautifulSoup as bs

# Fetch the Kyobo Book Centre bestseller page.

html = urlopen('http://www.kyobobook.co.kr/bestSellerNew/bestseller.laf')
bsObject = bs(html, "html.parser")

# Extract each book's detail page URL and store it in a list.
book_page_urls = []
for cover in bsObject.find_all('div', {'class':'detail'}):
    link = cover.select('a')[0].get('href')
    book_page_urls.append(link)

# Extract the needed fields from the meta tags. The author is not in the meta tags, so it is read from the page body.
for index, book_page_url in enumerate(book_page_urls):
    html = urlopen(book_page_url)
    bsObject = bs(html, "html.parser")
    title = bsObject.find('meta', {'property':'rb:itemName'}).get('content')
    author = bsObject.select('span.name a')[0].text
    image = bsObject.find('meta', {'property':'rb:itemImage'}).get('content')
    url = bsObject.find('meta', {'property':'rb:itemUrl'}).get('content')
    originalPrice = bsObject.find('meta', {'property': 'rb:originalPrice'}).get('content')
    salePrice = bsObject.find('meta', {'property':'rb:salePrice'}).get('content')

    print(index+1, title, author, image, url, originalPrice, salePrice)
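Because find() returns None when a meta tag is absent, the chained .get('content') calls above stop with an AttributeError as soon as one page lacks a field. A small helper can make that case explicit. The name meta_content and the sample markup are hypothetical; only the rb:itemName and rb:salePrice property names come from the example above.

```python
from bs4 import BeautifulSoup


def meta_content(soup, prop, default=None):
    """Return the content of <meta property=prop>, or default if it is missing.

    meta_content is a hypothetical helper written for this sketch.
    """
    tag = soup.find("meta", {"property": prop})
    if tag is None:
        return default
    return tag.get("content", default)


# Hypothetical markup: one property present, one absent.
sample_html = '<head><meta property="rb:itemName" content="Sample Book"></head>'
soup = BeautifulSoup(sample_html, "html.parser")

title = meta_content(soup, "rb:itemName")
price = meta_content(soup, "rb:salePrice", default="N/A")
print(title, price)
```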

As of February 2020.

Example 3) Printing the title, author, and price of Naver bestsellers

from urllib.request import urlopen
from bs4 import BeautifulSoup


# Fetch the Naver bestseller page.
html = urlopen('https://book.naver.com/bestsell/bestseller_list.nhn')
bsObject = BeautifulSoup(html, "html.parser")


# Extract each book's detail page URL and store it in a list.
book_page_urls = []
for index in range(0, 25):
    dl_data = bsObject.find('dt', {'id':"book_title_"+str(index)})
    link = dl_data.select('a')[0].get('href')
    book_page_urls.append(link)



# Extract the needed fields from the meta tags and the page body.
for index, book_page_url in enumerate(book_page_urls):
    html = urlopen(book_page_url)
    bsObject = BeautifulSoup(html, "html.parser")


    title = bsObject.find('meta', {'property':'og:title'}).get('content')
    # '저자' is the "Author" label shown on the page
    author = bsObject.find('dt', text='저자').find_next_siblings('dd')[0].text.strip()
    image = bsObject.find('meta', {'property':'og:image'}).get('content')
    url = bsObject.find('meta', {'property':'og:url'}).get('content')

    # '가격' is the "Price" label shown on the page
    dd = bsObject.find('dt', text='가격').find_next_siblings('dd')[0]
    salePrice = dd.select('div.lowest strong')[0].text
    originalPrice = dd.select('div.lowest span.price')[0].text

    print(index+1, title, author, image, url, originalPrice, salePrice)

Example 4) Fetching Naver blog search results

from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
import urllib.parse


# Build the Naver search URL for the search term
baseUrl = 'https://search.naver.com/search.naver?where=post&sm=tab_jum&query='
plusUrl = input('Enter a search term: ')
# Percent-encode the search term so Korean text is safe in the URL
url = baseUrl + urllib.parse.quote_plus(plusUrl)
html = urlopen(url)
bsObject = bs(html, "html.parser")

# Collect every element that matches the condition
title = bsObject.find_all(class_='sh_blog_title')


for i in title:
    print(i.attrs['title'])
    print(i.attrs['href'])
    print()

Result

Enter a search term: 크롤링
웹크롤링 [금통위의사록 파이썬으로 다운받기]
https://blog.naver.com/jjys9047?Redirect=Log&logNo=221584977592

웹 구조를 이해한 자의 웹크롤링은 데이터를 다루는 디테일부터 다르다. By >파이썬을 활용한 실전 웹크롤링과 자동화 CAMP 박두진 강사님
http://blog.fastcampus.co.kr/221586197326

[Week 1] 데이터 사이언스 기초: 웹페이지에서 데이터 수집하기 (데이터 크롤링)
https://piry777.blog.me/221662360000

[파이썬 활용] 크롤링
https://blog.naver.com/mathesis_time?Redirect=Log&logNo=221525076829

▣ 웹크롤링 / 스크래핑 프로그램 OCTOPARSE 사용기
https://blog.naver.com/no1_devicemart?Redirect=Log&logNo=221539107537

발리 여행, 길리 트라왕안, 파티섬, 펍 크롤링, Pub Crawling 후기
https://blog.naver.com/grang353?Redirect=Log&logNo=221576202119

광고,홍보 위주로 활용이 가능한 웹크롤링 젠서버 컴퓨터 입니다.
https://blog.naver.com/kukuri0_0?Redirect=Log&logNo=221510474601

[Python] 파이썬 웹 크롤링 #1. 네이버 실시간 검색어 가져오기
https://dsz08082.blog.me/221587474567

10.1 R로 다음(Daum) 네티즌 리뷰 크롤링하기
https://blog.naver.com/pmw9440?Redirect=Log&logNo=221590746010

에브리타임 자동 크롤링 / 봇 시스템
https://blog.naver.com/kbs4674?Redirect=Log&logNo=221460241196
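The example above relies on quote_plus to make the search term safe to place in a query string. As a short sketch of what that call does, the snippet below encodes one term with a space and one Korean term, then decodes the latter again. The variable names are chosen for this illustration.

```python
from urllib.parse import quote_plus, unquote_plus

base_url = "https://search.naver.com/search.naver?where=post&sm=tab_jum&query="

# Spaces become '+', and non-ASCII characters become %XX escapes of their UTF-8 bytes.
encoded_spaces = quote_plus("web crawling")
encoded_korean = quote_plus("크롤링")

url = base_url + encoded_korean
decoded = unquote_plus(encoded_korean)  # round-trips back to the original term
print(encoded_spaces, encoded_korean, decoded)
```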

cf) Fetching blog posts across multiple pages

from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
from urllib.parse import quote_plus

plusUrl = quote_plus(input('Enter a search term: '))
pageNum = 1
count = 1

pages = input('How many pages should be crawled? : ')
lastPage = int(pages) * 10 - 9  # start offset of the last page (1, 11, 21, ...)
while pageNum < lastPage + 1:
    url = f'https://search.naver.com/search.naver?date_from=&date_option=0&date_to=&dup_remove=1&nso=&post_blogurl=&post_blogurl_without=&query={plusUrl}&sm=tab_pge&srchby=all&st=sim&where=post&start={pageNum}'
    html = urlopen(url)
    soup = bs(html, "html.parser")

    # Collect every element that matches the condition
    title = soup.find_all(class_='sh_blog_title')

    print(f'--- Results for page {count} --------')
    for i in title:
        print(i.attrs['title'])
        print(i.attrs['href'])
        print()
    pageNum += 10
    count += 1
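The loop above advances the start parameter by 10 for each page, so page n begins at offset 10 * (n - 1) + 1. The same arithmetic can be isolated in a small function; page_starts is a hypothetical name used only in this sketch.

```python
def page_starts(page_count):
    """Return the start offsets for the first page_count result pages.

    Mirrors the loop above: the first page starts at 1 and each later
    page starts 10 results further on.
    """
    return [1 + 10 * page for page in range(page_count)]


offsets = page_starts(3)
print(offsets)
```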

Example 5) Saving Naver image search results

from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
from urllib.parse import quote_plus

baseUrl = 'https://search.naver.com/search.naver?where=image&sm=tab_jum&query='
plusUrl = input('Enter a search term: ')
# Percent-encode the search term so Korean text is safe in the URL
url = baseUrl + quote_plus(plusUrl)
html = urlopen(url)
soup = bs(html, "html.parser")
img = soup.find_all(class_='_img')

n = 1
for i in img:
    imgUrl = i['data-source']
    with urlopen(imgUrl) as f:
        # The ./img directory must already exist; 'wb' = write, binary
        with open('./img/' + plusUrl + str(n)+'.jpg','wb') as h:
            data = f.read()
            h.write(data)
    n += 1
print('Download complete')
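The script above writes into ./img/, and open() fails if that directory does not exist. One way to handle this, sketched below, is to create the directory before building each file path. The helper name image_path is hypothetical, and a temporary directory stands in for ./img so the sketch does not touch the working directory.

```python
import os
import tempfile


def image_path(directory, keyword, index):
    """Create directory if needed and return the path for the index-th image.

    image_path is a hypothetical helper written for this sketch.
    """
    os.makedirs(directory, exist_ok=True)  # no error if it already exists
    return os.path.join(directory, f"{keyword}{index}.jpg")


# A temporary directory stands in for ./img in this illustration.
base_dir = os.path.join(tempfile.mkdtemp(), "img")
path = image_path(base_dir, "sample", 1)

# Placeholder bytes stand in for the downloaded image data.
with open(path, "wb") as handle:
    handle.write(b"placeholder bytes")
```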

Example 6) Downloading images from an Instagram hashtag search

ChromeDriver download link

cf) Install the ChromeDriver version that matches your Chrome version (a mismatch causes an error).

Instagram renders its pages with JavaScript, so BeautifulSoup alone cannot crawl it.

-> Use selenium

pip install selenium

from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from urllib.parse import quote_plus
import time


baseUrl = 'https://www.instagram.com/explore/tags/'
plusUrl = input('Enter a tag to search for: ')
# Percent-encode the tag so Korean text is safe in the URL
url = baseUrl + quote_plus(plusUrl)

# Start the Chrome driver
driver = webdriver.Chrome()
driver.get(url)

# Wait for the page to finish loading
time.sleep(3)

html = driver.page_source
soup = bs(html, "html.parser")

insta = soup.select('.v1Nh3.kIKUG._bz0w') # post elements
# print(insta[0]) # print only the first item
n = 1
for i in insta:
    print('https://www.instagram.com' + i.a['href'])
    imgUrl = i.select_one('.KL4Bh').img['src']
    with urlopen(imgUrl) as f:
        with open('./img/' + plusUrl + str(n)+'.jpg','wb') as h:
            img = f.read()
            h.write(img)
    n += 1
    print(imgUrl)
    print()
driver.close()

Example 7) Saving Naver blog search results to a CSV (Excel) file

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
from urllib.parse import quote_plus

# class="api_txt_lines total_tit"
# - class of the titles in the Naver mobile VIEW tab

search = input('Enter a search term: ')

url = f'https://m.search.naver.com/search.naver?where=m_view&sm=mtb_jum&query={quote_plus(search)}'

html = urlopen(url).read()
soup = bs(html, "html.parser")

total = soup.select('.api_txt_lines.total_tit')
searchList = []

for i in total:
    temp = []
    temp.append(i.text) # title
    temp.append(i.attrs['href']) # link
    searchList.append(temp)
# Excel garbles the text when the file is plain UTF-8, so write it as cp949
f = open(f'{search}.csv', 'w', encoding = 'cp949', newline='')
csvWriter = csv.writer(f)
for i in searchList:
    # Write one row at a time
    csvWriter.writerow(i)
f.close()

print('Done.')
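The script above writes the file as cp949, which raises a UnicodeEncodeError for characters outside that code page (emoji in a blog title, for example). A possible alternative, sketched here, is the utf-8-sig codec: it writes a byte order mark at the start of the file, which Excel generally uses to recognize UTF-8. The rows and the file name are made up for illustration.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for searchList.
rows = [["제목", "https://example.com/1"], ["Title", "https://example.com/2"]]
csv_path = os.path.join(tempfile.mkdtemp(), "result.csv")

# utf-8-sig prepends a BOM; newline="" lets the csv module control line endings.
with open(csv_path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)

# Check that the file starts with the UTF-8 byte order mark.
with open(csv_path, "rb") as f:
    has_bom = f.read(3) == b"\xef\xbb\xbf"

# Reading with the same codec strips the BOM again.
with open(csv_path, encoding="utf-8-sig", newline="") as f:
    read_back = list(csv.reader(f))
```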

Example 8) Saving Google search results to a CSV (Excel) file

import csv
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from urllib.parse import quote_plus


baseUrl = 'https://www.google.co.kr/search?q='
plusUrl = input('Enter a search term: ')
# Percent-encode the search term so Korean text is safe in the URL
url = baseUrl + quote_plus(plusUrl)

driver = webdriver.Chrome()
driver.get(url)


html = driver.page_source
soup = bs(html, "html.parser")

r = soup.select('.r')
searchList = []

for i in r:
    temp = []
    temp.append(i.select_one('.LC20lb').text) # title
    temp.append(i.a.attrs['href']) # link
    searchList.append(temp)

driver.close()

# Excel garbles the text when the file is plain UTF-8, so write it as cp949
f = open(f'{plusUrl}.csv', 'w', encoding = 'cp949', newline='')
csvWriter = csv.writer(f)
for i in searchList:
    # Write one row at a time
    csvWriter.writerow(i)
f.close()

print('Done.')

The 10 results returned by a Google search for 크롤링 ("crawling") were saved.
