Crawling Naver Real Estate News Articles

by Adonis_ 2021. 10. 13. 22:06

This script crawls the current day's Naver real estate news articles and saves them to a CSV file.

import requests
from pandas import DataFrame
from bs4 import BeautifulSoup
from datetime import datetime
import os

header = {'User-Agent': 'Mozilla/5.0'}

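# Build two strings from the current time: YYYYMMDD for the URL and a timestamp for the file name.
date = str(datetime.now())                       # e.g. '2021-10-13 22:06:30.123456'
date_str = date.split()[0].replace('-','')       # '20211013'
date = date[:date.rfind(':')].replace(' ', '_')  # '2021-10-13_22:06'
date = date.replace(':','시') + '분'              # '2021-10-13_22시06분' (시 = hour, 분 = minute)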

news_url = 'https://news.naver.com/main/list.naver?mode=LS2D&mid=shm&sid2=260&sid1=101&date={}'
news_url = news_url.format(date_str)

print(news_url)
req = requests.get(news_url, headers=header)
soup = BeautifulSoup(req.text, 'html.parser')

news_dict = {} # result 
idx = 0
cur_page = 1

print('Crawling...')

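# Paging bar of the first page. The +1 below counts the current page,
# which is assumed not to be rendered as a link.
pages = soup.find('div', {'class' : 'paging'})
page_links = pages.find_all('a')
total_pages = len(page_links) + 1

while cur_page <= total_pages:
  table = soup.find('div',{'class' : 'list_body newsflash_body'})
  li_list = table.find_all('li')
  
  # Take the first <dt> of each list item and the link inside it.
  area_list = [li.find('dt') for li in li_list]
  # Selecting only <dt class="photo"> raises an error when the list includes articles without a photo.
  # area_list = [li.find('dt', {'class' : 'photo'}) for li in li_list] 
  a_list = [area.find('a') for area in area_list]

  for n in a_list:
    try:
      # Link with a thumbnail: the title is in the image's alt text.
      news_dict[idx] = {'title' : n.find('img').get('alt').strip(),'url' : n.get('href')}
    except AttributeError:
      # No usable <img> in the link: fall back to the link text.
      news_dict[idx] = {'title' : n.text.strip(),'url' : n.get('href')}
    idx += 1

  # Fetch the next page, if there is one, by following its link in the paging bar.
  if cur_page < total_pages:
    next_page_url = [p for p in page_links if p.text == str(cur_page+1)][0].get('href')
    req = requests.get('https://news.naver.com/main/list.naver' + next_page_url, headers=header)
    soup = BeautifulSoup(req.text, 'html.parser')
  cur_page += 1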

print('Crawling finished')

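news_df = DataFrame(news_dict).T  # one row per article, with columns 'title' and 'url'

folder_path = os.getcwd()  # save into the current working directory

# 네이버뉴스 = "Naver News"
csv_file_name = '네이버뉴스_{}.csv'.format(date)
# euc-kr suits Korean-locale Excel, but to_csv raises UnicodeEncodeError if a title
# has a character outside that encoding; 'utf-8-sig' is a common alternative.
news_df.to_csv(os.path.join(folder_path, csv_file_name), index=False, encoding='euc-kr')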
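
The date handling in the script slices the text form of datetime.now(). As a minimal sketch, the same two strings can be produced with strftime, which also makes the step easy to check against a fixed time. The names build_names and LIST_URL are hypothetical helpers introduced here for illustration and are not part of the original script. The Korean suffixes are added outside strftime, on the assumption that keeping non-ASCII text out of the format string avoids platform-specific surprises.

```python
from datetime import datetime

# Same list URL as in the script, with a placeholder for the YYYYMMDD date.
LIST_URL = ('https://news.naver.com/main/list.naver'
            '?mode=LS2D&mid=shm&sid2=260&sid1=101&date={}')


def build_names(now):
    """Return (list URL, CSV file name) for the given datetime (hypothetical helper)."""
    date_str = now.strftime('%Y%m%d')  # e.g. '20211013', used in the query string
    # Format the ASCII parts with strftime and append the Korean suffixes
    # (시 = hour, 분 = minute) separately.
    stamp = '{}시{}분'.format(now.strftime('%Y-%m-%d_%H'), now.strftime('%M'))
    return LIST_URL.format(date_str), '네이버뉴스_{}.csv'.format(stamp)


# Fixed time for a reproducible example; the script itself would pass datetime.now().
url, file_name = build_names(datetime(2021, 10, 13, 22, 6))
print(url)        # ...&sid1=101&date=20211013
print(file_name)  # 네이버뉴스_2021-10-13_22시06분.csv
```

Passing the time in as an argument, rather than calling datetime.now() inside the helper, is what allows the output to be checked for a known date.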
