1. Raw Data Scraping
Work performed on a local PC.
Load Packages
import requests
from urllib import request
from bs4 import BeautifulSoup
from PIL import Image
import re
import pandas as pd
import numpy as np
from tqdm import tqdm
from selenium import webdriver
import time
tqdm().pandas()
Scraping Order
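- Build a separate DataFrame for each genre (7 DataFrames in total, 200 records each)
- Each DataFrame holds the movie code, Korean title, English title, year, netizen rating, viewing grade, link, and genre; the poster images are stored in a separate array per genre
- To collect the details, crawl pages 1 to 4 of each genre's rating-ordered ranking page
- Every genre has ranked movies through at least page 4, so collect those (200 per genre * 7 genres = 1,400 records)
- Genre codes: Drama: 1, Horror: 4, Melodrama/Romance: 5, Comedy: 11, Animation: 15, Crime: 16, Action: 19
- While crawling, open each movie's detail page and fetch the Korean title, English title, year, netizen rating, viewing grade, image, and genre
- Repeat for each genre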
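The plan above can be sketched as a small helper that maps each genre to its NAVER genre code and builds the four ranking-page URLs. The genre codes and the URL template are the ones used in this notebook; the names `GENRE_CODES`, `ranking_urls`, and `poster_index` are hypothetical, and the figure of 50 movies per page is an assumption derived from 200 records over 4 pages.

```python
# Genre codes as listed in the scraping plan.
GENRE_CODES = {
    'drama': 1, 'horror': 4, 'romance': 5, 'comedy': 11,
    'animation': 15, 'crime': 16, 'action': 19,
}

# Ranking URL template used in this notebook (ranking date fixed to 20200616).
RANK_URL = ('https://movie.naver.com/movie/sdb/rank/rmovie.nhn'
            '?sel=pnt&date=20200616&tg={genre}&page={page}')

PAGES = 4             # ranking pages crawled per genre
MOVIES_PER_PAGE = 50  # assumption: 200 records / 4 pages


def ranking_urls(genre_name):
    """Return the ranking-page URLs (pages 1..4) for one genre."""
    code = GENRE_CODES[genre_name]
    return [RANK_URL.format(genre=code, page=p) for p in range(1, PAGES + 1)]


def poster_index(page_idx, row_idx):
    """Row in the per-genre poster array for a 0-based page and row."""
    return page_idx * MOVIES_PER_PAGE + row_idx


if __name__ == '__main__':
    for url in ranking_urls('drama'):
        print(url)
```

Keeping the codes in one dictionary would let a single loop cover all seven genres instead of repeating the crawl cell per genre.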
NAVER movie login
Movies that require adult verification cannot be opened directly without being logged in, so log in beforehand.
login_url = 'https://nid.naver.com/nidlogin.login'
id_key = 'kmoshn815'
pw_key = '********'
driver = webdriver.Chrome(executable_path='/Users/ohhyunkwon/Documents/2020 study/etc/chromedriver')
driver.get(login_url)
driver.execute_script("document.getElementsByName('id')[0].value=\'" + id_key + "\'")
driver.execute_script("document.getElementsByName('pw')[0].value=\'" + pw_key + "\'")
driver.find_element_by_xpath('//*[@id="frmNIDLogin"]/fieldset/input').click()
driver.find_element_by_xpath('//*[@id="new.dontsave"]').click()
# Basic NAVER movie url
basic_url = 'https://movie.naver.com'
# Basic ranking url (genre and page are template variables)
rank_url = 'https://movie.naver.com/movie/sdb/rank/rmovie.nhn?sel=pnt&date=20200616&tg={genre}&page={page}'
a. Drama
drama_df = pd.DataFrame(columns=['code', 'title_kor', 'title_eng', 'year',
'rating', 'rank', 'link', 'genre'])
drama_df
code | title_kor | title_eng | year | rating | rank | link | genre |
---|---|---|---|---|---|---|---|
poster_drama = np.memmap('images/poster_drama', dtype=np.uint8, mode='w+', shape=(200, 256, 256, 3))
# Crawl 4 ranking pages
for l in range(4):
    print('page ' + str(l + 1) + ' crawling...')
    # Open one ranking page
    driver.get(rank_url.format(genre=1, page=l + 1))
    html = driver.page_source
    time.sleep(1)
    soup = BeautifulSoup(html, 'lxml')
    for k in tqdm(range(len(soup.find('table', class_='list_ranking').find_all('div', class_='tit5')))):
        # link
        link = basic_url + soup.find('table', class_='list_ranking').find_all('div', class_='tit5')[k].a.get('href')
        # rating
        rating = soup.find_all('td', class_='point')[k].text
        # code
        code_start = re.search('code=', link).span()[1]
        code = link[code_start:]
        # Open the movie's detail page
        driver.get(link)
        link_html = driver.page_source
        time.sleep(1)
        link_soup = BeautifulSoup(link_html, 'lxml')
        # Movie info block
        movie_info = link_soup.find('div', class_='mv_info')
        # title_kor
        title_kor = movie_info.h3.a.text
        # title_eng
        title_eng = movie_info.strong.text.split(',')[0].strip()
        # year
        year = movie_info.strong.text.split(',')[-1].strip()
        # rank (viewing grade)
        rank = movie_info.find('p').find_all('span')[4].find_all('a')[0].text
        # genre
        genre = []
        for i in range(len(movie_info.find('p', class_='info_spec').span.text.split(','))):
            genre.append(movie_info.find('p', class_='info_spec').span.text.split(',')[i].strip())
        # poster: drop the query string from the thumbnail URL
        poster_thumbnail = link_soup.find('div', class_='poster').a.img.get('src')
        poster_end = re.search(r'\?', poster_thumbnail).span()[0]
        poster_link = poster_thumbnail[:poster_end]
        poster_index = l * 50 + k
        poster_drama[poster_index, :, :, :] = np.asarray(Image.open(request.urlopen(poster_link)).resize((256, 256)))[:, :, :3]
        # Append the record to the DataFrame
        drama_df = drama_df.append({'code': code, 'title_kor': title_kor, 'title_eng': title_eng,
                                    'year': year, 'rating': rating, 'rank': rank, 'link': link,
                                    'genre': genre}, ignore_index=True)
print('Completed!')
page 1 crawling...
100%|██████████| 50/50 [01:25<00:00, 1.71s/it]
page 2 crawling...
100%|██████████| 50/50 [01:20<00:00, 1.61s/it]
page 3 crawling...
100%|██████████| 50/50 [01:20<00:00, 1.60s/it]
page 4 crawling...
100%|██████████| 50/50 [01:23<00:00, 1.67s/it]
Completed!
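The crawl loop finds the movie code and the end of the poster URL through regular-expression offsets. A minimal alternative sketch, assuming the detail link carries the code in a `code` query parameter (as the `code=` search implies), uses `urllib.parse` instead. The function names and the example URLs are hypothetical and only illustrate the shape of the inputs.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs


def movie_code(link):
    """Return the value of the `code` query parameter of a detail-page link."""
    query = parse_qs(urlsplit(link).query)
    return query['code'][0]


def strip_query(url):
    """Return the URL without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))


if __name__ == '__main__':
    # Hypothetical example inputs, not real NAVER URLs.
    print(movie_code('https://example.com/movie/basic.nhn?code=171539'))
    print(strip_query('https://example.com/poster/1.jpg?type=m203_290_2'))
```

Unlike slicing everything after `code=`, this keeps only the `code` value even if the link later gains extra query parameters.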
driver.close()
poster_drama.flush()
drama_df
 | code | title_kor | title_eng | year | rating | rank | link | genre |
---|---|---|---|---|---|---|---|---|
0 | 171539 | 그린 북 | Green Book | 2018 | 9.59 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마] |
1 | 174830 | 가버나움 | Capharnaum | 2018 | 9.58 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마] |
2 | 151196 | 원더 | Wonder | 2017 | 9.49 | 전체 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마] |
3 | 169240 | 아일라 | Ayla: The Daughter of War | 2017 | 9.48 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마, 전쟁] |
4 | 157243 | 당갈 | Dangal | 2016 | 9.47 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마, 액션] |
... | ... | ... | ... | ... | ... | ... | ... | ... |
195 | 34566 | 루키 | The Rookie | 2002 | 8.95 | 전체 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마] |
196 | 16792 | 흐르는 강물처럼 | A River Runs Through It | 1992 | 8.95 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마] |
197 | 129046 | 리틀 보이 | Little Boy | 2015 | 8.95 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마, 전쟁] |
198 | 83160 | 씨민과 나데르의 별거 | Jodaeiye Nader Az Simin | 2011 | 8.95 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마] |
199 | 27109 | 소년은 울지 않는다 | Boys Don't Cry | 1999 | 8.95 | 청소년 관람불가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마] |
200 rows × 8 columns
drama_df.to_csv('data/drama_df.csv', index=False)
np.save('images/poster_drama.npy', poster_drama)
poster_d = np.load('images/poster_drama.npy')
print(poster_d.shape)
poster_d
(200, 256, 256, 3)
array([[[[ 0, 100, 116],
[ 0, 100, 116],
[ 0, 100, 117],
...,
[ 1, 115, 141],
[ 0, 116, 141],
[ 1, 117, 142]],
[[ 0, 100, 116],
[ 0, 100, 115],
[ 0, 101, 118],
...,
[ 0, 115, 141],
[ 0, 116, 141],
[ 0, 116, 141]],
[[ 0, 101, 116],
[ 0, 101, 116],
[ 0, 101, 118],
...,
[ 1, 117, 142],
[ 1, 117, 142],
[ 1, 117, 142]],
...,
[[ 0, 106, 130],
[ 1, 107, 131],
[ 0, 108, 132],
...,
[ 0, 105, 127],
[ 0, 105, 127],
[ 0, 105, 127]],
[[ 0, 106, 130],
[ 0, 106, 130],
[ 1, 107, 131],
...,
[ 1, 104, 127],
[ 1, 104, 126],
[ 1, 104, 127]],
[[ 0, 106, 130],
[ 0, 106, 130],
[ 1, 107, 131],
...,
[ 1, 104, 127],
[ 1, 104, 127],
[ 2, 104, 127]]],
[[[227, 221, 218],
[231, 231, 227],
[237, 238, 239],
...,
[ 87, 56, 93],
[ 89, 58, 94],
[ 89, 57, 92]],
[[231, 229, 226],
[232, 231, 227],
[237, 237, 236],
...,
[ 89, 58, 95],
[ 88, 58, 93],
[ 89, 58, 94]],
[[236, 235, 233],
[236, 236, 235],
[237, 237, 236],
...,
[ 88, 58, 96],
[ 88, 58, 95],
[ 89, 58, 94]],
...,
[[104, 129, 151],
[105, 128, 150],
[106, 130, 152],
...,
[125, 125, 84],
[123, 124, 94],
[ 97, 104, 102]],
[[103, 129, 151],
[102, 128, 150],
[104, 129, 152],
...,
[115, 118, 87],
[111, 115, 91],
[ 87, 96, 99]],
[[100, 129, 149],
[100, 128, 149],
[103, 128, 150],
...,
[106, 112, 85],
[ 97, 106, 86],
[ 79, 85, 93]]],
[[[242, 225, 217],
[241, 223, 213],
[241, 222, 212],
...,
[250, 250, 231],
[253, 252, 248],
[253, 253, 252]],
[[252, 252, 252],
[255, 255, 255],
[255, 255, 255],
...,
[255, 255, 255],
[255, 255, 255],
[255, 255, 255]],
[[247, 234, 229],
[255, 255, 255],
[255, 255, 255],
...,
[255, 255, 255],
[255, 255, 255],
[255, 255, 255]],
...,
[[100, 97, 86],
[ 98, 95, 85],
[ 96, 95, 85],
...,
[114, 102, 86],
[111, 98, 79],
[114, 97, 77]],
[[100, 99, 90],
[ 89, 88, 79],
[ 76, 73, 65],
...,
[114, 100, 81],
[118, 102, 83],
[119, 100, 80]],
[[ 90, 89, 83],
[ 89, 86, 80],
[ 88, 83, 76],
...,
[102, 90, 73],
[104, 92, 76],
[103, 88, 74]]],
...,
[[[ 16, 25, 42],
[ 16, 25, 42],
[ 16, 25, 42],
...,
[ 16, 25, 42],
[ 16, 25, 42],
[ 16, 25, 41]],
[[ 16, 25, 42],
[ 16, 25, 42],
[ 16, 25, 42],
...,
[ 16, 25, 42],
[ 16, 25, 42],
[ 16, 25, 42]],
[[ 16, 25, 42],
[ 16, 25, 42],
[ 16, 25, 42],
...,
[ 16, 25, 42],
[ 16, 25, 42],
[ 16, 25, 42]],
...,
[[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0],
...,
[ 0, 0, 0],
[ 0, 0, 0],
[ 1, 1, 1]],
[[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0],
...,
[ 0, 0, 0],
[ 0, 0, 0],
[ 1, 1, 1]],
[[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0],
...,
[ 0, 0, 0],
[ 0, 0, 0],
[ 1, 1, 1]]],
[[[ 77, 65, 62],
[ 74, 63, 59],
[ 72, 61, 56],
...,
[ 93, 83, 74],
[ 93, 83, 73],
[ 97, 87, 78]],
[[ 78, 66, 60],
[ 74, 64, 58],
[ 71, 61, 56],
...,
[ 92, 82, 73],
[ 92, 82, 73],
[ 98, 88, 79]],
[[ 78, 66, 59],
[ 74, 63, 58],
[ 71, 60, 56],
...,
[ 92, 82, 73],
[ 93, 83, 74],
[ 97, 87, 78]],
...,
[[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0],
...,
[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0]],
[[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0],
...,
[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0]],
[[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0],
...,
[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0]]],
[[[146, 125, 54],
[142, 121, 80],
[146, 132, 95],
...,
[252, 243, 142],
[252, 243, 142],
[253, 243, 144]],
[[225, 196, 110],
[182, 155, 92],
[151, 130, 82],
...,
[252, 243, 142],
[252, 243, 142],
[253, 243, 144]],
[[224, 182, 102],
[229, 193, 106],
[221, 195, 108],
...,
[252, 243, 142],
[252, 243, 142],
[252, 242, 144]],
...,
[[205, 90, 43],
[199, 83, 36],
[206, 90, 43],
...,
[ 0, 0, 0],
[ 0, 0, 0],
[ 5, 5, 5]],
[[203, 88, 41],
[191, 75, 28],
[196, 80, 33],
...,
[ 0, 0, 0],
[ 0, 0, 0],
[ 4, 4, 4]],
[[209, 96, 48],
[195, 81, 34],
[197, 83, 36],
...,
[ 3, 3, 3],
[ 2, 2, 2],
[ 9, 9, 9]]]], dtype=uint8)
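Each genre's posters are first written into a raw `np.memmap` file and then also saved as `.npy`. A minimal sketch, assuming a small stand-in shape and a temporary file path (both hypothetical), shows that the raw memmap file can be reopened directly as long as the same dtype and shape are supplied, because the file itself stores no header.

```python
import os
import tempfile

import numpy as np

# Small stand-in for the notebook's (200, 256, 256, 3) poster array.
shape = (4, 8, 8, 3)
path = os.path.join(tempfile.mkdtemp(), 'poster_demo')

# Write phase: create the backing file and fill the first "poster".
buf = np.memmap(path, dtype=np.uint8, mode='w+', shape=shape)
buf[0] = 255
buf.flush()  # push pending changes to disk

# Read phase: dtype and shape must be given again, the raw file has no metadata.
reopened = np.memmap(path, dtype=np.uint8, mode='r', shape=shape)
print(reopened.shape, os.path.getsize(path))
```

The `.npy` copy written by `np.save` is the more portable of the two, since it records dtype and shape in its header.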
b. Horror
driver = webdriver.Chrome(executable_path='/Users/ohhyunkwon/Documents/2020 study/etc/chromedriver')
driver.get(login_url)
time.sleep(1)
driver.execute_script("document.getElementsByName('id')[0].value=\'" + id_key + "\'")
driver.execute_script("document.getElementsByName('pw')[0].value=\'" + pw_key + "\'")
driver.find_element_by_xpath('//*[@id="frmNIDLogin"]/fieldset/input').click()
time.sleep(1)
driver.find_element_by_xpath('//*[@id="new.dontsave"]').click()
horror_df = pd.DataFrame(columns=['code', 'title_kor', 'title_eng', 'year',
'rating', 'rank', 'link', 'genre'])
horror_df
code | title_kor | title_eng | year | rating | rank | link | genre |
---|---|---|---|---|---|---|---|
poster_horror = np.memmap('images/poster_horror', dtype=np.uint8, mode='w+', shape=(200, 256, 256, 3))
# Crawl 4 ranking pages
for l in range(4):
    print('page ' + str(l + 1) + ' crawling...')
    # Open one ranking page
    driver.get(rank_url.format(genre=4, page=l + 1))
    html = driver.page_source
    time.sleep(1)
    soup = BeautifulSoup(html, 'lxml')
    for k in tqdm(range(len(soup.find('table', class_='list_ranking').find_all('div', class_='tit5')))):
        # link
        link = basic_url + soup.find('table', class_='list_ranking').find_all('div', class_='tit5')[k].a.get('href')
        # rating
        rating = soup.find_all('td', class_='point')[k].text
        # code
        code_start = re.search('code=', link).span()[1]
        code = link[code_start:]
        # Open the movie's detail page
        driver.get(link)
        link_html = driver.page_source
        time.sleep(1)
        link_soup = BeautifulSoup(link_html, 'lxml')
        # Movie info block
        movie_info = link_soup.find('div', class_='mv_info')
        # title_kor
        title_kor = movie_info.h3.a.text
        # title_eng
        title_eng = movie_info.strong.text.split(',')[0].strip()
        # year
        year = movie_info.strong.text.split(',')[-1].strip()
        # rank (viewing grade)
        rank = movie_info.find('p').find_all('span')[4].find_all('a')[0].text
        # genre
        genre = []
        for i in range(len(movie_info.find('p', class_='info_spec').span.text.split(','))):
            genre.append(movie_info.find('p', class_='info_spec').span.text.split(',')[i].strip())
        # poster: drop the query string from the thumbnail URL
        poster_thumbnail = link_soup.find('div', class_='poster').a.img.get('src')
        poster_end = re.search(r'\?', poster_thumbnail).span()[0]
        poster_link = poster_thumbnail[:poster_end]
        poster_index = l * 50 + k
        poster_horror[poster_index, :, :, :] = np.asarray(Image.open(request.urlopen(poster_link)).resize((256, 256)))[:, :, :3]
        # Append the record to the DataFrame
        horror_df = horror_df.append({'code': code, 'title_kor': title_kor, 'title_eng': title_eng,
                                      'year': year, 'rating': rating, 'rank': rank, 'link': link,
                                      'genre': genre}, ignore_index=True)
print('Completed!')
page 1 crawling...
100%|██████████| 50/50 [01:42<00:00, 2.06s/it]
page 2 crawling...
100%|██████████| 50/50 [01:37<00:00, 1.96s/it]
page 3 crawling...
100%|██████████| 50/50 [01:41<00:00, 2.03s/it]
page 4 crawling...
100%|██████████| 50/50 [01:48<00:00, 2.18s/it]
Completed!
driver.close()
poster_horror.flush()
horror_df
 | code | title_kor | title_eng | year | rating | rank | link | genre |
---|---|---|---|---|---|---|---|---|
0 | 17254 | 뱀파이어와의 인터뷰 | Interview With The Vampire: The Vampire Chroni... | 1994 | 9.12 | 청소년 관람불가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [공포, 드라마] |
1 | 10037 | 에이리언 | Alien | 1979 | 9.12 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [공포, SF] |
2 | 10050 | 싸이코 | Psycho | 1960 | 9.11 | 청소년 관람불가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [공포, 스릴러, 미스터리] |
3 | 10029 | 죠스 | Jaws | 1975 | 8.90 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [공포, 스릴러] |
4 | 126389 | 무서운 집 | Scary house | 2014 | 8.89 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [공포] |
... | ... | ... | ... | ... | ... | ... | ... | ... |
195 | 172003 | 속닥속닥 | The Whispering | 2018 | 5.06 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [공포, 미스터리] |
196 | 65535 | 해부학 교실 | Cadaver | 2007 | 5.06 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [공포, 미스터리] |
197 | 65901 | 할로윈: 살인마의 탄생 | Halloween | 2007 | 5.06 | 청소년 관람불가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [공포] |
198 | 63202 | 커버넌트 | The Covenant | 2006 | 5.05 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [공포, 스릴러] |
199 | 125436 | 포레스트: 죽음의 숲 | The Forest | 2016 | 5.04 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [공포] |
200 rows × 8 columns
horror_df.to_csv('data/horror_df.csv', index=False)
np.save('images/poster_horror.npy', poster_horror)
poster_h = np.load('images/poster_horror.npy')
print(poster_h.shape)
poster_h
(200, 256, 256, 3)
array([[[[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0],
...,
[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0]],
[[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0],
...,
[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0]],
[[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0],
...,
[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0]],
...,
[[ 15, 5, 9],
[ 12, 3, 6],
[ 8, 3, 4],
...,
[ 12, 8, 5],
[ 13, 9, 6],
[ 13, 9, 6]],
[[ 14, 4, 8],
[ 12, 3, 6],
[ 8, 3, 4],
...,
[ 12, 8, 5],
[ 14, 10, 7],
[ 14, 10, 7]],
[[ 12, 2, 6],
[ 10, 1, 4],
[ 6, 1, 2],
...,
[ 12, 8, 5],
[ 12, 8, 5],
[ 13, 9, 6]]],
[[[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0],
...,
[ 1, 0, 1],
[ 1, 1, 0],
[ 1, 2, 1]],
[[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0],
...,
[ 2, 0, 1],
[ 1, 0, 1],
[ 6, 12, 6]],
[[ 0, 1, 0],
[ 0, 0, 0],
[ 0, 0, 0],
...,
[ 0, 0, 0],
[ 0, 0, 0],
[ 1, 0, 0]],
...,
[[212, 182, 128],
[ 91, 66, 51],
[ 3, 0, 0],
...,
[ 45, 53, 27],
[ 61, 71, 52],
[ 78, 89, 76]],
[[122, 108, 32],
[ 71, 53, 18],
[ 12, 3, 0],
...,
[ 39, 34, 5],
[ 30, 36, 4],
[ 41, 52, 23]],
[[181, 147, 125],
[129, 104, 83],
[ 78, 55, 48],
...,
[ 58, 63, 40],
[ 78, 88, 74],
[ 59, 64, 63]]],
[[[102, 2, 2],
[ 98, 2, 1],
[100, 1, 1],
...,
[ 64, 2, 2],
[ 62, 2, 2],
[ 62, 2, 2]],
[[103, 1, 2],
[102, 1, 2],
[104, 2, 2],
...,
[ 61, 3, 2],
[ 62, 3, 2],
[ 62, 3, 2]],
[[104, 1, 2],
[104, 1, 2],
[104, 1, 2],
...,
[ 61, 3, 2],
[ 60, 3, 2],
[ 61, 3, 2]],
...,
[[ 16, 0, 1],
[ 16, 0, 1],
[ 16, 0, 1],
...,
[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0]],
[[ 16, 0, 1],
[ 16, 0, 1],
[ 17, 1, 2],
...,
[ 2, 0, 1],
[ 2, 0, 1],
[ 1, 1, 1]],
[[ 17, 1, 2],
[ 17, 1, 2],
[ 18, 1, 3],
...,
[ 13, 1, 3],
[ 20, 1, 1],
[ 31, 2, 4]]],
...,
[[[ 3, 4, 9],
[ 1, 2, 7],
[ 2, 3, 8],
...,
[ 4, 4, 6],
[ 3, 3, 5],
[ 1, 1, 3]],
[[ 3, 4, 9],
[ 1, 2, 7],
[ 2, 3, 8],
...,
[ 4, 4, 6],
[ 3, 3, 5],
[ 1, 1, 3]],
[[ 2, 3, 7],
[ 2, 3, 7],
[ 1, 2, 6],
...,
[ 4, 4, 6],
[ 3, 3, 5],
[ 2, 2, 4]],
...,
[[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0],
...,
[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0]],
[[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0],
...,
[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0]],
[[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0],
...,
[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0]]],
[[[ 11, 40, 58],
[ 9, 38, 56],
[ 10, 37, 56],
...,
[ 14, 24, 36],
[ 14, 24, 36],
[ 15, 25, 36]],
[[ 9, 38, 56],
[ 7, 36, 54],
[ 9, 36, 55],
...,
[ 13, 23, 35],
[ 13, 23, 36],
[ 13, 22, 37]],
[[ 9, 38, 56],
[ 7, 36, 54],
[ 8, 35, 54],
...,
[ 14, 24, 36],
[ 14, 24, 36],
[ 14, 23, 37]],
...,
[[ 4, 4, 4],
[ 0, 0, 0],
[ 0, 0, 0],
...,
[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0]],
[[ 4, 4, 4],
[ 0, 0, 0],
[ 0, 0, 0],
...,
[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0]],
[[ 4, 4, 4],
[ 0, 0, 0],
[ 0, 0, 0],
...,
[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0]]],
[[[168, 158, 146],
[168, 158, 146],
[167, 157, 145],
...,
[173, 165, 152],
[173, 165, 152],
[172, 164, 151]],
[[169, 159, 147],
[169, 159, 147],
[168, 158, 146],
...,
[174, 166, 153],
[173, 165, 152],
[173, 165, 152]],
[[169, 159, 147],
[169, 159, 147],
[168, 158, 146],
...,
[174, 166, 153],
[174, 166, 153],
[173, 165, 152]],
...,
[[203, 197, 185],
[203, 197, 185],
[203, 197, 185],
...,
[204, 198, 186],
[204, 198, 186],
[204, 198, 186]],
[[203, 197, 185],
[203, 197, 185],
[203, 197, 185],
...,
[204, 198, 186],
[204, 198, 186],
[204, 198, 186]],
[[203, 197, 185],
[202, 196, 184],
[202, 196, 184],
...,
[204, 198, 186],
[204, 198, 186],
[204, 198, 186]]]], dtype=uint8)
c. Romance
driver = webdriver.Chrome(executable_path='/Users/ohhyunkwon/Documents/2020 study/etc/chromedriver')
driver.get(login_url)
time.sleep(1)
driver.execute_script("document.getElementsByName('id')[0].value=\'" + id_key + "\'")
driver.execute_script("document.getElementsByName('pw')[0].value=\'" + pw_key + "\'")
driver.find_element_by_xpath('//*[@id="frmNIDLogin"]/fieldset/input').click()
time.sleep(1)
driver.find_element_by_xpath('//*[@id="new.dontsave"]').click()
romance_df = pd.DataFrame(columns=['code', 'title_kor', 'title_eng', 'year',
'rating', 'rank', 'link', 'genre'])
romance_df
code | title_kor | title_eng | year | rating | rank | link | genre |
---|---|---|---|---|---|---|---|
poster_romance = np.memmap('images/poster_romance', dtype=np.uint8, mode='w+', shape=(200, 256, 256, 3))
# Crawl 4 ranking pages
for l in range(4):
    print('page ' + str(l + 1) + ' crawling...')
    # Open one ranking page
    driver.get(rank_url.format(genre=5, page=l + 1))
    html = driver.page_source
    time.sleep(1)
    soup = BeautifulSoup(html, 'lxml')
    for k in tqdm(range(len(soup.find('table', class_='list_ranking').find_all('div', class_='tit5')))):
        # link
        link = basic_url + soup.find('table', class_='list_ranking').find_all('div', class_='tit5')[k].a.get('href')
        # rating
        rating = soup.find_all('td', class_='point')[k].text
        # code
        code_start = re.search('code=', link).span()[1]
        code = link[code_start:]
        # Open the movie's detail page
        driver.get(link)
        link_html = driver.page_source
        time.sleep(1)
        link_soup = BeautifulSoup(link_html, 'lxml')
        # Movie info block
        movie_info = link_soup.find('div', class_='mv_info')
        # title_kor
        title_kor = movie_info.h3.a.text
        # title_eng
        title_eng = movie_info.strong.text.split(',')[0].strip()
        # year
        year = movie_info.strong.text.split(',')[-1].strip()
        # rank (viewing grade)
        rank = movie_info.find('p').find_all('span')[4].find_all('a')[0].text
        # genre
        genre = []
        for i in range(len(movie_info.find('p', class_='info_spec').span.text.split(','))):
            genre.append(movie_info.find('p', class_='info_spec').span.text.split(',')[i].strip())
        # poster: drop the query string from the thumbnail URL
        poster_thumbnail = link_soup.find('div', class_='poster').a.img.get('src')
        poster_end = re.search(r'\?', poster_thumbnail).span()[0]
        poster_link = poster_thumbnail[:poster_end]
        poster_index = l * 50 + k
        poster_romance[poster_index, :, :, :] = np.asarray(Image.open(request.urlopen(poster_link)).resize((256, 256)))[:, :, :3]
        # Append the record to the DataFrame
        romance_df = romance_df.append({'code': code, 'title_kor': title_kor, 'title_eng': title_eng,
                                        'year': year, 'rating': rating, 'rank': rank, 'link': link,
                                        'genre': genre}, ignore_index=True)
print('Completed!')
page 1 crawling...
100%|██████████| 50/50 [01:54<00:00, 2.30s/it]
page 2 crawling...
100%|██████████| 50/50 [01:42<00:00, 2.05s/it]
page 3 crawling...
100%|██████████| 50/50 [01:35<00:00, 1.92s/it]
page 4 crawling...
100%|██████████| 50/50 [01:40<00:00, 2.00s/it]
Completed!
driver.close()
poster_romance.flush()
romance_df
 | code | title_kor | title_eng | year | rating | rank | link | genre |
---|---|---|---|---|---|---|---|---|
0 | 10102 | 사운드 오브 뮤직 | The Sound Of Music | 1965 | 9.40 | 전체 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [멜로/로맨스, 뮤지컬, 드라마] |
1 | 35939 | 클래식 | The Classic | 2003 | 9.39 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [멜로/로맨스, 드라마] |
2 | 18847 | 타이타닉 | Titanic | 1997 | 9.38 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [멜로/로맨스, 드라마] |
3 | 39636 | 지금, 만나러 갑니다 | いま、会いにゆきます | 2004 | 9.34 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [멜로/로맨스, 드라마, 판타지] |
4 | 182348 | 로망 | Romang | 2019 | 9.30 | 전체 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [멜로/로맨스] |
... | ... | ... | ... | ... | ... | ... | ... | ... |
195 | 40163 | 게스 후? | Guess Who | 2005 | 7.42 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [멜로/로맨스, 코미디] |
196 | 85842 | 네버엔딩 스토리 | Never Ending Story | 2012 | 7.41 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [멜로/로맨스, 코미디] |
197 | 69977 | 참을 수 없는. | 2010 | 2010 | 7.39 | 청소년 관람불가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [멜로/로맨스, 드라마] |
198 | 69952 | 호우시절 | 好雨時節 | 2009 | 7.38 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [멜로/로맨스] |
199 | 64195 | 기다리다 미쳐 | Crazy For Wait | 2007 | 7.37 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [멜로/로맨스] |
200 rows × 8 columns
romance_df.to_csv('data/romance_df.csv', index=False)
np.save('images/poster_romance.npy', poster_romance)
poster_r = np.load('images/poster_romance.npy')
print(poster_r.shape)
poster_r
(200, 256, 256, 3)
array([[[[ 20, 66, 160],
[ 20, 66, 160],
[ 20, 66, 160],
...,
[ 19, 86, 173],
[ 19, 86, 173],
[ 19, 86, 173]],
[[ 20, 66, 160],
[ 20, 66, 160],
[ 21, 67, 161],
...,
[ 20, 87, 174],
[ 19, 86, 173],
[ 20, 87, 174]],
[[ 21, 67, 161],
[ 20, 67, 161],
[ 21, 67, 161],
...,
[ 19, 86, 173],
[ 19, 86, 173],
[ 19, 86, 173]],
...,
[[ 38, 70, 33],
[ 38, 71, 34],
[ 41, 74, 35],
...,
[ 68, 97, 40],
[ 63, 94, 40],
[ 45, 80, 37]],
[[ 43, 79, 37],
[ 44, 80, 38],
[ 41, 79, 35],
...,
[ 42, 77, 35],
[ 42, 79, 36],
[ 46, 83, 39]],
[[ 41, 78, 35],
[ 41, 78, 35],
[ 40, 77, 34],
...,
[ 41, 78, 35],
[ 42, 79, 36],
[ 42, 79, 36]]],
[[[255, 255, 255],
[255, 255, 255],
[255, 255, 255],
...,
[255, 255, 255],
[255, 255, 255],
[255, 255, 255]],
[[255, 255, 255],
[255, 255, 255],
[255, 255, 255],
...,
[255, 255, 255],
[255, 255, 255],
[255, 255, 255]],
[[255, 255, 255],
[255, 255, 255],
[255, 255, 255],
...,
[255, 255, 255],
[255, 255, 255],
[255, 255, 255]],
...,
[[ 16, 12, 13],
[ 16, 12, 13],
[ 18, 14, 14],
...,
[ 5, 1, 2],
[ 3, 1, 2],
[ 18, 17, 17]],
[[ 19, 15, 16],
[ 19, 15, 16],
[ 20, 16, 15],
...,
[ 10, 6, 7],
[ 11, 7, 8],
[ 26, 22, 23]],
[[ 67, 63, 64],
[ 69, 65, 66],
[ 69, 65, 64],
...,
[ 69, 65, 66],
[ 71, 67, 68],
[ 79, 75, 76]]],
[[[106, 75, 74],
[103, 74, 70],
[106, 73, 73],
...,
[134, 96, 100],
[135, 96, 100],
[137, 96, 100]],
[[112, 82, 85],
[109, 79, 83],
[111, 79, 84],
...,
[144, 106, 116],
[143, 106, 117],
[144, 108, 119]],
[[113, 81, 84],
[111, 80, 83],
[112, 80, 82],
...,
[144, 105, 116],
[142, 105, 114],
[145, 106, 115]],
...,
[[ 3, 3, 3],
[ 0, 0, 0],
[ 0, 0, 0],
...,
[ 0, 0, 0],
[ 0, 0, 0],
[ 3, 3, 3]],
[[ 3, 3, 3],
[ 0, 0, 0],
[ 0, 0, 0],
...,
[ 0, 0, 0],
[ 0, 0, 0],
[ 3, 3, 3]],
[[ 5, 5, 5],
[ 2, 2, 2],
[ 2, 2, 2],
...,
[ 2, 2, 2],
[ 2, 2, 2],
[ 5, 5, 5]]],
...,
[[[ 34, 23, 19],
[ 34, 23, 19],
[ 34, 23, 19],
...,
[ 38, 27, 23],
[ 37, 26, 22],
[ 37, 26, 22]],
[[ 34, 23, 19],
[ 34, 23, 19],
[ 34, 23, 19],
...,
[ 37, 26, 22],
[ 36, 25, 20],
[ 35, 24, 19]],
[[ 34, 23, 19],
[ 34, 23, 19],
[ 34, 23, 19],
...,
[ 39, 28, 24],
[ 39, 28, 24],
[ 37, 26, 22]],
...,
[[112, 74, 64],
[140, 99, 85],
[149, 106, 91],
...,
[ 2, 0, 2],
[ 3, 0, 1],
[ 3, 1, 2]],
[[ 99, 64, 55],
[123, 85, 73],
[140, 99, 86],
...,
[ 4, 1, 2],
[ 4, 1, 2],
[ 6, 2, 3]],
[[ 81, 50, 42],
[107, 71, 61],
[122, 83, 72],
...,
[ 6, 2, 3],
[ 7, 3, 4],
[ 8, 3, 4]]],
[[[169, 202, 177],
[171, 204, 174],
[165, 203, 165],
...,
[166, 146, 148],
[166, 146, 148],
[166, 147, 149]],
[[163, 199, 173],
[161, 198, 171],
[155, 196, 163],
...,
[168, 148, 150],
[165, 146, 148],
[164, 145, 147]],
[[150, 194, 160],
[150, 192, 159],
[151, 193, 158],
...,
[168, 149, 151],
[165, 146, 148],
[162, 143, 145]],
...,
[[146, 131, 110],
[146, 131, 111],
[141, 128, 108],
...,
[ 55, 81, 54],
[ 53, 80, 53],
[ 53, 78, 52]],
[[139, 126, 107],
[139, 126, 107],
[138, 126, 107],
...,
[ 49, 76, 50],
[ 49, 76, 49],
[ 50, 73, 47]],
[[140, 129, 107],
[134, 122, 103],
[139, 126, 109],
...,
[ 47, 74, 47],
[ 49, 76, 48],
[ 52, 77, 49]]],
[[[ 82, 110, 59],
[ 80, 108, 57],
[ 80, 108, 57],
...,
[ 80, 108, 57],
[ 80, 108, 57],
[ 82, 110, 59]],
[[ 81, 109, 58],
[ 79, 107, 56],
[ 79, 107, 56],
...,
[ 79, 107, 56],
[ 79, 107, 56],
[ 81, 109, 58]],
[[ 82, 110, 59],
[ 79, 107, 56],
[ 79, 107, 56],
...,
[ 79, 107, 56],
[ 79, 107, 56],
[ 81, 109, 58]],
...,
[[167, 51, 50],
[188, 52, 53],
[194, 51, 54],
...,
[184, 42, 62],
[176, 40, 53],
[178, 98, 93]],
[[ 69, 17, 13],
[ 92, 25, 22],
[ 99, 27, 25],
...,
[189, 41, 64],
[180, 40, 57],
[174, 58, 63]],
[[ 64, 17, 13],
[ 81, 21, 18],
[ 99, 30, 27],
...,
[192, 56, 74],
[191, 41, 66],
[181, 44, 56]]]], dtype=uint8)
d. Comedy
driver = webdriver.Chrome(executable_path='/Users/ohhyunkwon/Documents/2020 study/etc/chromedriver')
driver.get(login_url)
time.sleep(1)
driver.execute_script("document.getElementsByName('id')[0].value=\'" + id_key + "\'")
driver.execute_script("document.getElementsByName('pw')[0].value=\'" + pw_key + "\'")
driver.find_element_by_xpath('//*[@id="frmNIDLogin"]/fieldset/input').click()
time.sleep(1)
driver.find_element_by_xpath('//*[@id="new.dontsave"]').click()
comedy_df = pd.DataFrame(columns=['code', 'title_kor', 'title_eng', 'year',
'rating', 'rank', 'link', 'genre'])
comedy_df
code | title_kor | title_eng | year | rating | rank | link | genre |
---|---|---|---|---|---|---|---|
poster_comedy = np.memmap('images/poster_comedy', dtype=np.uint8, mode='w+', shape=(200, 256, 256, 3))
# Crawl 4 ranking pages
for l in range(4):
    print('page ' + str(l + 1) + ' crawling...')
    # Open one ranking page
    driver.get(rank_url.format(genre=11, page=l + 1))
    html = driver.page_source
    time.sleep(1)
    soup = BeautifulSoup(html, 'lxml')
    for k in tqdm(range(len(soup.find('table', class_='list_ranking').find_all('div', class_='tit5')))):
        # link
        link = basic_url + soup.find('table', class_='list_ranking').find_all('div', class_='tit5')[k].a.get('href')
        # rating
        rating = soup.find_all('td', class_='point')[k].text
        # code
        code_start = re.search('code=', link).span()[1]
        code = link[code_start:]
        # Open the movie's detail page
        driver.get(link)
        link_html = driver.page_source
        time.sleep(1)
        link_soup = BeautifulSoup(link_html, 'lxml')
        # Movie info block
        movie_info = link_soup.find('div', class_='mv_info')
        # title_kor
        title_kor = movie_info.h3.a.text
        # title_eng
        title_eng = movie_info.strong.text.split(',')[0].strip()
        # year
        year = movie_info.strong.text.split(',')[-1].strip()
        # rank (viewing grade)
        rank = movie_info.find('p').find_all('span')[4].find_all('a')[0].text
        # genre
        genre = []
        for i in range(len(movie_info.find('p', class_='info_spec').span.text.split(','))):
            genre.append(movie_info.find('p', class_='info_spec').span.text.split(',')[i].strip())
        # poster: drop the query string from the thumbnail URL
        poster_thumbnail = link_soup.find('div', class_='poster').a.img.get('src')
        poster_end = re.search(r'\?', poster_thumbnail).span()[0]
        poster_link = poster_thumbnail[:poster_end]
        poster_index = l * 50 + k
        # A poster served as a GIF decodes to a 2-D array, so rebuild it with 3 channels
        if np.asarray(Image.open(request.urlopen(poster_link))).ndim == 2:
            im = Image.open(request.urlopen(poster_link)).resize((256, 256))
            background = Image.new('RGB', im.size, (255, 255, 255))
            background.paste(im)
            poster_array = np.asarray(background)
        else:
            poster_array = np.asarray(Image.open(request.urlopen(poster_link)).resize((256, 256)))[:, :, :3]
        poster_comedy[poster_index, :, :, :] = poster_array
        # Append the record to the DataFrame
        comedy_df = comedy_df.append({'code': code, 'title_kor': title_kor, 'title_eng': title_eng,
                                      'year': year, 'rating': rating, 'rank': rank, 'link': link,
                                      'genre': genre}, ignore_index=True)
print('Completed!')
page 1 crawling...
100%|██████████| 50/50 [02:07<00:00, 2.56s/it]
page 2 crawling...
100%|██████████| 50/50 [01:47<00:00, 2.16s/it]
page 3 crawling...
100%|██████████| 50/50 [02:10<00:00, 2.61s/it]
page 4 crawling...
100%|██████████| 50/50 [02:53<00:00, 3.47s/it]
Completed!
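The comedy crawl has to special-case posters that decode to a 2-D array (for example GIFs), and it may download the same image more than once while doing so. A minimal sketch, assuming Pillow's `Image.convert('RGB')` is an acceptable substitute for pasting onto a white background, normalizes any already-opened image to a `(256, 256, 3)` `uint8` array in one step. The helper name `to_rgb_array` is hypothetical, and results should be comparable to, but are not guaranteed identical to, the branch used in the notebook.

```python
import numpy as np
from PIL import Image


def to_rgb_array(im, size=(256, 256)):
    """Convert a PIL image of any mode to an RGB uint8 array of the given size."""
    # convert('RGB') expands palette ('P') and grayscale ('L') images to
    # three channels and drops the alpha channel of 'RGBA' images.
    rgb = im.convert('RGB').resize(size)
    return np.asarray(rgb, dtype=np.uint8)


if __name__ == '__main__':
    # Synthetic images stand in for downloaded posters.
    for mode in ('P', 'L', 'RGBA', 'RGB'):
        arr = to_rgb_array(Image.new(mode, (10, 20)))
        print(mode, arr.shape, arr.dtype)
```

In the crawl loop this could be called once per poster, for example as `to_rgb_array(Image.open(request.urlopen(poster_link)))`, so each image is fetched a single time.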
driver.close()
poster_comedy.flush()
comedy_df
 | code | title_kor | title_eng | year | rating | rank | link | genre |
---|---|---|---|---|---|---|---|---|
0 | 18543 | 서유기 2 - 선리기연 | 西遊記 完結篇 之 仙履奇緣: A Chinese Odyssey Part Two - C... | 1994 | 9.35 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [코미디, 액션, 모험, 판타지, 멜로/로맨스] |
1 | 73372 | 세 얼간이 | 3 Idiots | 2009 | 9.35 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [코미디] |
2 | 19099 | 트루먼 쇼 | The Truman Show | 1998 | 9.34 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [코미디, 드라마, SF] |
3 | 87566 | 언터처블: 1%의 우정 | Intouchables | 2011 | 9.33 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [코미디, 드라마] |
4 | 16210 | 미세스 다웃파이어 | Mrs. Doubtfire | 1993 | 9.33 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [코미디, 가족, 드라마] |
... | ... | ... | ... | ... | ... | ... | ... | ... |
195 | 39397 | 윔블던 | Wimbledon | 2004 | 8.10 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [코미디, 멜로/로맨스] |
196 | 91073 | 박수건달 | Man on the Edge | 2012 | 8.10 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [코미디] |
197 | 98738 | 프란시스 하 | Frances Ha | 2012 | 8.10 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [코미디, 멜로/로맨스] |
198 | 62219 | 색즉시공 시즌 2 | Sex Is Zero 2 | 2007 | 8.09 | 청소년 관람불가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [코미디] |
199 | 74954 | 두 번의 결혼식과 한 번의 장례식 | Two Weddings And A Funeral | 2012 | 8.09 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [코미디, 멜로/로맨스] |
200 rows × 8 columns
comedy_df.to_csv('data/comedy_df.csv', index=False)
np.save('images/poster_comedy.npy', poster_comedy)
poster_c = np.load('images/poster_comedy.npy')
print(poster_c.shape)
poster_c
(200, 256, 256, 3)
array([[[[ 8, 8, 8],
[ 8, 8, 8],
[ 9, 9, 9],
...,
[ 8, 8, 8],
[ 8, 8, 8],
[ 8, 8, 8]],
[[ 9, 9, 9],
[ 9, 9, 9],
[ 9, 9, 9],
...,
[ 9, 9, 9],
[ 8, 8, 8],
[ 8, 8, 8]],
[[ 9, 9, 9],
[ 9, 9, 9],
[ 10, 10, 10],
...,
[ 9, 9, 9],
[ 9, 9, 9],
[ 9, 9, 9]],
...,
[[ 8, 7, 5],
[ 11, 9, 7],
[ 13, 10, 8],
...,
[ 31, 28, 24],
[ 30, 27, 23],
[ 30, 27, 23]],
[[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0],
...,
[ 44, 37, 31],
[ 47, 40, 33],
[ 48, 41, 34]],
[[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0],
...,
[ 2, 1, 1],
[ 5, 4, 4],
[ 9, 8, 7]]],
[[[255, 255, 255],
[255, 255, 255],
[255, 255, 255],
...,
[255, 255, 255],
[255, 255, 255],
[255, 255, 255]],
[[255, 255, 255],
[255, 255, 255],
[255, 255, 255],
...,
[255, 255, 255],
[255, 255, 255],
[255, 255, 255]],
[[255, 255, 255],
[255, 255, 255],
[255, 255, 255],
...,
[255, 255, 255],
[255, 255, 255],
[255, 255, 255]],
...,
[[252, 252, 252],
[253, 253, 253],
[253, 253, 253],
...,
[252, 254, 253],
[252, 254, 253],
[252, 254, 253]],
[[251, 254, 253],
[253, 253, 253],
[252, 253, 252],
...,
[251, 253, 252],
[251, 253, 252],
[251, 253, 252]],
[[251, 253, 252],
[251, 252, 252],
[251, 253, 252],
...,
[251, 253, 252],
[251, 253, 252],
[252, 254, 253]]],
[[[244, 248, 252],
[244, 248, 252],
[244, 248, 252],
...,
[237, 246, 251],
[237, 246, 251],
[237, 246, 251]],
[[244, 248, 252],
[243, 248, 252],
[244, 248, 253],
...,
[237, 246, 251],
[237, 246, 251],
[237, 246, 251]],
[[244, 248, 252],
[243, 248, 252],
[244, 248, 252],
...,
[237, 246, 251],
[237, 246, 251],
[237, 246, 251]],
...,
[[111, 87, 98],
[ 77, 52, 65],
[ 98, 73, 84],
...,
[ 52, 24, 41],
[ 46, 20, 36],
[ 51, 24, 40]],
[[103, 80, 90],
[107, 81, 92],
[105, 78, 90],
...,
[ 54, 27, 43],
[ 49, 24, 39],
[ 54, 28, 44]],
[[ 73, 50, 61],
[ 96, 71, 82],
[ 90, 65, 76],
...,
[ 51, 26, 40],
[ 62, 36, 52],
[ 61, 36, 51]]],
...,
[[[119, 117, 107],
[ 20, 21, 16],
[ 21, 22, 17],
...,
[100, 100, 91],
[102, 102, 93],
[ 70, 71, 64]],
[[111, 110, 102],
[ 19, 19, 16],
[ 21, 22, 18],
...,
[ 73, 74, 68],
[ 76, 77, 70],
[ 36, 37, 31]],
[[112, 110, 103],
[ 19, 20, 17],
[ 21, 22, 17],
...,
[ 44, 45, 40],
[ 32, 33, 28],
[ 28, 28, 23]],
...,
[[ 28, 28, 23],
[ 28, 27, 22],
[ 24, 23, 19],
...,
[ 27, 28, 23],
[ 30, 31, 26],
[ 30, 31, 27]],
[[ 14, 14, 10],
[ 16, 16, 11],
[ 18, 17, 14],
...,
[ 26, 27, 22],
[ 32, 33, 28],
[ 34, 35, 30]],
[[ 64, 63, 55],
[ 43, 42, 36],
[ 18, 17, 12],
...,
[ 34, 35, 30],
[ 37, 38, 33],
[ 41, 42, 37]]],
[[[ 13, 0, 0],
[ 20, 0, 1],
[ 22, 2, 3],
...,
[ 26, 10, 9],
[ 70, 37, 33],
[ 25, 9, 9]],
[[ 9, 1, 1],
[ 10, 0, 0],
[ 17, 0, 1],
...,
[ 28, 12, 12],
[ 30, 13, 13],
[ 10, 0, 1]],
[[ 32, 14, 12],
[ 19, 6, 6],
[ 7, 0, 1],
...,
[ 29, 12, 11],
[ 12, 1, 1],
[ 13, 1, 1]],
...,
[[ 4, 0, 0],
[ 5, 0, 0],
[ 6, 0, 0],
...,
[ 2, 0, 1],
[ 2, 0, 1],
[ 2, 0, 1]],
[[ 4, 0, 0],
[ 5, 0, 0],
[ 6, 0, 0],
...,
[ 2, 0, 1],
[ 2, 0, 1],
[ 2, 0, 1]],
[[ 4, 0, 0],
[ 4, 0, 0],
[ 4, 0, 0],
...,
[ 1, 0, 0],
[ 2, 0, 1],
[ 2, 0, 1]]],
[[[227, 223, 215],
[225, 220, 213],
[223, 218, 212],
...,
[ 25, 11, 7],
[ 57, 41, 34],
[103, 86, 78]],
[[225, 220, 214],
[222, 217, 211],
[219, 214, 210],
...,
[ 17, 4, 2],
[ 27, 12, 7],
[ 54, 36, 27]],
[[218, 213, 208],
[220, 215, 210],
[218, 213, 209],
...,
[ 16, 3, 3],
[ 15, 2, 0],
[ 26, 8, 3]],
...,
[[ 61, 55, 54],
[ 62, 56, 57],
[ 63, 58, 59],
...,
[137, 117, 110],
[135, 113, 107],
[133, 113, 104]],
[[ 63, 57, 58],
[ 63, 57, 59],
[ 63, 57, 59],
...,
[136, 116, 109],
[137, 115, 108],
[137, 114, 106]],
[[ 61, 58, 60],
[ 61, 57, 58],
[ 62, 57, 57],
...,
[135, 113, 106],
[136, 113, 107],
[140, 117, 109]]]], dtype=uint8)
e. Animation
driver = webdriver.Chrome(executable_path='/Users/ohhyunkwon/Documents/2020 study/etc/chromedriver')
driver.get(login_url)
time.sleep(1)
driver.execute_script("document.getElementsByName('id')[0].value=\'" + id_key + "\'")
driver.execute_script("document.getElementsByName('pw')[0].value=\'" + pw_key + "\'")
driver.find_element_by_xpath('//*[@id="frmNIDLogin"]/fieldset/input').click()
time.sleep(1)
driver.find_element_by_xpath('//*[@id="new.dontsave"]').click()
animation_df = pd.DataFrame(columns=['code', 'title_kor', 'title_eng', 'year',
'rating', 'rank', 'link', 'genre'])
animation_df
code | title_kor | title_eng | year | rating | rank | link | genre |
---|
poster_animation = np.memmap('images/poster_animation', dtype=np.uint8, mode='w+', shape=(200, 256, 256, 3))
# Crawl 4 ranking pages (50 movies per page)
for l in range(4):
    print('page ' + str(l + 1) + ' crawling...')
    # Open a single ranking page and wait briefly before reading its source
    driver.get(rank_url.format(genre=15, page=l + 1))
    time.sleep(1)
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    titles = soup.find('table', class_='list_ranking').find_all('div', class_='tit5')
    points = soup.find_all('td', class_='point')
    for k in tqdm(range(len(titles))):
        # link
        link = basic_url + titles[k].a.get('href')
        # rating
        rating = points[k].text
        # code
        code_start = re.search('code=', link).span()[1]
        code = link[code_start:]
        # Open the detail page
        driver.get(link)
        time.sleep(1)
        link_html = driver.page_source
        link_soup = BeautifulSoup(link_html, 'lxml')
        # Movie info
        movie_info = link_soup.find('div', class_='mv_info')
        # title_kor
        title_kor = movie_info.h3.a.text
        # title_eng
        title_eng = movie_info.strong.text.split(',')[0].strip()
        # year
        year = movie_info.strong.text.split(',')[-1].strip()
        # rank (age rating); NaN when the page does not list one
        try:
            rank = movie_info.find('p').find_all('span')[4].find_all('a')[0].text
        except (AttributeError, IndexError):
            rank = np.nan
        # genre
        genre = [g.strip() for g in movie_info.find('p', class_='info_spec').span.text.split(',')]
        # poster
        poster_thumbnail = link_soup.find('div', class_='poster').a.img.get('src')
        poster_end = re.search(r'\?', poster_thumbnail).span()[0]
        poster_link = poster_thumbnail[:poster_end]
        poster_index = l * 50 + k
        # Download the poster once
        im = Image.open(request.urlopen(poster_link))
        # Posters received as GIF load as 2-D arrays, so rebuild them with 3 channels
        if np.asarray(im).ndim == 2:
            im = im.resize((256, 256))
            background = Image.new('RGB', im.size, (255, 255, 255))
            background.paste(im)
            poster_array = np.asarray(background)
        else:
            poster_array = np.asarray(im.resize((256, 256)))[:, :, :3]
        poster_animation[poster_index, :, :, :] = poster_array
        # Append the record to the DataFrame
        animation_df = animation_df.append({'code': code, 'title_kor': title_kor, 'title_eng': title_eng,
                                            'year': year, 'rating': rating, 'rank': rank, 'link': link,
                                            'genre': genre}, ignore_index=True)
print('Completed!')
page 1 crawling...
100%|██████████| 50/50 [03:46<00:00, 4.53s/it]
page 2 crawling...
100%|██████████| 50/50 [02:27<00:00, 2.95s/it]
page 3 crawling...
100%|██████████| 50/50 [03:27<00:00, 4.16s/it]
page 4 crawling...
100%|██████████| 50/50 [03:09<00:00, 3.78s/it]
Completed!
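The loop above takes the movie code by slicing the link after `code=` and trims the poster URL at the first `?`. The same two steps can be sketched with the standard library's `urllib.parse`, assuming detail links carry a `code` query parameter as the slicing implies. The helper names and both example URLs are hypothetical and are not part of the notebook.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs


def extract_code(link):
    """Return the value of the `code` query parameter, or None if it is absent."""
    values = parse_qs(urlsplit(link).query).get('code')
    return values[0] if values else None


def strip_query(url):
    """Drop the query string and fragment, keeping scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))


# Hypothetical URLs, used only for illustration
detail_link = 'https://movie.naver.com/movie/bi/mi/basic.nhn?code=69105'
thumbnail = 'https://example.com/posters/69105.jpg?type=m203_290_2'

print(extract_code(detail_link))
print(strip_query(thumbnail))
```

Unlike slicing, `parse_qs` still returns only the code if further parameters follow it in the link.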
driver.close()
poster_animation.flush()
animation_df
code | title_kor | title_eng | year | rating | rank | link | genre | |
---|---|---|---|---|---|---|---|---|
0 | 69105 | 월-E | WALL-E | 2008 | 9.41 | 전체 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [애니메이션, SF, 가족, 코미디, 멜로/로맨스, 모험] |
1 | 32686 | 센과 치히로의 행방불명 | 千と千尋の神隠し | 2001 | 9.39 | 전체 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [애니메이션, 판타지, 모험, 가족] |
2 | 66463 | 토이 스토리 3 | Toy Story 3 | 2010 | 9.38 | 전체 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [애니메이션, 모험, 코미디, 가족, 판타지] |
3 | 130850 | 주토피아 | Zootopia | 2016 | 9.35 | 전체 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [애니메이션, 액션, 모험, 코미디, 가족] |
4 | 19303 | 모노노케 히메 | もののけ-: Mononoke Hime | 1997 | 9.35 | 전체 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [애니메이션, 모험, 액션] |
... | ... | ... | ... | ... | ... | ... | ... | ... |
195 | 34449 | 아이스 에이지 | Ice Age | 2002 | 8.58 | 전체 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [애니메이션, 모험, 가족, 판타지, 코미디] |
196 | 122581 | 극장판 도라에몽 진구의 아프리카 모험 : 베코와 5인의 탐험대 | 映画ドラえもん 新・のび太の大魔境 ~ペコと5人の探検隊~ | 2014 | 8.58 | 전체 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [애니메이션, 모험] |
197 | 144355 | 감바의 대모험 | GAMBA ガンバと仲間たち | 2015 | 8.57 | 전체 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [애니메이션] |
198 | 17230 | 포카혼타스 | Pocahontas | 1995 | 8.57 | 전체 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [애니메이션, 가족, 모험, 드라마, 멜로/로맨스] |
199 | 134980 | 마이펫의 이중생활 | The Secret Life of Pets | 2016 | 8.56 | 전체 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [애니메이션, 코미디, 가족] |
200 rows × 8 columns
animation_df.isnull().sum()
code 0
title_kor 0
title_eng 0
year 0
rating 0
rank 1
link 0
genre 0
dtype: int64
# Fill the one missing age rating with '전체 관람가' (suitable for all ages)
animation_df.loc[animation_df['rank'].isnull(), 'rank'] = '전체 관람가'
animation_df.isnull().sum()
code 0
title_kor 0
title_eng 0
year 0
rating 0
rank 0
link 0
genre 0
dtype: int64
animation_df.to_csv('data/animation_df.csv', index=False)
np.save('images/poster_animation.npy', poster_animation)
poster_a = np.load('images/poster_animation.npy')
print(poster_a.shape)
poster_a
(200, 256, 256, 3)
array([[[[255, 255, 255],
[255, 255, 255],
[255, 255, 255],
...,
[255, 255, 255],
[255, 255, 255],
[255, 255, 255]],
[[255, 255, 255],
[255, 255, 255],
[255, 255, 255],
...,
[255, 255, 255],
[255, 255, 255],
[255, 255, 255]],
[[255, 255, 255],
[255, 255, 255],
[255, 255, 255],
...,
[255, 255, 255],
[255, 255, 255],
[255, 255, 255]],
...,
[[255, 255, 255],
[255, 255, 255],
[255, 255, 255],
...,
[255, 255, 255],
[255, 255, 255],
[255, 255, 255]],
[[255, 255, 255],
[255, 255, 255],
[255, 255, 255],
...,
[254, 254, 254],
[255, 255, 255],
[255, 255, 255]],
[[255, 255, 255],
[255, 255, 255],
[255, 255, 255],
...,
[255, 255, 255],
[255, 255, 255],
[255, 255, 255]]],
[[[149, 21, 47],
[126, 13, 38],
[100, 14, 37],
...,
[ 65, 43, 43],
[ 67, 42, 44],
[150, 27, 43]],
[[190, 38, 71],
[150, 22, 42],
[115, 9, 37],
...,
[ 70, 43, 44],
[ 70, 42, 43],
[156, 29, 43]],
[[220, 76, 125],
[165, 26, 46],
[120, 10, 36],
...,
[ 59, 44, 44],
[ 77, 42, 44],
[170, 36, 47]],
...,
[[ 43, 55, 64],
[ 44, 55, 65],
[ 45, 55, 65],
...,
[ 86, 80, 81],
[ 86, 80, 81],
[ 86, 80, 81]],
[[ 44, 54, 64],
[ 43, 55, 65],
[ 44, 54, 64],
...,
[ 86, 80, 81],
[ 86, 80, 80],
[ 86, 80, 81]],
[[ 44, 54, 64],
[ 44, 54, 64],
[ 44, 54, 64],
...,
[ 85, 79, 80],
[ 85, 79, 79],
[ 85, 79, 81]]],
[[[ 0, 19, 56],
[ 0, 18, 56],
[ 0, 21, 58],
...,
[ 2, 36, 81],
[ 2, 50, 95],
[ 2, 65, 110]],
[[ 0, 19, 57],
[ 1, 19, 57],
[ 1, 23, 60],
...,
[ 1, 70, 115],
[ 1, 64, 109],
[ 2, 51, 88]],
[[ 1, 22, 59],
[ 1, 25, 62],
[ 2, 28, 67],
...,
[ 1, 48, 87],
[ 1, 42, 80],
[ 0, 31, 64]],
...,
[[ 19, 60, 108],
[ 18, 60, 109],
[ 21, 64, 114],
...,
[ 8, 44, 83],
[ 10, 46, 85],
[ 12, 48, 88]],
[[ 19, 59, 109],
[ 19, 61, 109],
[ 15, 56, 104],
...,
[ 7, 44, 85],
[ 3, 41, 79],
[ 7, 45, 84]],
[[ 11, 53, 97],
[ 14, 55, 100],
[ 13, 54, 99],
...,
[ 4, 43, 82],
[ 2, 40, 77],
[ 9, 44, 84]]],
...,
[[[255, 255, 245],
[255, 255, 245],
[255, 255, 245],
...,
[255, 255, 245],
[255, 255, 245],
[255, 255, 245]],
[[255, 255, 245],
[255, 255, 245],
[255, 255, 245],
...,
[255, 255, 245],
[255, 255, 245],
[255, 255, 245]],
[[255, 255, 245],
[255, 255, 245],
[255, 255, 245],
...,
[255, 255, 245],
[255, 255, 245],
[255, 255, 245]],
...,
[[123, 158, 174],
[127, 162, 178],
[119, 156, 173],
...,
[186, 199, 175],
[163, 177, 151],
[165, 173, 138]],
[[ 91, 140, 163],
[ 90, 139, 161],
[ 97, 148, 163],
...,
[221, 223, 207],
[159, 157, 103],
[170, 180, 138]],
[[112, 161, 149],
[121, 171, 145],
[121, 171, 146],
...,
[241, 246, 241],
[179, 182, 134],
[217, 218, 182]]],
[[[229, 134, 154],
[230, 135, 155],
[229, 134, 154],
...,
[238, 127, 170],
[237, 126, 168],
[239, 128, 170]],
[[231, 136, 155],
[232, 138, 156],
[231, 137, 155],
...,
[238, 127, 169],
[237, 126, 168],
[238, 127, 169]],
[[234, 140, 158],
[233, 139, 156],
[233, 139, 156],
...,
[238, 126, 169],
[238, 126, 169],
[239, 126, 169]],
...,
[[ 26, 18, 16],
[ 24, 16, 14],
[ 22, 14, 12],
...,
[254, 132, 145],
[253, 131, 144],
[253, 132, 145]],
[[ 24, 16, 14],
[ 23, 15, 13],
[ 22, 14, 12],
...,
[252, 129, 147],
[253, 131, 148],
[253, 131, 148]],
[[ 23, 15, 13],
[ 23, 15, 12],
[ 23, 15, 13],
...,
[253, 130, 150],
[254, 129, 150],
[252, 126, 148]]],
[[[ 95, 109, 162],
[ 92, 108, 168],
[ 92, 108, 167],
...,
[ 59, 86, 151],
[ 58, 86, 151],
[ 65, 88, 146]],
[[100, 116, 176],
[100, 119, 186],
[100, 118, 186],
...,
[ 64, 93, 167],
[ 63, 93, 168],
[ 68, 94, 160]],
[[100, 117, 175],
[ 99, 118, 186],
[100, 116, 185],
...,
[ 64, 92, 166],
[ 63, 92, 167],
[ 69, 94, 159]],
...,
[[120, 84, 82],
[121, 80, 77],
[117, 77, 76],
...,
[111, 75, 73],
[118, 79, 77],
[112, 80, 78]],
[[118, 85, 84],
[131, 91, 90],
[123, 84, 83],
...,
[123, 84, 82],
[131, 91, 89],
[115, 81, 79]],
[[122, 89, 88],
[117, 79, 79],
[117, 79, 77],
...,
[119, 81, 79],
[120, 82, 80],
[128, 93, 91]]]], dtype=uint8)
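Each genre section writes posters into a raw `np.memmap` buffer, flushes it, saves it as `.npy`, and reloads it to check the shape. A minimal sketch of that round trip follows, using a small array in a temporary directory. The function name, file names and shape are assumptions chosen for illustration.

```python
import os
import tempfile

import numpy as np


def memmap_roundtrip(shape=(4, 8, 8, 3)):
    """Write into a disk-backed buffer, save it as .npy, and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        raw_path = os.path.join(tmp, 'poster_demo')      # raw buffer, no header
        npy_path = os.path.join(tmp, 'poster_demo.npy')  # self-describing file

        # mode='w+' creates the file and fills it with zeros
        buf = np.memmap(raw_path, dtype=np.uint8, mode='w+', shape=shape)
        buf[0] = 255   # stand-in for one poster
        buf.flush()    # push pending writes to disk

        np.save(npy_path, buf)      # stores dtype and shape with the data
        loaded = np.load(npy_path)  # ordinary in-memory array

        del buf  # release the mapping before the directory is removed
    return loaded


demo = memmap_roundtrip()
print(demo.shape, demo.dtype)
```

The raw memmap file carries no dtype or shape information, which is one reason to keep the `.npy` copy: `np.load` can restore it without knowing the shape in advance.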
f. Crime
driver = webdriver.Chrome(executable_path='/Users/ohhyunkwon/Documents/2020 study/etc/chromedriver')
driver.get(login_url)
time.sleep(1)
driver.execute_script("document.getElementsByName('id')[0].value=\'" + id_key + "\'")
driver.execute_script("document.getElementsByName('pw')[0].value=\'" + pw_key + "\'")
driver.find_element_by_xpath('//*[@id="frmNIDLogin"]/fieldset/input').click()
time.sleep(1)
driver.find_element_by_xpath('//*[@id="new.dontsave"]').click()
crime_df = pd.DataFrame(columns=['code', 'title_kor', 'title_eng', 'year',
'rating', 'rank', 'link', 'genre'])
crime_df
code | title_kor | title_eng | year | rating | rank | link | genre |
---|
poster_crime = np.memmap('images/poster_crime', dtype=np.uint8, mode='w+', shape=(200, 256, 256, 3))
# Crawl 4 ranking pages (50 movies per page)
for l in range(4):
    print('page ' + str(l + 1) + ' crawling...')
    # Open a single ranking page and wait briefly before reading its source
    driver.get(rank_url.format(genre=16, page=l + 1))
    time.sleep(1)
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    titles = soup.find('table', class_='list_ranking').find_all('div', class_='tit5')
    points = soup.find_all('td', class_='point')
    for k in tqdm(range(len(titles))):
        # link
        link = basic_url + titles[k].a.get('href')
        # rating
        rating = points[k].text
        # code
        code_start = re.search('code=', link).span()[1]
        code = link[code_start:]
        # Open the detail page
        driver.get(link)
        time.sleep(1)
        link_html = driver.page_source
        link_soup = BeautifulSoup(link_html, 'lxml')
        # Movie info
        movie_info = link_soup.find('div', class_='mv_info')
        # title_kor
        title_kor = movie_info.h3.a.text
        # title_eng
        title_eng = movie_info.strong.text.split(',')[0].strip()
        # year
        year = movie_info.strong.text.split(',')[-1].strip()
        # rank (age rating); NaN when the page does not list one
        try:
            rank = movie_info.find('p').find_all('span')[4].find_all('a')[0].text
        except (AttributeError, IndexError):
            rank = np.nan
        # genre
        genre = [g.strip() for g in movie_info.find('p', class_='info_spec').span.text.split(',')]
        # poster
        poster_thumbnail = link_soup.find('div', class_='poster').a.img.get('src')
        poster_end = re.search(r'\?', poster_thumbnail).span()[0]
        poster_link = poster_thumbnail[:poster_end]
        poster_index = l * 50 + k
        # Download the poster once
        im = Image.open(request.urlopen(poster_link))
        # Posters received as GIF load as 2-D arrays, so rebuild them with 3 channels
        if np.asarray(im).ndim == 2:
            im = im.resize((256, 256))
            background = Image.new('RGB', im.size, (255, 255, 255))
            background.paste(im)
            poster_array = np.asarray(background)
        else:
            poster_array = np.asarray(im.resize((256, 256)))[:, :, :3]
        poster_crime[poster_index, :, :, :] = poster_array
        # Append the record to the DataFrame
        crime_df = crime_df.append({'code': code, 'title_kor': title_kor, 'title_eng': title_eng,
                                    'year': year, 'rating': rating, 'rank': rank, 'link': link,
                                    'genre': genre}, ignore_index=True)
print('Completed!')
page 1 crawling...
100%|██████████| 50/50 [03:36<00:00, 4.33s/it]
page 2 crawling...
100%|██████████| 50/50 [02:01<00:00, 2.44s/it]
page 3 crawling...
100%|██████████| 50/50 [01:57<00:00, 2.35s/it]
page 4 crawling...
100%|██████████| 50/50 [02:37<00:00, 3.15s/it]
Completed!
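The poster step in the loop above has two branches: palette or grayscale images (such as GIFs) load as 2-D arrays and are pasted onto a white RGB canvas, while other images are resized and cut down to their first three channels. One possible alternative, sketched below, is Pillow's `Image.convert('RGB')`, which handles both cases in one call. This is not the notebook's method: the helper name is hypothetical, and because it converts before resizing, pixel values may differ slightly from the loop's output.

```python
import numpy as np
from PIL import Image


def poster_to_array(im, size=(256, 256)):
    """Convert any Pillow image to a (height, width, 3) uint8 array."""
    rgb = im.convert('RGB')   # palette, grayscale and RGBA all become 3-channel
    rgb = rgb.resize(size)    # size is (width, height)
    return np.asarray(rgb, dtype=np.uint8)


# Synthetic images standing in for downloaded posters
samples = {
    'palette (GIF-like)': Image.new('P', (40, 60)),
    'grayscale': Image.new('L', (40, 60), 128),
    'rgba': Image.new('RGBA', (40, 60), (10, 20, 30, 255)),
}
for name, im in samples.items():
    print(name, poster_to_array(im).shape)
```

Every result has shape `(256, 256, 3)`, so it can be assigned directly to one slot of a poster buffer such as `poster_crime[poster_index]`.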
driver.close()
poster_crime.flush()
crime_df
code | title_kor | title_eng | year | rating | rank | link | genre | |
---|---|---|---|---|---|---|---|---|
0 | 35901 | 살인의 추억 | Memories Of Murder | 2003 | 9.40 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [범죄, 미스터리, 스릴러, 코미디, 드라마] |
1 | 17170 | 레옹 | Leon | 1994 | 9.37 | 청소년 관람불가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [범죄, 액션, 드라마] |
2 | 29657 | 프리퀀시 | Frequency | 2000 | 9.32 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [범죄, 드라마, SF, 스릴러] |
3 | 51462 | 그랜 토리노 | Gran Torino | 2008 | 9.23 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [범죄, 드라마] |
4 | 10561 | 대부 3 | Mario Puzo's The Godfather Part III | 1990 | 9.21 | 청소년 관람불가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [범죄, 드라마] |
... | ... | ... | ... | ... | ... | ... | ... | ... |
195 | 43568 | 모노폴리 | Monopoly | 2006 | 6.42 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [범죄, 스릴러] |
196 | 69742 | 킬 위드 미 | Untraceable | 2008 | 6.36 | 청소년 관람불가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [범죄, 스릴러] |
197 | 154112 | 불한당: 나쁜 놈들의 세상 | The Merciless | 2016 | 6.35 | 청소년 관람불가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [범죄, 액션, 드라마] |
198 | 157297 | 마약왕 | THE DRUG KING | 2017 | 6.33 | 청소년 관람불가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [범죄, 드라마] |
199 | 65340 | 쏘우 4 | Saw IV | 2007 | 6.32 | 청소년 관람불가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [범죄, 스릴러, 공포] |
200 rows × 8 columns
crime_df.to_csv('data/crime_df.csv', index=False)
np.save('images/poster_crime.npy', poster_crime)
poster_cr = np.load('images/poster_crime.npy')
print(poster_cr.shape)
poster_cr
(200, 256, 256, 3)
array([[[[ 96, 115, 119],
[ 95, 114, 118],
[ 94, 114, 118],
...,
[ 67, 85, 89],
[ 67, 85, 89],
[ 62, 80, 84]],
[[ 95, 114, 118],
[ 98, 116, 121],
[ 99, 119, 122],
...,
[ 62, 80, 84],
[ 66, 84, 88],
[ 72, 90, 94]],
[[ 99, 118, 122],
[ 99, 118, 122],
[ 99, 118, 122],
...,
[ 73, 91, 95],
[ 65, 83, 87],
[ 66, 84, 88]],
...,
[[ 5, 7, 6],
[ 5, 7, 6],
[ 5, 7, 6],
...,
[ 14, 23, 23],
[ 12, 20, 20],
[ 21, 30, 30]],
[[ 5, 7, 6],
[ 5, 7, 6],
[ 5, 7, 6],
...,
[ 33, 42, 41],
[ 22, 31, 30],
[ 25, 34, 33]],
[[ 7, 9, 8],
[ 7, 9, 8],
[ 7, 9, 8],
...,
[ 41, 49, 51],
[ 44, 52, 53],
[ 37, 45, 47]]],
[[[ 19, 31, 58],
[ 62, 74, 90],
[120, 150, 182],
...,
[123, 155, 188],
[120, 145, 171],
[ 87, 101, 119]],
[[ 33, 41, 58],
[ 79, 90, 107],
[116, 151, 185],
...,
[114, 148, 179],
[106, 127, 147],
[ 80, 89, 106]],
[[ 39, 49, 64],
[ 86, 100, 122],
[116, 153, 188],
...,
[ 95, 118, 145],
[ 97, 113, 132],
[ 80, 91, 107]],
...,
[[128, 105, 76],
[127, 103, 73],
[131, 104, 75],
...,
[ 95, 87, 71],
[ 95, 88, 72],
[ 94, 86, 71]],
[[133, 104, 73],
[135, 104, 70],
[136, 107, 73],
...,
[ 97, 87, 70],
[ 94, 87, 71],
[ 94, 87, 69]],
[[137, 108, 73],
[138, 107, 78],
[140, 109, 78],
...,
[ 96, 86, 70],
[ 97, 88, 73],
[ 94, 86, 68]]],
[[[209, 151, 141],
[207, 148, 139],
[204, 143, 136],
...,
[ 87, 46, 55],
[ 98, 54, 44],
[ 90, 42, 46]],
[[223, 187, 186],
[235, 215, 204],
[232, 201, 187],
...,
[143, 101, 66],
[113, 75, 55],
[123, 70, 53]],
[[225, 192, 178],
[231, 209, 194],
[239, 222, 207],
...,
[143, 100, 70],
[ 90, 55, 44],
[144, 80, 58]],
...,
[[ 28, 15, 22],
[ 25, 12, 19],
[ 27, 14, 21],
...,
[ 58, 47, 43],
[ 60, 49, 45],
[ 64, 54, 50]],
[[ 30, 17, 24],
[ 26, 13, 20],
[ 32, 19, 26],
...,
[ 56, 46, 42],
[ 58, 47, 43],
[ 61, 49, 46]],
[[ 29, 16, 23],
[ 30, 17, 24],
[ 29, 16, 24],
...,
[ 48, 40, 37],
[ 51, 39, 37],
[ 66, 51, 50]]],
...,
[[[164, 127, 119],
[119, 96, 98],
[ 35, 37, 44],
...,
[210, 221, 228],
[209, 220, 225],
[207, 216, 229]],
[[148, 110, 103],
[120, 92, 91],
[ 45, 43, 47],
...,
[213, 223, 232],
[211, 221, 229],
[209, 219, 230]],
[[ 98, 70, 68],
[ 90, 67, 68],
[ 46, 41, 42],
...,
[218, 228, 237],
[214, 224, 233],
[213, 223, 232]],
...,
[[ 10, 18, 20],
[ 10, 18, 20],
[ 12, 20, 22],
...,
[ 18, 25, 35],
[ 18, 25, 35],
[ 18, 25, 35]],
[[ 10, 18, 20],
[ 11, 19, 21],
[ 12, 20, 22],
...,
[ 18, 25, 35],
[ 18, 25, 35],
[ 18, 25, 35]],
[[ 10, 18, 20],
[ 10, 18, 20],
[ 10, 18, 20],
...,
[ 18, 25, 35],
[ 18, 25, 35],
[ 18, 25, 35]]],
[[[ 84, 66, 41],
[ 92, 75, 49],
[104, 86, 54],
...,
[157, 76, 35],
[159, 77, 36],
[160, 82, 40]],
[[ 88, 68, 43],
[ 96, 76, 49],
[108, 88, 56],
...,
[154, 70, 33],
[157, 73, 35],
[161, 80, 38]],
[[ 88, 68, 43],
[ 97, 77, 48],
[112, 89, 58],
...,
[157, 72, 34],
[158, 71, 32],
[160, 74, 36]],
...,
[[ 15, 10, 7],
[ 15, 10, 7],
[ 14, 10, 7],
...,
[ 8, 4, 3],
[ 8, 4, 3],
[ 7, 3, 2]],
[[ 16, 11, 8],
[ 15, 10, 7],
[ 13, 9, 7],
...,
[ 8, 4, 3],
[ 8, 4, 3],
[ 8, 4, 3]],
[[ 15, 10, 7],
[ 14, 9, 7],
[ 14, 10, 8],
...,
[ 8, 4, 3],
[ 8, 4, 3],
[ 8, 4, 3]]],
[[[227, 222, 219],
[227, 222, 219],
[227, 222, 219],
...,
[188, 179, 171],
[192, 183, 174],
[191, 182, 173]],
[[227, 222, 219],
[227, 222, 219],
[227, 222, 219],
...,
[186, 178, 167],
[188, 180, 170],
[189, 180, 171]],
[[227, 222, 219],
[227, 222, 219],
[227, 222, 219],
...,
[181, 173, 162],
[183, 174, 164],
[185, 176, 167]],
...,
[[252, 250, 251],
[250, 248, 249],
[248, 246, 247],
...,
[ 2, 2, 2],
[ 2, 2, 2],
[ 3, 3, 3]],
[[253, 252, 252],
[252, 250, 251],
[250, 248, 249],
...,
[ 3, 3, 3],
[ 4, 4, 4],
[ 3, 3, 3]],
[[254, 254, 254],
[253, 252, 253],
[253, 251, 252],
...,
[ 4, 4, 4],
[ 6, 6, 6],
[ 5, 5, 5]]]], dtype=uint8)
g. Action
driver = webdriver.Chrome(executable_path='/Users/ohhyunkwon/Documents/2020 study/etc/chromedriver')
driver.get(login_url)
time.sleep(1)
driver.execute_script("document.getElementsByName('id')[0].value=\'" + id_key + "\'")
driver.execute_script("document.getElementsByName('pw')[0].value=\'" + pw_key + "\'")
driver.find_element_by_xpath('//*[@id="frmNIDLogin"]/fieldset/input').click()
time.sleep(1)
driver.find_element_by_xpath('//*[@id="new.dontsave"]').click()
action_df = pd.DataFrame(columns=['code', 'title_kor', 'title_eng', 'year',
'rating', 'rank', 'link', 'genre'])
action_df
code | title_kor | title_eng | year | rating | rank | link | genre |
---|
poster_action = np.memmap('images/poster_action', dtype=np.uint8, mode='w+', shape=(200, 256, 256, 3))
# Crawl 4 ranking pages (50 movies per page)
for l in range(4):
    print('page ' + str(l + 1) + ' crawling...')
    # Open a single ranking page and wait briefly before reading its source
    driver.get(rank_url.format(genre=19, page=l + 1))
    time.sleep(1)
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    titles = soup.find('table', class_='list_ranking').find_all('div', class_='tit5')
    points = soup.find_all('td', class_='point')
    for k in tqdm(range(len(titles))):
        # link
        link = basic_url + titles[k].a.get('href')
        # rating
        rating = points[k].text
        # code
        code_start = re.search('code=', link).span()[1]
        code = link[code_start:]
        # Open the detail page
        driver.get(link)
        time.sleep(1)
        link_html = driver.page_source
        link_soup = BeautifulSoup(link_html, 'lxml')
        # Movie info
        movie_info = link_soup.find('div', class_='mv_info')
        # title_kor
        title_kor = movie_info.h3.a.text
        # title_eng
        title_eng = movie_info.strong.text.split(',')[0].strip()
        # year
        year = movie_info.strong.text.split(',')[-1].strip()
        # rank (age rating); NaN when the page does not list one
        try:
            rank = movie_info.find('p').find_all('span')[4].find_all('a')[0].text
        except (AttributeError, IndexError):
            rank = np.nan
        # genre
        genre = [g.strip() for g in movie_info.find('p', class_='info_spec').span.text.split(',')]
        # poster
        poster_thumbnail = link_soup.find('div', class_='poster').a.img.get('src')
        poster_end = re.search(r'\?', poster_thumbnail).span()[0]
        poster_link = poster_thumbnail[:poster_end]
        poster_index = l * 50 + k
        # Download the poster once
        im = Image.open(request.urlopen(poster_link))
        # Posters received as GIF load as 2-D arrays, so rebuild them with 3 channels
        if np.asarray(im).ndim == 2:
            im = im.resize((256, 256))
            background = Image.new('RGB', im.size, (255, 255, 255))
            background.paste(im)
            poster_array = np.asarray(background)
        else:
            poster_array = np.asarray(im.resize((256, 256)))[:, :, :3]
        poster_action[poster_index, :, :, :] = poster_array
        # Append the record to the DataFrame
        action_df = action_df.append({'code': code, 'title_kor': title_kor, 'title_eng': title_eng,
                                      'year': year, 'rating': rating, 'rank': rank, 'link': link,
                                      'genre': genre}, ignore_index=True)
print('Completed!')
page 1 crawling...
100%|██████████| 50/50 [03:05<00:00, 3.71s/it]
page 2 crawling...
100%|██████████| 50/50 [03:25<00:00, 4.11s/it]
page 3 crawling...
100%|██████████| 50/50 [03:28<00:00, 4.17s/it]
page 4 crawling...
100%|██████████| 50/50 [03:23<00:00, 4.08s/it]
Completed!
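The Animation, Crime and Action sections run the same loop and differ only in the genre id passed to `rank_url` (15, 16 and 19) and in the target buffer and DataFrame. The parts that vary can be sketched as below: building the four ranking URLs per genre and mapping a (page, row) pair to a poster slot. `RANK_URL` repeats the notebook's `rank_url` template. The dictionary keys and helper names are assumptions, and no network access is involved.

```python
# Same template as the notebook's rank_url
RANK_URL = ('https://movie.naver.com/movie/sdb/rank/rmovie.nhn'
            '?sel=pnt&date=20200616&tg={genre}&page={page}')

# Genre ids used in sections e, f and g
GENRE_IDS = {'animation': 15, 'crime': 16, 'action': 19}
PER_PAGE = 50  # rows per ranking page, matching `l * 50 + k` in the loop


def ranking_urls(genre_id, pages=4):
    """Return the ranking-page URLs for one genre, pages 1..pages."""
    return [RANK_URL.format(genre=genre_id, page=p) for p in range(1, pages + 1)]


def poster_index(page_idx, row_idx, per_page=PER_PAGE):
    """Map a zero-based page and row to a slot in the poster buffer."""
    return page_idx * per_page + row_idx


for name, genre_id in GENRE_IDS.items():
    urls = ranking_urls(genre_id)
    print(name, len(urls), urls[0])
```

With 4 pages of 50 rows, `poster_index` covers slots 0 through 199, which matches the 200-row buffers allocated with `np.memmap`.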
driver.close()
poster_action.flush()
action_df
code | title_kor | title_eng | year | rating | rank | link | genre | |
---|---|---|---|---|---|---|---|---|
0 | 181710 | 포드 V 페라리 | FORD v FERRARI | 2019 | 9.49 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [액션, 드라마] |
1 | 29217 | 글래디에이터 | Gladiator | 2000 | 9.39 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [액션, 드라마] |
2 | 136900 | 어벤져스: 엔드게임 | Avengers: Endgame | 2019 | 9.38 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [액션, SF] |
3 | 92125 | 헌터 킬러 | Hunter Killer | 2018 | 9.37 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [액션, 스릴러] |
4 | 37886 | 클레멘타인 | Clementine | 2004 | 9.35 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [액션, 드라마] |
... | ... | ... | ... | ... | ... | ... | ... | ... |
195 | 41334 | 옹박 - 두번째 미션 | The Protector | 2005 | 8.28 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [액션, 범죄, 드라마, 스릴러] |
196 | 162249 | 램페이지 | RAMPAGE | 2018 | 8.28 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [액션, 모험] |
197 | 82473 | 캐리비안의 해적: 죽은 자는 말이 없다 | Pirates of the Caribbean: Dead Men Tell No Tales | 2017 | 8.28 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [액션, 모험, 코미디, 판타지] |
198 | 31606 | 킬러들의 수다 | Guns & Talks | 2001 | 8.28 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [액션, 드라마, 코미디] |
199 | 51082 | 7급 공무원 | 7th Grade Civil Servant | 2009 | 8.28 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [액션, 코미디] |
200 rows × 8 columns
action_df.to_csv('data/action_df.csv', index=False)
np.save('images/poster_action.npy', poster_action)
poster_ac = np.load('images/poster_action.npy')
print(poster_ac.shape)
poster_ac
(200, 256, 256, 3)
array([[[[171, 192, 215],
[171, 193, 215],
[171, 193, 215],
...,
[183, 179, 175],
[175, 172, 167],
[170, 167, 162]],
[[172, 194, 217],
[172, 193, 216],
[171, 192, 216],
...,
[180, 177, 172],
[182, 179, 174],
[185, 182, 177]],
[[167, 191, 213],
[169, 192, 215],
[168, 192, 215],
...,
[207, 203, 199],
[209, 205, 201],
[211, 207, 202]],
...,
[[ 65, 56, 47],
[ 65, 56, 47],
[ 67, 58, 49],
...,
[ 17, 18, 21],
[ 16, 17, 19],
[ 14, 15, 18]],
[[ 65, 56, 47],
[ 66, 57, 48],
[ 65, 56, 47],
...,
[ 17, 18, 21],
[ 16, 17, 20],
[ 15, 16, 19]],
[[ 63, 55, 46],
[ 64, 55, 46],
[ 64, 55, 46],
...,
[ 17, 18, 20],
[ 15, 16, 19],
[ 16, 17, 21]]],
[[[249, 255, 252],
[238, 242, 239],
[231, 232, 230],
...,
[228, 228, 228],
[240, 240, 240],
[255, 255, 255]],
[[255, 255, 255],
[145, 144, 142],
[ 55, 50, 49],
...,
[ 30, 30, 30],
[135, 135, 135],
[255, 255, 255]],
[[255, 255, 255],
[135, 129, 129],
[ 31, 21, 22],
...,
[ 0, 0, 0],
[119, 119, 119],
[255, 255, 255]],
...,
[[255, 255, 255],
[117, 117, 117],
[ 0, 0, 0],
...,
[ 0, 0, 0],
[117, 117, 117],
[255, 255, 255]],
[[255, 255, 255],
[134, 134, 134],
[ 29, 29, 29],
...,
[ 29, 29, 29],
[133, 133, 133],
[255, 255, 255]],
[[255, 255, 255],
[238, 238, 238],
[228, 228, 228],
...,
[227, 227, 227],
[239, 239, 239],
[255, 255, 255]]],
[[[ 6, 4, 19],
[ 5, 4, 12],
[ 4, 3, 11],
...,
[ 1, 1, 1],
[ 1, 1, 1],
[ 1, 1, 1]],
[[ 24, 18, 58],
[ 6, 4, 18],
[ 5, 4, 11],
...,
[ 1, 1, 1],
[ 1, 1, 1],
[ 1, 1, 1]],
[[ 38, 30, 89],
[ 15, 11, 42],
[ 5, 4, 14],
...,
[ 1, 1, 1],
[ 1, 1, 1],
[ 1, 1, 1]],
...,
[[ 1, 1, 2],
[ 1, 1, 2],
[ 1, 1, 1],
...,
[ 4, 3, 9],
[ 2, 1, 6],
[ 2, 2, 5]],
[[ 1, 1, 2],
[ 1, 1, 1],
[ 1, 1, 1],
...,
[ 4, 3, 9],
[ 2, 2, 7],
[ 2, 1, 5]],
[[ 2, 1, 4],
[ 1, 1, 1],
[ 1, 1, 1],
...,
[ 3, 2, 7],
[ 2, 1, 5],
[ 1, 1, 3]]],
...,
[[[107, 147, 155],
[106, 146, 153],
[107, 146, 154],
...,
[214, 222, 204],
[214, 223, 206],
[213, 222, 204]],
[[106, 146, 153],
[106, 147, 154],
[105, 147, 153],
...,
[215, 222, 204],
[214, 222, 206],
[216, 223, 204]],
[[106, 146, 154],
[107, 147, 154],
[104, 146, 152],
...,
[215, 222, 204],
[215, 222, 205],
[210, 220, 204]],
...,
[[ 16, 24, 34],
[ 16, 23, 32],
[ 16, 23, 32],
...,
[ 16, 21, 28],
[ 16, 20, 29],
[ 16, 20, 29]],
[[ 18, 23, 34],
[ 17, 23, 34],
[ 17, 23, 32],
...,
[ 17, 21, 30],
[ 14, 20, 30],
[ 15, 20, 30]],
[[ 18, 22, 33],
[ 18, 23, 35],
[ 17, 22, 33],
...,
[ 20, 22, 30],
[ 45, 29, 30],
[ 28, 22, 29]]],
[[[237, 254, 248],
[250, 250, 250],
[255, 249, 253],
...,
[252, 250, 250],
[254, 250, 250],
[254, 252, 254]],
[[242, 254, 249],
[252, 251, 251],
[244, 243, 245],
...,
[236, 244, 239],
[253, 253, 253],
[254, 252, 255]],
[[248, 253, 251],
[255, 253, 254],
[226, 232, 232],
...,
[163, 172, 164],
[238, 239, 241],
[254, 253, 255]],
...,
[[252, 252, 252],
[255, 255, 255],
[188, 188, 188],
...,
[113, 113, 113],
[220, 220, 220],
[255, 255, 255]],
[[251, 251, 251],
[251, 251, 251],
[249, 249, 249],
...,
[152, 152, 152],
[225, 225, 225],
[255, 255, 255]],
[[252, 252, 252],
[253, 253, 253],
[254, 254, 254],
...,
[252, 252, 252],
[252, 252, 252],
[255, 255, 255]]],
[[[255, 255, 255],
[255, 255, 255],
[255, 255, 255],
...,
[179, 47, 35],
[179, 47, 35],
[179, 47, 35]],
[[255, 255, 255],
[255, 255, 255],
[255, 255, 255],
...,
[180, 48, 33],
[179, 47, 33],
[179, 47, 33]],
[[255, 255, 255],
[255, 255, 255],
[255, 255, 255],
...,
[181, 49, 32],
[181, 49, 35],
[181, 49, 35]],
...,
[[ 96, 4, 8],
[ 58, 2, 1],
[ 15, 1, 0],
...,
[136, 27, 33],
[138, 27, 33],
[138, 27, 33]],
[[ 82, 4, 4],
[ 32, 1, 2],
[ 12, 2, 3],
...,
[136, 27, 32],
[138, 27, 33],
[138, 27, 33]],
[[ 52, 1, 3],
[ 10, 1, 2],
[ 38, 4, 6],
...,
[138, 27, 33],
[137, 26, 32],
[138, 27, 33]]]], dtype=uint8)
Total Dataframe
Data containing every variable except the posters
total_df = pd.DataFrame(columns=['code', 'title_kor', 'title_eng', 'year',
'rating', 'rank', 'link', 'genre'])
total_df
code | title_kor | title_eng | year | rating | rank | link | genre |
---|
# Stack the seven genre DataFrames in crawl order and renumber the index
total_df = pd.concat([drama_df, horror_df, romance_df, comedy_df,
                      animation_df, crime_df, action_df], ignore_index=True)
total_df
code | title_kor | title_eng | year | rating | rank | link | genre | |
---|---|---|---|---|---|---|---|---|
0 | 171539 | 그린 북 | Green Book | 2018 | 9.59 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마] |
1 | 174830 | 가버나움 | Capharnaum | 2018 | 9.58 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마] |
2 | 151196 | 원더 | Wonder | 2017 | 9.49 | 전체 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마] |
3 | 169240 | 아일라 | Ayla: The Daughter of War | 2017 | 9.48 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마, 전쟁] |
4 | 157243 | 당갈 | Dangal | 2016 | 9.47 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마, 액션] |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1395 | 41334 | 옹박 - 두번째 미션 | The Protector | 2005 | 8.28 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [액션, 범죄, 드라마, 스릴러] |
1396 | 162249 | 램페이지 | RAMPAGE | 2018 | 8.28 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [액션, 모험] |
1397 | 82473 | 캐리비안의 해적: 죽은 자는 말이 없다 | Pirates of the Caribbean: Dead Men Tell No Tales | 2017 | 8.28 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [액션, 모험, 코미디, 판타지] |
1398 | 31606 | 킬러들의 수다 | Guns & Talks | 2001 | 8.28 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [액션, 드라마, 코미디] |
1399 | 51082 | 7급 공무원 | 7th Grade Civil Servant | 2009 | 8.28 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [액션, 코미디] |
1400 rows × 8 columns
total_df.to_csv('data/total_df.csv', index=False)
Poster data
# Stack the seven poster arrays along the first axis, in the same genre order as total_df
poster_total = np.concatenate([poster_d, poster_h, poster_r, poster_c,
                               poster_a, poster_cr, poster_ac], axis=0)
print(poster_total.shape)
poster_total
(1400, 256, 256, 3)
array([[[[ 0, 100, 116],
[ 0, 100, 116],
[ 0, 100, 117],
...,
[ 1, 115, 141],
[ 0, 116, 141],
[ 1, 117, 142]],
[[ 0, 100, 116],
[ 0, 100, 115],
[ 0, 101, 118],
...,
[ 0, 115, 141],
[ 0, 116, 141],
[ 0, 116, 141]],
[[ 0, 101, 116],
[ 0, 101, 116],
[ 0, 101, 118],
...,
[ 1, 117, 142],
[ 1, 117, 142],
[ 1, 117, 142]],
...,
[[ 0, 106, 130],
[ 1, 107, 131],
[ 0, 108, 132],
...,
[ 0, 105, 127],
[ 0, 105, 127],
[ 0, 105, 127]],
[[ 0, 106, 130],
[ 0, 106, 130],
[ 1, 107, 131],
...,
[ 1, 104, 127],
[ 1, 104, 126],
[ 1, 104, 127]],
[[ 0, 106, 130],
[ 0, 106, 130],
[ 1, 107, 131],
...,
[ 1, 104, 127],
[ 1, 104, 127],
[ 2, 104, 127]]],
[[[227, 221, 218],
[231, 231, 227],
[237, 238, 239],
...,
[ 87, 56, 93],
[ 89, 58, 94],
[ 89, 57, 92]],
[[231, 229, 226],
[232, 231, 227],
[237, 237, 236],
...,
[ 89, 58, 95],
[ 88, 58, 93],
[ 89, 58, 94]],
[[236, 235, 233],
[236, 236, 235],
[237, 237, 236],
...,
[ 88, 58, 96],
[ 88, 58, 95],
[ 89, 58, 94]],
...,
[[104, 129, 151],
[105, 128, 150],
[106, 130, 152],
...,
[125, 125, 84],
[123, 124, 94],
[ 97, 104, 102]],
[[103, 129, 151],
[102, 128, 150],
[104, 129, 152],
...,
[115, 118, 87],
[111, 115, 91],
[ 87, 96, 99]],
[[100, 129, 149],
[100, 128, 149],
[103, 128, 150],
...,
[106, 112, 85],
[ 97, 106, 86],
[ 79, 85, 93]]],
[[[242, 225, 217],
[241, 223, 213],
[241, 222, 212],
...,
[250, 250, 231],
[253, 252, 248],
[253, 253, 252]],
[[252, 252, 252],
[255, 255, 255],
[255, 255, 255],
...,
[255, 255, 255],
[255, 255, 255],
[255, 255, 255]],
[[247, 234, 229],
[255, 255, 255],
[255, 255, 255],
...,
[255, 255, 255],
[255, 255, 255],
[255, 255, 255]],
...,
[[100, 97, 86],
[ 98, 95, 85],
[ 96, 95, 85],
...,
[114, 102, 86],
[111, 98, 79],
[114, 97, 77]],
[[100, 99, 90],
[ 89, 88, 79],
[ 76, 73, 65],
...,
[114, 100, 81],
[118, 102, 83],
[119, 100, 80]],
[[ 90, 89, 83],
[ 89, 86, 80],
[ 88, 83, 76],
...,
[102, 90, 73],
[104, 92, 76],
[103, 88, 74]]],
...,
[[[107, 147, 155],
[106, 146, 153],
[107, 146, 154],
...,
[214, 222, 204],
[214, 223, 206],
[213, 222, 204]],
[[106, 146, 153],
[106, 147, 154],
[105, 147, 153],
...,
[215, 222, 204],
[214, 222, 206],
[216, 223, 204]],
[[106, 146, 154],
[107, 147, 154],
[104, 146, 152],
...,
[215, 222, 204],
[215, 222, 205],
[210, 220, 204]],
...,
[[ 16, 24, 34],
[ 16, 23, 32],
[ 16, 23, 32],
...,
[ 16, 21, 28],
[ 16, 20, 29],
[ 16, 20, 29]],
[[ 18, 23, 34],
[ 17, 23, 34],
[ 17, 23, 32],
...,
[ 17, 21, 30],
[ 14, 20, 30],
[ 15, 20, 30]],
[[ 18, 22, 33],
[ 18, 23, 35],
[ 17, 22, 33],
...,
[ 20, 22, 30],
[ 45, 29, 30],
[ 28, 22, 29]]],
[[[237, 254, 248],
[250, 250, 250],
[255, 249, 253],
...,
[252, 250, 250],
[254, 250, 250],
[254, 252, 254]],
[[242, 254, 249],
[252, 251, 251],
[244, 243, 245],
...,
[236, 244, 239],
[253, 253, 253],
[254, 252, 255]],
[[248, 253, 251],
[255, 253, 254],
[226, 232, 232],
...,
[163, 172, 164],
[238, 239, 241],
[254, 253, 255]],
...,
[[252, 252, 252],
[255, 255, 255],
[188, 188, 188],
...,
[113, 113, 113],
[220, 220, 220],
[255, 255, 255]],
[[251, 251, 251],
[251, 251, 251],
[249, 249, 249],
...,
[152, 152, 152],
[225, 225, 225],
[255, 255, 255]],
[[252, 252, 252],
[253, 253, 253],
[254, 254, 254],
...,
[252, 252, 252],
[252, 252, 252],
[255, 255, 255]]],
[[[255, 255, 255],
[255, 255, 255],
[255, 255, 255],
...,
[179, 47, 35],
[179, 47, 35],
[179, 47, 35]],
[[255, 255, 255],
[255, 255, 255],
[255, 255, 255],
...,
[180, 48, 33],
[179, 47, 33],
[179, 47, 33]],
[[255, 255, 255],
[255, 255, 255],
[255, 255, 255],
...,
[181, 49, 32],
[181, 49, 35],
[181, 49, 35]],
...,
[[ 96, 4, 8],
[ 58, 2, 1],
[ 15, 1, 0],
...,
[136, 27, 33],
[138, 27, 33],
[138, 27, 33]],
[[ 82, 4, 4],
[ 32, 1, 2],
[ 12, 2, 3],
...,
[136, 27, 32],
[138, 27, 33],
[138, 27, 33]],
[[ 52, 1, 3],
[ 10, 1, 2],
[ 38, 4, 6],
...,
[138, 27, 33],
[137, 26, 32],
[138, 27, 33]]]], dtype=uint8)
np.save('images/poster_total.npy', poster_total)
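Because the posters are stacked in the same genre order as the rows of `total_df`, row `i` of `poster_total` belongs to row `i` of `total_df`. The sketch below checks this bookkeeping on small stand-in arrays. The block sizes, fill values and the `genre_of` helper are assumptions for illustration only.

```python
import numpy as np

# Genre order used when building total_df and poster_total
genres = ['drama', 'horror', 'romance', 'comedy', 'animation', 'crime', 'action']
per_genre = 2  # the notebook uses 200 per genre

# Small stand-ins for the (200, 256, 256, 3) arrays; block i is filled with value i
blocks = [np.full((per_genre, 4, 4, 3), i, dtype=np.uint8)
          for i in range(len(genres))]

stacked = np.concatenate(blocks, axis=0)


def genre_of(row, per_genre=per_genre):
    """Return the genre block that a stacked row index falls into."""
    return genres[row // per_genre]


print(stacked.shape)
print(genre_of(0), genre_of(len(stacked) - 1))
```

With 200 rows per genre, the same integer division maps rows 0 to 199 to drama and rows 1200 to 1399 to action.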