Year Prediction by Movie Posters using CNN - 2

1. Raw Data Scraping

Run on a local PC

Load Packages

import requests
from urllib import request
from bs4 import BeautifulSoup
from PIL import Image
import re
import pandas as pd
import numpy as np
from tqdm import tqdm
from selenium import webdriver
import time
tqdm.pandas()

Scraping Order

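  1. Build a separate DataFrame for each genre (7 DataFrames in total, 200 records each).
  2. Each DataFrame holds the movie code, Korean title, English title, year, netizen rating, age rating, link, and genre; the poster images are stored in a separate NumPy array per genre.
  3. To fetch the details, crawl pages 1 to 4 of each genre's ranking page, sorted by rating.
    • Every genre has rated movies through at least page 4, so collect those (200 per genre * 7 genres = 1,400 records).
    • Genre codes (the tg URL parameter): drama: 1, horror: 4, melodrama/romance: 5, comedy: 11, animation: 15, crime: 16, action: 19
  4. While crawling, open each movie's detail page and fetch the Korean title, English title, year, netizen rating, age rating, image, and genre.
  5. Repeat for each genre.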
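
The crawl plan above amounts to visiting 7 genres * 4 pages = 28 ranking pages. A minimal sketch of generating those URLs, assuming the ranking URL template used later in this notebook; the names `GENRE_CODES`, `RANK_URL`, and `ranking_urls` are illustrative and do not appear in the original code:

```python
# Genre codes copied from the list above; every name in this block is illustrative.
GENRE_CODES = {
    'drama': 1, 'horror': 4, 'romance': 5, 'comedy': 11,
    'animation': 15, 'crime': 16, 'action': 19,
}

# Same template as rank_url in the notebook, split over two lines for readability.
RANK_URL = ('https://movie.naver.com/movie/sdb/rank/rmovie.nhn'
            '?sel=pnt&date=20200616&tg={genre}&page={page}')


def ranking_urls(genre_codes=GENRE_CODES, pages=4):
    """Yield (genre_name, page, url) for every ranking page to crawl."""
    for name, code in genre_codes.items():
        for page in range(1, pages + 1):
            yield name, page, RANK_URL.format(genre=code, page=page)


jobs = list(ranking_urls())
print(len(jobs))  # 28 ranking pages in total
```

Driving the crawl from one such list would let a single loop replace the per-genre copies of the scraping cell, at the cost of keeping one browser session open for longer.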

Movies that require adult verification cannot be opened directly without being logged in, so log in beforehand.

login_url = 'https://nid.naver.com/nidlogin.login'
id_key = 'kmoshn815'
pw_key = '********'  # masked here
driver = webdriver.Chrome(executable_path='/Users/ohhyunkwon/Documents/2020 study/etc/chromedriver')
driver.get(login_url)
# Fill in the ID and password fields via JavaScript
driver.execute_script("document.getElementsByName('id')[0].value=\'" + id_key + "\'")
driver.execute_script("document.getElementsByName('pw')[0].value=\'" + pw_key + "\'")
# Submit the login form, then click the 'new.dontsave' button on the page that follows
driver.find_element_by_xpath('//*[@id="frmNIDLogin"]/fieldset/input').click()
driver.find_element_by_xpath('//*[@id="new.dontsave"]').click()
# Base NAVER movie URL
basic_url = 'https://movie.naver.com'
# Base ranking URL (genre and page are format placeholders)
rank_url = 'https://movie.naver.com/movie/sdb/rank/rmovie.nhn?sel=pnt&date=20200616&tg={genre}&page={page}'
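
The login cell builds its JavaScript by concatenating the ID and password between single quotes, which would break if either value contained a quote. A minimal sketch of a safer variant, assuming the credentials live in environment variables; the variable names `NAVER_ID` and `NAVER_PW` and both helper functions are hypothetical, not part of the original notebook:

```python
import json
import os


def load_credentials(env=os.environ):
    """Read the login ID and password from (hypothetical) environment variables."""
    return env.get('NAVER_ID', ''), env.get('NAVER_PW', '')


def fill_field_script(field_name, value):
    """Build the JavaScript that sets a form field by name.

    json.dumps emits a double-quoted literal with quotes and backslashes
    escaped, which is also a valid JavaScript string literal.
    """
    return 'document.getElementsByName({})[0].value = {};'.format(
        json.dumps(field_name), json.dumps(value))


# Demo with an in-memory mapping instead of the real environment.
demo_id, demo_pw = load_credentials({'NAVER_ID': 'user', 'NAVER_PW': "pa'ss"})
print(fill_field_script('id', demo_id))
print(fill_field_script('pw', demo_pw))
```

The resulting strings can be passed to `driver.execute_script` in place of the concatenated ones.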

a. Drama

drama_df = pd.DataFrame(columns=['code', 'title_kor', 'title_eng', 'year',
                                 'rating', 'rank', 'link', 'genre'])
drama_df
code title_kor title_eng year rating rank link genre
poster_drama = np.memmap('images/poster_drama', dtype=np.uint8, mode='w+', shape=(200, 256, 256, 3))
# Crawl 4 pages
for l in range(4):
    print('page ' + str(l + 1) + ' crawling...')
    # Open a single ranking page and wait before reading its source
    driver.get(rank_url.format(genre=1, page=l + 1))
    time.sleep(1)
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')

    # Look up the title cells once instead of re-searching the table on every iteration
    titles = soup.find('table', class_='list_ranking').find_all('div', class_='tit5')
    for k in tqdm(range(len(titles))):
        # link
        link = basic_url + titles[k].a.get('href')
        
        # rating
        rating = soup.find_all('td', class_='point')[k].text
        
        # code
        code_start = re.search('code=', link).span()[1]
        code = link[code_start:]
        
        # Open the detail page and wait before reading its source
        driver.get(link)
        time.sleep(1)
        link_html = driver.page_source
        link_soup = BeautifulSoup(link_html, 'lxml')

        # Movie info
        movie_info = link_soup.find('div', class_='mv_info')
        
        # title_kor
        title_kor = movie_info.h3.a.text
        
        # title_eng
        title_eng = movie_info.strong.text.split(',')[0].strip()
        
        # year
        year = movie_info.strong.text.split(',')[-1].strip()
        
        # rank (the age rating, e.g. '12세 관람가')
        rank = movie_info.find('p').find_all('span')[4].find_all('a')[0].text

        # genre
        genre = [g.strip() for g in movie_info.find('p', class_='info_spec').span.text.split(',')]

        # poster: strip the query string from the thumbnail URL, then download and resize
        poster_thumbnail = link_soup.find('div', class_='poster').a.img.get('src')
        poster_end = re.search(r'\?', poster_thumbnail).span()[0]
        poster_link = poster_thumbnail[:poster_end]
        poster_index = l * 50 + k  # assumes 50 movies per ranking page
        # Keep only the first three channels (drops an alpha channel if present)
        poster_drama[poster_index, :, :, :] = np.asarray(Image.open(request.urlopen(poster_link)).resize((256, 256)))[:, :, :3]

        # Append the record to the DataFrame (DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions)
        row = {'code': code, 'title_kor': title_kor, 'title_eng': title_eng,
               'year': year, 'rating': rating, 'rank': rank, 'link': link,
               'genre': genre}
        drama_df = pd.concat([drama_df, pd.DataFrame([row])], ignore_index=True)
print('Completed!')
page 1 crawling...

100%|██████████| 50/50 [01:25<00:00,  1.71s/it]

page 2 crawling...

100%|██████████| 50/50 [01:20<00:00,  1.61s/it]

page 3 crawling...

100%|██████████| 50/50 [01:20<00:00,  1.60s/it]

page 4 crawling...

100%|██████████| 50/50 [01:23<00:00,  1.67s/it]

Completed!
driver.close()
poster_drama.flush()
drama_df
|     | code | title_kor | title_eng | year | rating | rank | link | genre |
| 0 | 171539 | 그린 북 | Green Book | 2018 | 9.59 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마] |
| 1 | 174830 | 가버나움 | Capharnaum | 2018 | 9.58 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마] |
| 2 | 151196 | 원더 | Wonder | 2017 | 9.49 | 전체 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마] |
| 3 | 169240 | 아일라 | Ayla: The Daughter of War | 2017 | 9.48 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마, 전쟁] |
| 4 | 157243 | 당갈 | Dangal | 2016 | 9.47 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마, 액션] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | 34566 | 루키 | The Rookie | 2002 | 8.95 | 전체 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마] |
| 196 | 16792 | 흐르는 강물처럼 | A River Runs Through It | 1992 | 8.95 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마] |
| 197 | 129046 | 리틀 보이 | Little Boy | 2015 | 8.95 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마, 전쟁] |
| 198 | 83160 | 씨민과 나데르의 별거 | Jodaeiye Nader Az Simin | 2011 | 8.95 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마] |
| 199 | 27109 | 소년은 울지 않는다 | Boys Don't Cry | 1999 | 8.95 | 청소년 관람불가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [드라마] |

200 rows × 8 columns
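
The `code` column is obtained by slicing the link after the text `code=`, which would also capture any query parameters that follow it. A minimal sketch of a stricter alternative using the standard library's URL parser; the helper name `movie_code` and the example.com URLs are illustrative assumptions, not the real NAVER links:

```python
from urllib.parse import parse_qs, urlparse


def movie_code(link):
    """Return the value of the `code` query parameter, or None if it is absent."""
    values = parse_qs(urlparse(link).query).get('code')
    return values[0] if values else None


# Placeholder URLs for illustration only.
print(movie_code('https://example.com/movie/basic?code=171539'))
print(movie_code('https://example.com/movie/basic?code=171539&page=2'))
print(movie_code('https://example.com/movie/basic'))
```

Returning `None` for a link without a code makes a malformed row easy to spot, whereas `re.search(...).span()` raises an `AttributeError` when the pattern is missing.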

drama_df.to_csv('data/drama_df.csv', index=False)
np.save('images/poster_drama.npy', poster_drama)
poster_d = np.load('images/poster_drama.npy')
print(poster_d.shape)
poster_d
(200, 256, 256, 3)

b. Horror

driver = webdriver.Chrome(executable_path='/Users/ohhyunkwon/Documents/2020 study/etc/chromedriver')
driver.get(login_url)
time.sleep(1)
driver.execute_script("document.getElementsByName('id')[0].value=\'" + id_key + "\'")
driver.execute_script("document.getElementsByName('pw')[0].value=\'" + pw_key + "\'")
driver.find_element_by_xpath('//*[@id="frmNIDLogin"]/fieldset/input').click()
time.sleep(1)
driver.find_element_by_xpath('//*[@id="new.dontsave"]').click()
horror_df = pd.DataFrame(columns=['code', 'title_kor', 'title_eng', 'year',
                                 'rating', 'rank', 'link', 'genre'])
horror_df
code title_kor title_eng year rating rank link genre
poster_horror = np.memmap('images/poster_horror', dtype=np.uint8, mode='w+', shape=(200, 256, 256, 3))
# Crawl 4 pages
for l in range(4):
    print('page ' + str(l + 1) + ' crawling...')
    # Open a single ranking page and wait before reading its source
    driver.get(rank_url.format(genre=4, page=l + 1))
    time.sleep(1)
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')

    # Look up the title cells once instead of re-searching the table on every iteration
    titles = soup.find('table', class_='list_ranking').find_all('div', class_='tit5')
    for k in tqdm(range(len(titles))):
        # link
        link = basic_url + titles[k].a.get('href')
        
        # rating
        rating = soup.find_all('td', class_='point')[k].text
        
        # code
        code_start = re.search('code=', link).span()[1]
        code = link[code_start:]
        
        # Open the detail page and wait before reading its source
        driver.get(link)
        time.sleep(1)
        link_html = driver.page_source
        link_soup = BeautifulSoup(link_html, 'lxml')

        # Movie info
        movie_info = link_soup.find('div', class_='mv_info')
        
        # title_kor
        title_kor = movie_info.h3.a.text
        
        # title_eng
        title_eng = movie_info.strong.text.split(',')[0].strip()
        
        # year
        year = movie_info.strong.text.split(',')[-1].strip()
        
        # rank (the age rating, e.g. '12세 관람가')
        rank = movie_info.find('p').find_all('span')[4].find_all('a')[0].text

        # genre
        genre = [g.strip() for g in movie_info.find('p', class_='info_spec').span.text.split(',')]

        # poster: strip the query string from the thumbnail URL, then download and resize
        poster_thumbnail = link_soup.find('div', class_='poster').a.img.get('src')
        poster_end = re.search(r'\?', poster_thumbnail).span()[0]
        poster_link = poster_thumbnail[:poster_end]
        poster_index = l * 50 + k  # assumes 50 movies per ranking page
        # Keep only the first three channels (drops an alpha channel if present)
        poster_horror[poster_index, :, :, :] = np.asarray(Image.open(request.urlopen(poster_link)).resize((256, 256)))[:, :, :3]

        # Append the record to the DataFrame (DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions)
        row = {'code': code, 'title_kor': title_kor, 'title_eng': title_eng,
               'year': year, 'rating': rating, 'rank': rank, 'link': link,
               'genre': genre}
        horror_df = pd.concat([horror_df, pd.DataFrame([row])], ignore_index=True)
print('Completed!')
page 1 crawling...


100%|██████████| 50/50 [01:42<00:00,  2.06s/it]

page 2 crawling...

100%|██████████| 50/50 [01:37<00:00,  1.96s/it]

page 3 crawling...

100%|██████████| 50/50 [01:41<00:00,  2.03s/it]

page 4 crawling...

100%|██████████| 50/50 [01:48<00:00,  2.18s/it]

Completed!
driver.close()
poster_horror.flush()
horror_df
|     | code | title_kor | title_eng | year | rating | rank | link | genre |
| 0 | 17254 | 뱀파이어와의 인터뷰 | Interview With The Vampire: The Vampire Chroni... | 1994 | 9.12 | 청소년 관람불가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [공포, 드라마] |
| 1 | 10037 | 에이리언 | Alien | 1979 | 9.12 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [공포, SF] |
| 2 | 10050 | 싸이코 | Psycho | 1960 | 9.11 | 청소년 관람불가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [공포, 스릴러, 미스터리] |
| 3 | 10029 | 죠스 | Jaws | 1975 | 8.90 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [공포, 스릴러] |
| 4 | 126389 | 무서운 집 | Scary house | 2014 | 8.89 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [공포] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | 172003 | 속닥속닥 | The Whispering | 2018 | 5.06 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [공포, 미스터리] |
| 196 | 65535 | 해부학 교실 | Cadaver | 2007 | 5.06 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [공포, 미스터리] |
| 197 | 65901 | 할로윈: 살인마의 탄생 | Halloween | 2007 | 5.06 | 청소년 관람불가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [공포] |
| 198 | 63202 | 커버넌트 | The Covenant | 2006 | 5.05 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [공포, 스릴러] |
| 199 | 125436 | 포레스트: 죽음의 숲 | The Forest | 2016 | 5.04 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [공포] |

200 rows × 8 columns
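
In each genre section the posters are written into a disk-backed `np.memmap` during the crawl and then persisted with `np.save`. A small sketch of that round trip on toy data, assuming a temporary directory instead of the `images/` folder; the function `save_posters` and its parameters are illustrative names, not the notebook's own:

```python
import os
import tempfile

import numpy as np


def save_posters(images, n_slots, out_dir, name='poster_demo'):
    """Write images into a disk-backed array, then persist it as a .npy file.

    `images` is an iterable of (slot_index, HxWx3 uint8 array) pairs.
    """
    images = list(images)
    height, width, channels = images[0][1].shape
    raw_path = os.path.join(out_dir, name)
    posters = np.memmap(raw_path, dtype=np.uint8, mode='w+',
                        shape=(n_slots, height, width, channels))
    for index, image in images:
        posters[index] = image
    posters.flush()             # push pending writes to the raw file
    npy_path = raw_path + '.npy'
    np.save(npy_path, posters)  # the .npy header records shape and dtype
    del posters                 # release the memory map
    return npy_path


with tempfile.TemporaryDirectory() as tmp:
    demo = np.full((4, 4, 3), 7, dtype=np.uint8)
    loaded = np.load(save_posters([(0, demo)], n_slots=2, out_dir=tmp))

print(loaded.shape)          # (2, 4, 4, 3)
print(int(loaded[1].max()))  # 0: slot 1 was never written
```

Slots that are never written stay zero, so if a ranking page ever returned fewer than 50 entries, the corresponding rows of the poster array would hold all-black images. The raw memmap file carries no shape or dtype information, which is presumably why the notebook also saves a `.npy` copy.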

horror_df.to_csv('data/horror_df.csv', index=False)
np.save('images/poster_horror.npy', poster_horror)
poster_h = np.load('images/poster_horror.npy')
print(poster_h.shape)
poster_h
(200, 256, 256, 3)

c. Romance

driver = webdriver.Chrome(executable_path='/Users/ohhyunkwon/Documents/2020 study/etc/chromedriver')
driver.get(login_url)
time.sleep(1)
driver.execute_script("document.getElementsByName('id')[0].value=\'" + id_key + "\'")
driver.execute_script("document.getElementsByName('pw')[0].value=\'" + pw_key + "\'")
driver.find_element_by_xpath('//*[@id="frmNIDLogin"]/fieldset/input').click()
time.sleep(1)
driver.find_element_by_xpath('//*[@id="new.dontsave"]').click()
romance_df = pd.DataFrame(columns=['code', 'title_kor', 'title_eng', 'year',
                                 'rating', 'rank', 'link', 'genre'])
romance_df
code title_kor title_eng year rating rank link genre
poster_romance = np.memmap('images/poster_romance', dtype=np.uint8, mode='w+', shape=(200, 256, 256, 3))
# Crawl 4 pages
for l in range(4):
    print('page ' + str(l + 1) + ' crawling...')
    # Open a single ranking page and wait before reading its source
    driver.get(rank_url.format(genre=5, page=l + 1))
    time.sleep(1)
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')

    # Look up the title cells once instead of re-searching the table on every iteration
    titles = soup.find('table', class_='list_ranking').find_all('div', class_='tit5')
    for k in tqdm(range(len(titles))):
        # link
        link = basic_url + titles[k].a.get('href')
        
        # rating
        rating = soup.find_all('td', class_='point')[k].text
        
        # code
        code_start = re.search('code=', link).span()[1]
        code = link[code_start:]
        
        # Open the detail page and wait before reading its source
        driver.get(link)
        time.sleep(1)
        link_html = driver.page_source
        link_soup = BeautifulSoup(link_html, 'lxml')

        # Movie info
        movie_info = link_soup.find('div', class_='mv_info')
        
        # title_kor
        title_kor = movie_info.h3.a.text
        
        # title_eng
        title_eng = movie_info.strong.text.split(',')[0].strip()
        
        # year
        year = movie_info.strong.text.split(',')[-1].strip()
        
        # rank (the age rating, e.g. '12세 관람가')
        rank = movie_info.find('p').find_all('span')[4].find_all('a')[0].text

        # genre
        genre = [g.strip() for g in movie_info.find('p', class_='info_spec').span.text.split(',')]

        # poster: strip the query string from the thumbnail URL, then download and resize
        poster_thumbnail = link_soup.find('div', class_='poster').a.img.get('src')
        poster_end = re.search(r'\?', poster_thumbnail).span()[0]
        poster_link = poster_thumbnail[:poster_end]
        poster_index = l * 50 + k  # assumes 50 movies per ranking page
        # Keep only the first three channels (drops an alpha channel if present)
        poster_romance[poster_index, :, :, :] = np.asarray(Image.open(request.urlopen(poster_link)).resize((256, 256)))[:, :, :3]

        # Append the record to the DataFrame (DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions)
        row = {'code': code, 'title_kor': title_kor, 'title_eng': title_eng,
               'year': year, 'rating': rating, 'rank': rank, 'link': link,
               'genre': genre}
        romance_df = pd.concat([romance_df, pd.DataFrame([row])], ignore_index=True)
print('Completed!')
page 1 crawling...

100%|██████████| 50/50 [01:54<00:00,  2.30s/it]

page 2 crawling...

100%|██████████| 50/50 [01:42<00:00,  2.05s/it]

page 3 crawling...

100%|██████████| 50/50 [01:35<00:00,  1.92s/it]

page 4 crawling...

100%|██████████| 50/50 [01:40<00:00,  2.00s/it]

Completed!
driver.close()
poster_romance.flush()
romance_df
|     | code | title_kor | title_eng | year | rating | rank | link | genre |
| 0 | 10102 | 사운드 오브 뮤직 | The Sound Of Music | 1965 | 9.40 | 전체 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [멜로/로맨스, 뮤지컬, 드라마] |
| 1 | 35939 | 클래식 | The Classic | 2003 | 9.39 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [멜로/로맨스, 드라마] |
| 2 | 18847 | 타이타닉 | Titanic | 1997 | 9.38 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [멜로/로맨스, 드라마] |
| 3 | 39636 | 지금, 만나러 갑니다 | いま、会いにゆきます | 2004 | 9.34 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [멜로/로맨스, 드라마, 판타지] |
| 4 | 182348 | 로망 | Romang | 2019 | 9.30 | 전체 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [멜로/로맨스] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | 40163 | 게스 후? | Guess Who | 2005 | 7.42 | 12세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [멜로/로맨스, 코미디] |
| 196 | 85842 | 네버엔딩 스토리 | Never Ending Story | 2012 | 7.41 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [멜로/로맨스, 코미디] |
| 197 | 69977 | 참을 수 없는. | 2010 | 2010 | 7.39 | 청소년 관람불가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [멜로/로맨스, 드라마] |
| 198 | 69952 | 호우시절 | 好雨時節 | 2009 | 7.38 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [멜로/로맨스] |
| 199 | 64195 | 기다리다 미쳐 | Crazy For Wait | 2007 | 7.37 | 15세 관람가 | https://movie.naver.com/movie/bi/mi/basic.nhn?... | [멜로/로맨스] |

200 rows × 8 columns
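
Row 197 above has `2010` in both `title_eng` and `year`, presumably because that detail page lists only a year and no English title, so `split(',')[0]` and `split(',')[-1]` return the same piece. A minimal sketch of a split that handles this case and also keeps commas inside a title, assuming the text has the form `English Title, year`; the helper name `split_title_year` is illustrative:

```python
def split_title_year(text):
    """Split 'English Title, 2018' into (title, year).

    Only the last comma is treated as the separator, so commas inside the
    title are preserved. The title is None when the text holds only a year.
    """
    head, sep, tail = text.rpartition(',')
    if not sep:
        return None, text.strip()
    return head.strip(), tail.strip()


print(split_title_year('Green Book, 2018'))
print(split_title_year('Monsters, Inc., 2001'))
print(split_title_year('2010'))
```

With this helper the English title of such a row becomes `None` instead of repeating the year, which is easier to filter out before training.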

romance_df.to_csv('data/romance_df.csv', index=False)
np.save('images/poster_romance.npy', poster_romance)
poster_r = np.load('images/poster_romance.npy')
print(poster_r.shape)
poster_r
(200, 256, 256, 3)





array([[[[ 20,  66, 160],
         [ 20,  66, 160],
         [ 20,  66, 160],
         ...,
         [ 19,  86, 173],
         [ 19,  86, 173],
         [ 19,  86, 173]],

        [[ 20,  66, 160],
         [ 20,  66, 160],
         [ 21,  67, 161],
         ...,
         [ 20,  87, 174],
         [ 19,  86, 173],
         [ 20,  87, 174]],

        [[ 21,  67, 161],
         [ 20,  67, 161],
         [ 21,  67, 161],
         ...,
         [ 19,  86, 173],
         [ 19,  86, 173],
         [ 19,  86, 173]],

        ...,

        [[ 38,  70,  33],
         [ 38,  71,  34],
         [ 41,  74,  35],
         ...,
         [ 68,  97,  40],
         [ 63,  94,  40],
         [ 45,  80,  37]],

        [[ 43,  79,  37],
         [ 44,  80,  38],
         [ 41,  79,  35],
         ...,
         [ 42,  77,  35],
         [ 42,  79,  36],
         [ 46,  83,  39]],

        [[ 41,  78,  35],
         [ 41,  78,  35],
         [ 40,  77,  34],
         ...,
         [ 41,  78,  35],
         [ 42,  79,  36],
         [ 42,  79,  36]]],


       [[[255, 255, 255],
         [255, 255, 255],
         [255, 255, 255],
         ...,
         [255, 255, 255],
         [255, 255, 255],
         [255, 255, 255]],

        [[255, 255, 255],
         [255, 255, 255],
         [255, 255, 255],
         ...,
         [255, 255, 255],
         [255, 255, 255],
         [255, 255, 255]],

        [[255, 255, 255],
         [255, 255, 255],
         [255, 255, 255],
         ...,
         [255, 255, 255],
         [255, 255, 255],
         [255, 255, 255]],

        ...,

        [[ 16,  12,  13],
         [ 16,  12,  13],
         [ 18,  14,  14],
         ...,
         [  5,   1,   2],
         [  3,   1,   2],
         [ 18,  17,  17]],

        [[ 19,  15,  16],
         [ 19,  15,  16],
         [ 20,  16,  15],
         ...,
         [ 10,   6,   7],
         [ 11,   7,   8],
         [ 26,  22,  23]],

        [[ 67,  63,  64],
         [ 69,  65,  66],
         [ 69,  65,  64],
         ...,
         [ 69,  65,  66],
         [ 71,  67,  68],
         [ 79,  75,  76]]],


       [[[106,  75,  74],
         [103,  74,  70],
         [106,  73,  73],
         ...,
         [134,  96, 100],
         [135,  96, 100],
         [137,  96, 100]],

        [[112,  82,  85],
         [109,  79,  83],
         [111,  79,  84],
         ...,
         [144, 106, 116],
         [143, 106, 117],
         [144, 108, 119]],

        [[113,  81,  84],
         [111,  80,  83],
         [112,  80,  82],
         ...,
         [144, 105, 116],
         [142, 105, 114],
         [145, 106, 115]],

        ...,

        [[  3,   3,   3],
         [  0,   0,   0],
         [  0,   0,   0],
         ...,
         [  0,   0,   0],
         [  0,   0,   0],
         [  3,   3,   3]],

        [[  3,   3,   3],
         [  0,   0,   0],
         [  0,   0,   0],
         ...,
         [  0,   0,   0],
         [  0,   0,   0],
         [  3,   3,   3]],

        [[  5,   5,   5],
         [  2,   2,   2],
         [  2,   2,   2],
         ...,
         [  2,   2,   2],
         [  2,   2,   2],
         [  5,   5,   5]]],


       ...,


       [[[ 34,  23,  19],
         [ 34,  23,  19],
         [ 34,  23,  19],
         ...,
         [ 38,  27,  23],
         [ 37,  26,  22],
         [ 37,  26,  22]],

        [[ 34,  23,  19],
         [ 34,  23,  19],
         [ 34,  23,  19],
         ...,
         [ 37,  26,  22],
         [ 36,  25,  20],
         [ 35,  24,  19]],

        [[ 34,  23,  19],
         [ 34,  23,  19],
         [ 34,  23,  19],
         ...,
         [ 39,  28,  24],
         [ 39,  28,  24],
         [ 37,  26,  22]],

        ...,

        [[112,  74,  64],
         [140,  99,  85],
         [149, 106,  91],
         ...,
         [  2,   0,   2],
         [  3,   0,   1],
         [  3,   1,   2]],

        [[ 99,  64,  55],
         [123,  85,  73],
         [140,  99,  86],
         ...,
         [  4,   1,   2],
         [  4,   1,   2],
         [  6,   2,   3]],

        [[ 81,  50,  42],
         [107,  71,  61],
         [122,  83,  72],
         ...,
         [  6,   2,   3],
         [  7,   3,   4],
         [  8,   3,   4]]],


       [[[169, 202, 177],
         [171, 204, 174],
         [165, 203, 165],
         ...,
         [166, 146, 148],
         [166, 146, 148],
         [166, 147, 149]],

        [[163, 199, 173],
         [161, 198, 171],
         [155, 196, 163],
         ...,
         [168, 148, 150],
         [165, 146, 148],
         [164, 145, 147]],

        [[150, 194, 160],
         [150, 192, 159],
         [151, 193, 158],
         ...,
         [168, 149, 151],
         [165, 146, 148],
         [162, 143, 145]],

        ...,

        [[146, 131, 110],
         [146, 131, 111],
         [141, 128, 108],
         ...,
         [ 55,  81,  54],
         [ 53,  80,  53],
         [ 53,  78,  52]],

        [[139, 126, 107],
         [139, 126, 107],
         [138, 126, 107],
         ...,
         [ 49,  76,  50],
         [ 49,  76,  49],
         [ 50,  73,  47]],

        [[140, 129, 107],
         [134, 122, 103],
         [139, 126, 109],
         ...,
         [ 47,  74,  47],
         [ 49,  76,  48],
         [ 52,  77,  49]]],


       [[[ 82, 110,  59],
         [ 80, 108,  57],
         [ 80, 108,  57],
         ...,
         [ 80, 108,  57],
         [ 80, 108,  57],
         [ 82, 110,  59]],

        [[ 81, 109,  58],
         [ 79, 107,  56],
         [ 79, 107,  56],
         ...,
         [ 79, 107,  56],
         [ 79, 107,  56],
         [ 81, 109,  58]],

        [[ 82, 110,  59],
         [ 79, 107,  56],
         [ 79, 107,  56],
         ...,
         [ 79, 107,  56],
         [ 79, 107,  56],
         [ 81, 109,  58]],

        ...,

        [[167,  51,  50],
         [188,  52,  53],
         [194,  51,  54],
         ...,
         [184,  42,  62],
         [176,  40,  53],
         [178,  98,  93]],

        [[ 69,  17,  13],
         [ 92,  25,  22],
         [ 99,  27,  25],
         ...,
         [189,  41,  64],
         [180,  40,  57],
         [174,  58,  63]],

        [[ 64,  17,  13],
         [ 81,  21,  18],
         [ 99,  30,  27],
         ...,
         [192,  56,  74],
         [191,  41,  66],
         [181,  44,  56]]]], dtype=uint8)

d. Comedy

driver = webdriver.Chrome(executable_path='/Users/ohhyunkwon/Documents/2020 study/etc/chromedriver')
driver.get(login_url)
time.sleep(1)
driver.execute_script("document.getElementsByName('id')[0].value=\'" + id_key + "\'")
driver.execute_script("document.getElementsByName('pw')[0].value=\'" + pw_key + "\'")
driver.find_element_by_xpath('//*[@id="frmNIDLogin"]/fieldset/input').click()
time.sleep(1)
driver.find_element_by_xpath('//*[@id="new.dontsave"]').click()
comedy_df = pd.DataFrame(columns=['code', 'title_kor', 'title_eng', 'year',
                                  'rating', 'rank', 'link', 'genre'])
comedy_df
code title_kor title_eng year rating rank link genre
poster_comedy = np.memmap('images/poster_comedy', dtype=np.uint8, mode='w+', shape=(200, 256, 256, 3))
# Crawl the first 4 ranking pages (50 movies per page)
for l in range(4):
    print('page ' + str(l + 1) + ' crawling...')
    # Open a single ranking page
    driver.get(rank_url.format(genre=11, page=l + 1))
    html = driver.page_source
    time.sleep(1)
    soup = BeautifulSoup(html, 'lxml')
    
    for k in tqdm(range(len(soup.find('table', class_='list_ranking').find_all('div', class_='tit5')))):
        # link
        link = basic_url + soup.find('table', class_='list_ranking').find_all('div', class_='tit5')[k].a.get('href')
        
        # rating
        rating = soup.find_all('td', class_='point')[k].text
        
        # code
        code_start = re.search('code=', link).span()[1]
        code = link[code_start:]
        
        # Open the movie's detail page
        driver.get(link)
        link_html = driver.page_source
        time.sleep(1)
        link_soup = BeautifulSoup(link_html, 'lxml')
        
        # Movie info block
        movie_info = link_soup.find('div', class_='mv_info')
        
        # title_kor
        title_kor = movie_info.h3.a.text
        
        # title_eng
        title_eng = movie_info.strong.text.split(',')[0].strip()
        
        # year
        year = movie_info.strong.text.split(',')[-1].strip()
        
        # rank (age rating shown on the detail page)
        rank = movie_info.find('p').find_all('span')[4].find_all('a')[0].text

        # genre (comma-separated list on the detail page)
        genre = [g.strip() for g in movie_info.find('p', class_='info_spec').span.text.split(',')]
        
        # poster
        poster_thumbnail = link_soup.find('div', class_='poster').a.img.get('src')
        poster_end = re.search(r'\?', poster_thumbnail).span()[0]
        poster_link = poster_thumbnail[:poster_end]
        poster_index = l * 50 + k
        # Download the poster once. A poster served as a GIF decodes to a 2-D
        # (palette) array, so it has to be rebuilt as a 3-channel image.
        im = Image.open(request.urlopen(poster_link))
        if np.asarray(im).ndim == 2:
            im = im.resize((256, 256))
            background = Image.new('RGB', im.size, (255, 255, 255))
            background.paste(im)
            poster_array = np.asarray(background)
        else:
            poster_array = np.asarray(im.resize((256, 256)))[:, :, :3]
        poster_comedy[poster_index, :, :, :] = poster_array
        
        # Append the record to the DataFrame
        comedy_df = comedy_df.append({'code': code, 'title_kor': title_kor, 'title_eng': title_eng,
                                      'year': year, 'rating': rating, 'rank': rank, 'link': link,
                                      'genre': genre}, ignore_index=True)
print('Completed!')
page 1 crawling...

100%|██████████| 50/50 [02:07<00:00,  2.56s/it]

page 2 crawling...

100%|██████████| 50/50 [01:47<00:00,  2.16s/it]

page 3 crawling...

100%|██████████| 50/50 [02:10<00:00,  2.61s/it]

page 4 crawling...

100%|██████████| 50/50 [02:53<00:00,  3.47s/it]

Completed!
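
The poster branch in the loop above treats GIF posters (which decode to a 2-D palette array) separately from RGB/RGBA ones. The sketch below shows the same idea as a single helper. It assumes that Pillow's `Image.convert('RGB')` is an acceptable substitute for the white-background paste used in this notebook; `to_rgb_array` and the two synthetic images are hypothetical names introduced only for illustration.

```python
import numpy as np
from PIL import Image

def to_rgb_array(im, size=(256, 256)):
    """Return a (height, width, 3) uint8 array for a PIL image of any mode."""
    # convert('RGB') expands palette (GIF) and grayscale images to 3 channels
    # and drops the alpha channel of RGBA images.
    return np.asarray(im.convert('RGB').resize(size))

# Synthetic stand-ins: a palette image (GIF-like) and an RGBA image (PNG-like)
gif_like = Image.new('RGB', (120, 170), (200, 30, 30)).convert('P')
png_like = Image.new('RGBA', (120, 170), (10, 20, 30, 255))

print(np.asarray(gif_like).ndim)        # palette image is 2-D before conversion
print(to_rgb_array(gif_like).shape)
print(to_rgb_array(png_like).shape)
```

With a helper like this, the `if`/`else` on `ndim` would not be needed, at the cost of slightly different handling of transparent GIF pixels.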
driver.close()
poster_comedy.flush()
comedy_df
code title_kor title_eng year rating rank link genre
0 18543 서유기 2 - 선리기연 西遊記 完結篇 之 仙履奇緣: A Chinese Odyssey Part Two - C... 1994 9.35 15세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [코미디, 액션, 모험, 판타지, 멜로/로맨스]
1 73372 세 얼간이 3 Idiots 2009 9.35 12세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [코미디]
2 19099 트루먼 쇼 The Truman Show 1998 9.34 12세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [코미디, 드라마, SF]
3 87566 언터처블: 1%의 우정 Intouchables 2011 9.33 12세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [코미디, 드라마]
4 16210 미세스 다웃파이어 Mrs. Doubtfire 1993 9.33 12세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [코미디, 가족, 드라마]
... ... ... ... ... ... ... ... ...
195 39397 윔블던 Wimbledon 2004 8.10 15세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [코미디, 멜로/로맨스]
196 91073 박수건달 Man on the Edge 2012 8.10 15세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [코미디]
197 98738 프란시스 하 Frances Ha 2012 8.10 15세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [코미디, 멜로/로맨스]
198 62219 색즉시공 시즌 2 Sex Is Zero 2 2007 8.09 청소년 관람불가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [코미디]
199 74954 두 번의 결혼식과 한 번의 장례식 Two Weddings And A Funeral 2012 8.09 15세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [코미디, 멜로/로맨스]

200 rows × 8 columns
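
The `code` column is cut out of each link by slicing everything after `code=`, which only works while `code` is the last query parameter. A hedged alternative using the standard library is sketched below; `movie_code` is a hypothetical helper, and the URLs are made-up examples rather than real Naver links.

```python
from urllib.parse import urlparse, parse_qs

def movie_code(link):
    """Return the value of the `code` query parameter, or None if it is absent."""
    values = parse_qs(urlparse(link).query).get('code')
    return values[0] if values else None

# Hypothetical links, for illustration only
print(movie_code('https://example.com/movie/basic.nhn?code=12345'))
print(movie_code('https://example.com/movie/basic.nhn?code=12345&tab=1'))
print(movie_code('https://example.com/movie/basic.nhn'))
```

The second call shows the case the slicing approach would get wrong: a trailing parameter would otherwise end up inside the code string.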

comedy_df.to_csv('data/comedy_df.csv', index=False)
np.save('images/poster_comedy.npy', poster_comedy)
poster_c = np.load('images/poster_comedy.npy')
print(poster_c.shape)
poster_c
(200, 256, 256, 3)





array([[[[  8,   8,   8],
         [  8,   8,   8],
         [  9,   9,   9],
         ...,
         [  8,   8,   8],
         [  8,   8,   8],
         [  8,   8,   8]],

        [[  9,   9,   9],
         [  9,   9,   9],
         [  9,   9,   9],
         ...,
         [  9,   9,   9],
         [  8,   8,   8],
         [  8,   8,   8]],

        [[  9,   9,   9],
         [  9,   9,   9],
         [ 10,  10,  10],
         ...,
         [  9,   9,   9],
         [  9,   9,   9],
         [  9,   9,   9]],

        ...,

        [[  8,   7,   5],
         [ 11,   9,   7],
         [ 13,  10,   8],
         ...,
         [ 31,  28,  24],
         [ 30,  27,  23],
         [ 30,  27,  23]],

        [[  0,   0,   0],
         [  0,   0,   0],
         [  0,   0,   0],
         ...,
         [ 44,  37,  31],
         [ 47,  40,  33],
         [ 48,  41,  34]],

        [[  0,   0,   0],
         [  0,   0,   0],
         [  0,   0,   0],
         ...,
         [  2,   1,   1],
         [  5,   4,   4],
         [  9,   8,   7]]],


       [[[255, 255, 255],
         [255, 255, 255],
         [255, 255, 255],
         ...,
         [255, 255, 255],
         [255, 255, 255],
         [255, 255, 255]],

        [[255, 255, 255],
         [255, 255, 255],
         [255, 255, 255],
         ...,
         [255, 255, 255],
         [255, 255, 255],
         [255, 255, 255]],

        [[255, 255, 255],
         [255, 255, 255],
         [255, 255, 255],
         ...,
         [255, 255, 255],
         [255, 255, 255],
         [255, 255, 255]],

        ...,

        [[252, 252, 252],
         [253, 253, 253],
         [253, 253, 253],
         ...,
         [252, 254, 253],
         [252, 254, 253],
         [252, 254, 253]],

        [[251, 254, 253],
         [253, 253, 253],
         [252, 253, 252],
         ...,
         [251, 253, 252],
         [251, 253, 252],
         [251, 253, 252]],

        [[251, 253, 252],
         [251, 252, 252],
         [251, 253, 252],
         ...,
         [251, 253, 252],
         [251, 253, 252],
         [252, 254, 253]]],


       [[[244, 248, 252],
         [244, 248, 252],
         [244, 248, 252],
         ...,
         [237, 246, 251],
         [237, 246, 251],
         [237, 246, 251]],

        [[244, 248, 252],
         [243, 248, 252],
         [244, 248, 253],
         ...,
         [237, 246, 251],
         [237, 246, 251],
         [237, 246, 251]],

        [[244, 248, 252],
         [243, 248, 252],
         [244, 248, 252],
         ...,
         [237, 246, 251],
         [237, 246, 251],
         [237, 246, 251]],

        ...,

        [[111,  87,  98],
         [ 77,  52,  65],
         [ 98,  73,  84],
         ...,
         [ 52,  24,  41],
         [ 46,  20,  36],
         [ 51,  24,  40]],

        [[103,  80,  90],
         [107,  81,  92],
         [105,  78,  90],
         ...,
         [ 54,  27,  43],
         [ 49,  24,  39],
         [ 54,  28,  44]],

        [[ 73,  50,  61],
         [ 96,  71,  82],
         [ 90,  65,  76],
         ...,
         [ 51,  26,  40],
         [ 62,  36,  52],
         [ 61,  36,  51]]],


       ...,


       [[[119, 117, 107],
         [ 20,  21,  16],
         [ 21,  22,  17],
         ...,
         [100, 100,  91],
         [102, 102,  93],
         [ 70,  71,  64]],

        [[111, 110, 102],
         [ 19,  19,  16],
         [ 21,  22,  18],
         ...,
         [ 73,  74,  68],
         [ 76,  77,  70],
         [ 36,  37,  31]],

        [[112, 110, 103],
         [ 19,  20,  17],
         [ 21,  22,  17],
         ...,
         [ 44,  45,  40],
         [ 32,  33,  28],
         [ 28,  28,  23]],

        ...,

        [[ 28,  28,  23],
         [ 28,  27,  22],
         [ 24,  23,  19],
         ...,
         [ 27,  28,  23],
         [ 30,  31,  26],
         [ 30,  31,  27]],

        [[ 14,  14,  10],
         [ 16,  16,  11],
         [ 18,  17,  14],
         ...,
         [ 26,  27,  22],
         [ 32,  33,  28],
         [ 34,  35,  30]],

        [[ 64,  63,  55],
         [ 43,  42,  36],
         [ 18,  17,  12],
         ...,
         [ 34,  35,  30],
         [ 37,  38,  33],
         [ 41,  42,  37]]],


       [[[ 13,   0,   0],
         [ 20,   0,   1],
         [ 22,   2,   3],
         ...,
         [ 26,  10,   9],
         [ 70,  37,  33],
         [ 25,   9,   9]],

        [[  9,   1,   1],
         [ 10,   0,   0],
         [ 17,   0,   1],
         ...,
         [ 28,  12,  12],
         [ 30,  13,  13],
         [ 10,   0,   1]],

        [[ 32,  14,  12],
         [ 19,   6,   6],
         [  7,   0,   1],
         ...,
         [ 29,  12,  11],
         [ 12,   1,   1],
         [ 13,   1,   1]],

        ...,

        [[  4,   0,   0],
         [  5,   0,   0],
         [  6,   0,   0],
         ...,
         [  2,   0,   1],
         [  2,   0,   1],
         [  2,   0,   1]],

        [[  4,   0,   0],
         [  5,   0,   0],
         [  6,   0,   0],
         ...,
         [  2,   0,   1],
         [  2,   0,   1],
         [  2,   0,   1]],

        [[  4,   0,   0],
         [  4,   0,   0],
         [  4,   0,   0],
         ...,
         [  1,   0,   0],
         [  2,   0,   1],
         [  2,   0,   1]]],


       [[[227, 223, 215],
         [225, 220, 213],
         [223, 218, 212],
         ...,
         [ 25,  11,   7],
         [ 57,  41,  34],
         [103,  86,  78]],

        [[225, 220, 214],
         [222, 217, 211],
         [219, 214, 210],
         ...,
         [ 17,   4,   2],
         [ 27,  12,   7],
         [ 54,  36,  27]],

        [[218, 213, 208],
         [220, 215, 210],
         [218, 213, 209],
         ...,
         [ 16,   3,   3],
         [ 15,   2,   0],
         [ 26,   8,   3]],

        ...,

        [[ 61,  55,  54],
         [ 62,  56,  57],
         [ 63,  58,  59],
         ...,
         [137, 117, 110],
         [135, 113, 107],
         [133, 113, 104]],

        [[ 63,  57,  58],
         [ 63,  57,  59],
         [ 63,  57,  59],
         ...,
         [136, 116, 109],
         [137, 115, 108],
         [137, 114, 106]],

        [[ 61,  58,  60],
         [ 61,  57,  58],
         [ 62,  57,  57],
         ...,
         [135, 113, 106],
         [136, 113, 107],
         [140, 117, 109]]]], dtype=uint8)
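
Each genre ends up with two files: the raw `np.memmap` buffer and the `.npy` copy written by `np.save`. A raw memmap file has no header, so shape and dtype must be supplied again when reopening it, while the `.npy` file stores both. The sketch below illustrates the difference on a toy array in a temporary directory; the file names and the small shape are assumptions made for the example only.

```python
import os
import tempfile
import numpy as np

shape = (4, 8, 8, 3)  # toy stand-in for (200, 256, 256, 3)

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, 'poster_demo')
    npy_path = os.path.join(tmp, 'poster_demo.npy')

    # Write through a memmap, then flush and save a .npy copy
    buf = np.memmap(raw_path, dtype=np.uint8, mode='w+', shape=shape)
    buf[:] = 7
    buf.flush()
    np.save(npy_path, buf)

    # The raw file holds exactly one byte per uint8 element, with no header
    raw_size = os.path.getsize(raw_path)

    # Reopening the raw file needs dtype and shape; np.load needs neither
    reopened = np.memmap(raw_path, dtype=np.uint8, mode='r', shape=shape)
    loaded = np.load(npy_path)
    same = bool(np.array_equal(reopened, loaded))

    del buf, reopened  # release the memory maps before the directory is removed

print(raw_size, loaded.shape, same)
```

Since the `.npy` file is self-describing, it is the safer one to carry over to the training step; the raw memmap file can be deleted once the `.npy` copy has been verified.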

e. Animation

driver = webdriver.Chrome(executable_path='/Users/ohhyunkwon/Documents/2020 study/etc/chromedriver')
driver.get(login_url)
time.sleep(1)
driver.execute_script("document.getElementsByName('id')[0].value=\'" + id_key + "\'")
driver.execute_script("document.getElementsByName('pw')[0].value=\'" + pw_key + "\'")
driver.find_element_by_xpath('//*[@id="frmNIDLogin"]/fieldset/input').click()
time.sleep(1)
driver.find_element_by_xpath('//*[@id="new.dontsave"]').click()
animation_df = pd.DataFrame(columns=['code', 'title_kor', 'title_eng', 'year',
                                     'rating', 'rank', 'link', 'genre'])
animation_df
code title_kor title_eng year rating rank link genre
poster_animation = np.memmap('images/poster_animation', dtype=np.uint8, mode='w+', shape=(200, 256, 256, 3))
# Crawl the first 4 ranking pages (50 movies per page)
for l in range(4):
    print('page ' + str(l + 1) + ' crawling...')
    # Open a single ranking page
    driver.get(rank_url.format(genre=15, page=l + 1))
    html = driver.page_source
    time.sleep(1)
    soup = BeautifulSoup(html, 'lxml')
    
    for k in tqdm(range(len(soup.find('table', class_='list_ranking').find_all('div', class_='tit5')))):
        # link
        link = basic_url + soup.find('table', class_='list_ranking').find_all('div', class_='tit5')[k].a.get('href')
        
        # rating
        rating = soup.find_all('td', class_='point')[k].text
        
        # code
        code_start = re.search('code=', link).span()[1]
        code = link[code_start:]
        
        # Open the movie's detail page
        driver.get(link)
        link_html = driver.page_source
        time.sleep(1)
        link_soup = BeautifulSoup(link_html, 'lxml')
        
        # Movie info block
        movie_info = link_soup.find('div', class_='mv_info')
        
        # title_kor
        title_kor = movie_info.h3.a.text
        
        # title_eng
        title_eng = movie_info.strong.text.split(',')[0].strip()
        
        # year
        year = movie_info.strong.text.split(',')[-1].strip()
        
        # rank (age rating shown on the detail page); NaN when the page has none
        try:
            rank = movie_info.find('p').find_all('span')[4].find_all('a')[0].text
        except (AttributeError, IndexError):
            rank = np.nan

        # genre (comma-separated list on the detail page)
        genre = [g.strip() for g in movie_info.find('p', class_='info_spec').span.text.split(',')]
        
        # poster
        poster_thumbnail = link_soup.find('div', class_='poster').a.img.get('src')
        poster_end = re.search(r'\?', poster_thumbnail).span()[0]
        poster_link = poster_thumbnail[:poster_end]
        poster_index = l * 50 + k
        # Download the poster once. A poster served as a GIF decodes to a 2-D
        # (palette) array, so it has to be rebuilt as a 3-channel image.
        im = Image.open(request.urlopen(poster_link))
        if np.asarray(im).ndim == 2:
            im = im.resize((256, 256))
            background = Image.new('RGB', im.size, (255, 255, 255))
            background.paste(im)
            poster_array = np.asarray(background)
        else:
            poster_array = np.asarray(im.resize((256, 256)))[:, :, :3]
        poster_animation[poster_index, :, :, :] = poster_array
        
        # Append the record to the DataFrame
        animation_df = animation_df.append({'code': code, 'title_kor': title_kor, 'title_eng': title_eng,
                                            'year': year, 'rating': rating, 'rank': rank, 'link': link,
                                            'genre': genre}, ignore_index=True)
print('Completed!')
page 1 crawling...

100%|██████████| 50/50 [03:46<00:00,  4.53s/it]

page 2 crawling...

100%|██████████| 50/50 [02:27<00:00,  2.95s/it]

page 3 crawling...

100%|██████████| 50/50 [03:27<00:00,  4.16s/it]

page 4 crawling...

100%|██████████| 50/50 [03:09<00:00,  3.78s/it]

Completed!
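
The slot for each poster is computed as `poster_index = l * 50 + k`, which is correct here because every ranking page returned exactly 50 entries (see the progress bars above). If a page ever came back short, that formula would leave gaps in the memmap. A running counter avoids this; the sketch below is a hypothetical illustration with made-up page sizes, not part of the original crawl.

```python
def poster_slots(page_sizes):
    """Yield (page, k, slot), where slot is a running index across all pages."""
    slot = 0
    for page, size in enumerate(page_sizes):
        for k in range(size):
            yield page, k, slot
            slot += 1

# Hypothetical run in which the second page is two entries short
slots = list(poster_slots([50, 48, 50, 50]))
print(len(slots))
print(slots[98])   # first entry of the third page
```

With `l * 50 + k` the first entry of the third page would land in slot 100, leaving slots 98 and 99 filled with zeros.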
driver.close()
poster_animation.flush()
animation_df
code title_kor title_eng year rating rank link genre
0 69105 월-E WALL-E 2008 9.41 전체 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [애니메이션, SF, 가족, 코미디, 멜로/로맨스, 모험]
1 32686 센과 치히로의 행방불명 千と千尋の神隠し 2001 9.39 전체 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [애니메이션, 판타지, 모험, 가족]
2 66463 토이 스토리 3 Toy Story 3 2010 9.38 전체 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [애니메이션, 모험, 코미디, 가족, 판타지]
3 130850 주토피아 Zootopia 2016 9.35 전체 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [애니메이션, 액션, 모험, 코미디, 가족]
4 19303 모노노케 히메 もののけ-: Mononoke Hime 1997 9.35 전체 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [애니메이션, 모험, 액션]
... ... ... ... ... ... ... ... ...
195 34449 아이스 에이지 Ice Age 2002 8.58 전체 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [애니메이션, 모험, 가족, 판타지, 코미디]
196 122581 극장판 도라에몽 진구의 아프리카 모험 : 베코와 5인의 탐험대 映画ドラえもん 新・のび太の大魔境 ~ペコと5人の探検隊~ 2014 8.58 전체 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [애니메이션, 모험]
197 144355 감바의 대모험 GAMBA ガンバと仲間たち 2015 8.57 전체 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [애니메이션]
198 17230 포카혼타스 Pocahontas 1995 8.57 전체 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [애니메이션, 가족, 모험, 드라마, 멜로/로맨스]
199 134980 마이펫의 이중생활 The Secret Life of Pets 2016 8.56 전체 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [애니메이션, 코미디, 가족]

200 rows × 8 columns
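
`title_eng` and `year` both come from one string that is split on commas: the first piece becomes the title and the last piece the year. A title that itself contains a comma would be truncated by this. The sketch below splits on the last comma only. It assumes the text has the layout `'<title>, <year>'`, which is inferred from the parsing code rather than confirmed; `split_title_year` and the second sample string are hypothetical.

```python
def split_title_year(text):
    """Split '<title>, <year>' on the last comma so commas inside the title survive."""
    title, _, year = text.rpartition(',')
    return title.strip(), year.strip()

print(split_title_year('3 Idiots, 2009'))
print(split_title_year('Hypothetical Title, With Comma, 2001'))
```

If the text contains no comma at all, `rpartition` returns an empty title and puts the whole string into the year field, so such rows would still need a manual check.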

animation_df.isnull().sum()
code         0
title_kor    0
title_eng    0
year         0
rating       0
rank         1
link         0
genre        0
dtype: int64
# Fill the one missing age rating with '전체 관람가' (suitable for all ages)
animation_df.loc[animation_df['rank'].isnull(), 'rank'] = '전체 관람가'
animation_df.isnull().sum()
code         0
title_kor    0
title_eng    0
year         0
rating       0
rank         0
link         0
genre        0
dtype: int64
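
Before filling a missing `rank`, it can help to look at which record is affected, so that the fill value is a deliberate choice. A minimal sketch on a toy DataFrame follows; the titles are placeholders, and only the rating strings are taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing age rating (titles are placeholders)
toy = pd.DataFrame({'title_kor': ['A', 'B', 'C'],
                    'rank': ['전체 관람가', np.nan, '12세 관람가']})

# Inspect the affected rows first
missing = toy['rank'].isnull()
print(toy.loc[missing, 'title_kor'].tolist())

# Then fill with the chosen default
toy['rank'] = toy['rank'].fillna('전체 관람가')
print(toy['rank'].isnull().sum())
```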
animation_df.to_csv('data/animation_df.csv', index=False)
np.save('images/poster_animation.npy', poster_animation)
poster_a = np.load('images/poster_animation.npy')
print(poster_a.shape)
poster_a
(200, 256, 256, 3)





array([[[[255, 255, 255],
         [255, 255, 255],
         [255, 255, 255],
         ...,
         [255, 255, 255],
         [255, 255, 255],
         [255, 255, 255]],

        [[255, 255, 255],
         [255, 255, 255],
         [255, 255, 255],
         ...,
         [255, 255, 255],
         [255, 255, 255],
         [255, 255, 255]],

        [[255, 255, 255],
         [255, 255, 255],
         [255, 255, 255],
         ...,
         [255, 255, 255],
         [255, 255, 255],
         [255, 255, 255]],

        ...,

        [[255, 255, 255],
         [255, 255, 255],
         [255, 255, 255],
         ...,
         [255, 255, 255],
         [255, 255, 255],
         [255, 255, 255]],

        [[255, 255, 255],
         [255, 255, 255],
         [255, 255, 255],
         ...,
         [254, 254, 254],
         [255, 255, 255],
         [255, 255, 255]],

        [[255, 255, 255],
         [255, 255, 255],
         [255, 255, 255],
         ...,
         [255, 255, 255],
         [255, 255, 255],
         [255, 255, 255]]],


       [[[149,  21,  47],
         [126,  13,  38],
         [100,  14,  37],
         ...,
         [ 65,  43,  43],
         [ 67,  42,  44],
         [150,  27,  43]],

        [[190,  38,  71],
         [150,  22,  42],
         [115,   9,  37],
         ...,
         [ 70,  43,  44],
         [ 70,  42,  43],
         [156,  29,  43]],

        [[220,  76, 125],
         [165,  26,  46],
         [120,  10,  36],
         ...,
         [ 59,  44,  44],
         [ 77,  42,  44],
         [170,  36,  47]],

        ...,

        [[ 43,  55,  64],
         [ 44,  55,  65],
         [ 45,  55,  65],
         ...,
         [ 86,  80,  81],
         [ 86,  80,  81],
         [ 86,  80,  81]],

        [[ 44,  54,  64],
         [ 43,  55,  65],
         [ 44,  54,  64],
         ...,
         [ 86,  80,  81],
         [ 86,  80,  80],
         [ 86,  80,  81]],

        [[ 44,  54,  64],
         [ 44,  54,  64],
         [ 44,  54,  64],
         ...,
         [ 85,  79,  80],
         [ 85,  79,  79],
         [ 85,  79,  81]]],


       [[[  0,  19,  56],
         [  0,  18,  56],
         [  0,  21,  58],
         ...,
         [  2,  36,  81],
         [  2,  50,  95],
         [  2,  65, 110]],

        [[  0,  19,  57],
         [  1,  19,  57],
         [  1,  23,  60],
         ...,
         [  1,  70, 115],
         [  1,  64, 109],
         [  2,  51,  88]],

        [[  1,  22,  59],
         [  1,  25,  62],
         [  2,  28,  67],
         ...,
         [  1,  48,  87],
         [  1,  42,  80],
         [  0,  31,  64]],

        ...,

        [[ 19,  60, 108],
         [ 18,  60, 109],
         [ 21,  64, 114],
         ...,
         [  8,  44,  83],
         [ 10,  46,  85],
         [ 12,  48,  88]],

        [[ 19,  59, 109],
         [ 19,  61, 109],
         [ 15,  56, 104],
         ...,
         [  7,  44,  85],
         [  3,  41,  79],
         [  7,  45,  84]],

        [[ 11,  53,  97],
         [ 14,  55, 100],
         [ 13,  54,  99],
         ...,
         [  4,  43,  82],
         [  2,  40,  77],
         [  9,  44,  84]]],


       ...,


       [[[255, 255, 245],
         [255, 255, 245],
         [255, 255, 245],
         ...,
         [255, 255, 245],
         [255, 255, 245],
         [255, 255, 245]],

        [[255, 255, 245],
         [255, 255, 245],
         [255, 255, 245],
         ...,
         [255, 255, 245],
         [255, 255, 245],
         [255, 255, 245]],

        [[255, 255, 245],
         [255, 255, 245],
         [255, 255, 245],
         ...,
         [255, 255, 245],
         [255, 255, 245],
         [255, 255, 245]],

        ...,

        [[123, 158, 174],
         [127, 162, 178],
         [119, 156, 173],
         ...,
         [186, 199, 175],
         [163, 177, 151],
         [165, 173, 138]],

        [[ 91, 140, 163],
         [ 90, 139, 161],
         [ 97, 148, 163],
         ...,
         [221, 223, 207],
         [159, 157, 103],
         [170, 180, 138]],

        [[112, 161, 149],
         [121, 171, 145],
         [121, 171, 146],
         ...,
         [241, 246, 241],
         [179, 182, 134],
         [217, 218, 182]]],


       [[[229, 134, 154],
         [230, 135, 155],
         [229, 134, 154],
         ...,
         [238, 127, 170],
         [237, 126, 168],
         [239, 128, 170]],

        [[231, 136, 155],
         [232, 138, 156],
         [231, 137, 155],
         ...,
         [238, 127, 169],
         [237, 126, 168],
         [238, 127, 169]],

        [[234, 140, 158],
         [233, 139, 156],
         [233, 139, 156],
         ...,
         [238, 126, 169],
         [238, 126, 169],
         [239, 126, 169]],

        ...,

        [[ 26,  18,  16],
         [ 24,  16,  14],
         [ 22,  14,  12],
         ...,
         [254, 132, 145],
         [253, 131, 144],
         [253, 132, 145]],

        [[ 24,  16,  14],
         [ 23,  15,  13],
         [ 22,  14,  12],
         ...,
         [252, 129, 147],
         [253, 131, 148],
         [253, 131, 148]],

        [[ 23,  15,  13],
         [ 23,  15,  12],
         [ 23,  15,  13],
         ...,
         [253, 130, 150],
         [254, 129, 150],
         [252, 126, 148]]],


       [[[ 95, 109, 162],
         [ 92, 108, 168],
         [ 92, 108, 167],
         ...,
         [ 59,  86, 151],
         [ 58,  86, 151],
         [ 65,  88, 146]],

        [[100, 116, 176],
         [100, 119, 186],
         [100, 118, 186],
         ...,
         [ 64,  93, 167],
         [ 63,  93, 168],
         [ 68,  94, 160]],

        [[100, 117, 175],
         [ 99, 118, 186],
         [100, 116, 185],
         ...,
         [ 64,  92, 166],
         [ 63,  92, 167],
         [ 69,  94, 159]],

        ...,

        [[120,  84,  82],
         [121,  80,  77],
         [117,  77,  76],
         ...,
         [111,  75,  73],
         [118,  79,  77],
         [112,  80,  78]],

        [[118,  85,  84],
         [131,  91,  90],
         [123,  84,  83],
         ...,
         [123,  84,  82],
         [131,  91,  89],
         [115,  81,  79]],

        [[122,  89,  88],
         [117,  79,  79],
         [117,  79,  77],
         ...,
         [119,  81,  79],
         [120,  82,  80],
         [128,  93,  91]]]], dtype=uint8)

f. Crime

driver = webdriver.Chrome(executable_path='/Users/ohhyunkwon/Documents/2020 study/etc/chromedriver')
driver.get(login_url)
time.sleep(1)
driver.execute_script("document.getElementsByName('id')[0].value=\'" + id_key + "\'")
driver.execute_script("document.getElementsByName('pw')[0].value=\'" + pw_key + "\'")
driver.find_element_by_xpath('//*[@id="frmNIDLogin"]/fieldset/input').click()
time.sleep(1)
driver.find_element_by_xpath('//*[@id="new.dontsave"]').click()
crime_df = pd.DataFrame(columns=['code', 'title_kor', 'title_eng', 'year',
                                 'rating', 'rank', 'link', 'genre'])
crime_df
code title_kor title_eng year rating rank link genre
poster_crime = np.memmap('images/poster_crime', dtype=np.uint8, mode='w+', shape=(200, 256, 256, 3))
# Crawl the first 4 ranking pages (50 movies per page)
for l in range(4):
    print('page ' + str(l + 1) + ' crawling...')
    # Open a single ranking page
    driver.get(rank_url.format(genre=16, page=l + 1))
    html = driver.page_source
    time.sleep(1)
    soup = BeautifulSoup(html, 'lxml')
    
    for k in tqdm(range(len(soup.find('table', class_='list_ranking').find_all('div', class_='tit5')))):
        # link
        link = basic_url + soup.find('table', class_='list_ranking').find_all('div', class_='tit5')[k].a.get('href')
        
        # rating
        rating = soup.find_all('td', class_='point')[k].text
        
        # code
        code_start = re.search('code=', link).span()[1]
        code = link[code_start:]
        
        # Open the movie's detail page
        driver.get(link)
        link_html = driver.page_source
        time.sleep(1)
        link_soup = BeautifulSoup(link_html, 'lxml')
        
        # Movie info block
        movie_info = link_soup.find('div', class_='mv_info')
        
        # title_kor
        title_kor = movie_info.h3.a.text
        
        # title_eng
        title_eng = movie_info.strong.text.split(',')[0].strip()
        
        # year
        year = movie_info.strong.text.split(',')[-1].strip()
        
        # rank (age rating shown on the detail page); NaN when the page has none
        try:
            rank = movie_info.find('p').find_all('span')[4].find_all('a')[0].text
        except (AttributeError, IndexError):
            rank = np.nan

        # genre (comma-separated list on the detail page)
        genre = [g.strip() for g in movie_info.find('p', class_='info_spec').span.text.split(',')]
        
        # poster
        poster_thumbnail = link_soup.find('div', class_='poster').a.img.get('src')
        poster_end = re.search(r'\?', poster_thumbnail).span()[0]
        poster_link = poster_thumbnail[:poster_end]
        poster_index = l * 50 + k
        # Download the poster once. A poster served as a GIF decodes to a 2-D
        # (palette) array, so it has to be rebuilt as a 3-channel image.
        im = Image.open(request.urlopen(poster_link))
        if np.asarray(im).ndim == 2:
            im = im.resize((256, 256))
            background = Image.new('RGB', im.size, (255, 255, 255))
            background.paste(im)
            poster_array = np.asarray(background)
        else:
            poster_array = np.asarray(im.resize((256, 256)))[:, :, :3]
        poster_crime[poster_index, :, :, :] = poster_array
        
        # Append the record to the DataFrame
        crime_df = crime_df.append({'code': code, 'title_kor': title_kor, 'title_eng': title_eng,
                                    'year': year, 'rating': rating, 'rank': rank, 'link': link,
                                    'genre': genre}, ignore_index=True)
print('Completed!')
page 1 crawling...

100%|██████████| 50/50 [03:36<00:00,  4.33s/it]

page 2 crawling...

100%|██████████| 50/50 [02:01<00:00,  2.44s/it]

page 3 crawling...

100%|██████████| 50/50 [01:57<00:00,  2.35s/it]

page 4 crawling...

100%|██████████| 50/50 [02:37<00:00,  3.15s/it]

Completed!
driver.close()
poster_crime.flush()
crime_df
code title_kor title_eng year rating rank link genre
0 35901 살인의 추억 Memories Of Murder 2003 9.40 15세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [범죄, 미스터리, 스릴러, 코미디, 드라마]
1 17170 레옹 Leon 1994 9.37 청소년 관람불가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [범죄, 액션, 드라마]
2 29657 프리퀀시 Frequency 2000 9.32 12세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [범죄, 드라마, SF, 스릴러]
3 51462 그랜 토리노 Gran Torino 2008 9.23 12세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [범죄, 드라마]
4 10561 대부 3 Mario Puzo's The Godfather Part III 1990 9.21 청소년 관람불가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [범죄, 드라마]
... ... ... ... ... ... ... ... ...
195 43568 모노폴리 Monopoly 2006 6.42 15세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [범죄, 스릴러]
196 69742 킬 위드 미 Untraceable 2008 6.36 청소년 관람불가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [범죄, 스릴러]
197 154112 불한당: 나쁜 놈들의 세상 The Merciless 2016 6.35 청소년 관람불가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [범죄, 액션, 드라마]
198 157297 마약왕 THE DRUG KING 2017 6.33 청소년 관람불가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [범죄, 드라마]
199 65340 쏘우 4 Saw IV 2007 6.32 청소년 관람불가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [범죄, 스릴러, 공포]

200 rows × 8 columns
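
The crawl cells for each genre differ only in the genre code passed to `rank_url` and in the variable names. One way to make that explicit is to generate all ranking-page URLs from a single mapping, as sketched below. The URL template and the numeric codes are the ones listed earlier in this notebook; the English dictionary keys and the `ranking_pages` helper are hypothetical names added for illustration.

```python
RANK_URL = ('https://movie.naver.com/movie/sdb/rank/rmovie.nhn'
            '?sel=pnt&date=20200616&tg={genre}&page={page}')

# Genre codes from the "Scraping Order" list; the English keys are assumed labels
GENRE_CODES = {'drama': 1, 'horror': 4, 'romance': 5, 'comedy': 11,
               'animation': 15, 'crime': 16, 'action': 19}

def ranking_pages(genre_codes=GENRE_CODES, n_pages=4):
    """Yield (genre_name, page, url) for every ranking page to crawl."""
    for name, code in genre_codes.items():
        for page in range(1, n_pages + 1):
            yield name, page, RANK_URL.format(genre=code, page=page)

pages = list(ranking_pages())
print(len(pages))                                          # 7 genres x 4 pages
print([p for p in pages if p[0] == 'crime'][0][2])
```

A single loop over `ranking_pages()` could then replace the seven near-identical cells, with the genre name used to pick the output file.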

crime_df.to_csv('data/crime_df.csv', index=False)
np.save('images/poster_crime.npy', poster_crime)
poster_cr = np.load('images/poster_crime.npy')
print(poster_cr.shape)
poster_cr
(200, 256, 256, 3)





array([[[[ 96, 115, 119],
         [ 95, 114, 118],
         [ 94, 114, 118],
         ...,
         [ 67,  85,  89],
         [ 67,  85,  89],
         [ 62,  80,  84]],

        [[ 95, 114, 118],
         [ 98, 116, 121],
         [ 99, 119, 122],
         ...,
         [ 62,  80,  84],
         [ 66,  84,  88],
         [ 72,  90,  94]],

        [[ 99, 118, 122],
         [ 99, 118, 122],
         [ 99, 118, 122],
         ...,
         [ 73,  91,  95],
         [ 65,  83,  87],
         [ 66,  84,  88]],

        ...,

        [[  5,   7,   6],
         [  5,   7,   6],
         [  5,   7,   6],
         ...,
         [ 14,  23,  23],
         [ 12,  20,  20],
         [ 21,  30,  30]],

        [[  5,   7,   6],
         [  5,   7,   6],
         [  5,   7,   6],
         ...,
         [ 33,  42,  41],
         [ 22,  31,  30],
         [ 25,  34,  33]],

        [[  7,   9,   8],
         [  7,   9,   8],
         [  7,   9,   8],
         ...,
         [ 41,  49,  51],
         [ 44,  52,  53],
         [ 37,  45,  47]]],


       [[[ 19,  31,  58],
         [ 62,  74,  90],
         [120, 150, 182],
         ...,
         [123, 155, 188],
         [120, 145, 171],
         [ 87, 101, 119]],

        [[ 33,  41,  58],
         [ 79,  90, 107],
         [116, 151, 185],
         ...,
         [114, 148, 179],
         [106, 127, 147],
         [ 80,  89, 106]],

        [[ 39,  49,  64],
         [ 86, 100, 122],
         [116, 153, 188],
         ...,
         [ 95, 118, 145],
         [ 97, 113, 132],
         [ 80,  91, 107]],

        ...,

        [[128, 105,  76],
         [127, 103,  73],
         [131, 104,  75],
         ...,
         [ 95,  87,  71],
         [ 95,  88,  72],
         [ 94,  86,  71]],

        [[133, 104,  73],
         [135, 104,  70],
         [136, 107,  73],
         ...,
         [ 97,  87,  70],
         [ 94,  87,  71],
         [ 94,  87,  69]],

        [[137, 108,  73],
         [138, 107,  78],
         [140, 109,  78],
         ...,
         [ 96,  86,  70],
         [ 97,  88,  73],
         [ 94,  86,  68]]],


       [[[209, 151, 141],
         [207, 148, 139],
         [204, 143, 136],
         ...,
         [ 87,  46,  55],
         [ 98,  54,  44],
         [ 90,  42,  46]],

        [[223, 187, 186],
         [235, 215, 204],
         [232, 201, 187],
         ...,
         [143, 101,  66],
         [113,  75,  55],
         [123,  70,  53]],

        [[225, 192, 178],
         [231, 209, 194],
         [239, 222, 207],
         ...,
         [143, 100,  70],
         [ 90,  55,  44],
         [144,  80,  58]],

        ...,

        [[ 28,  15,  22],
         [ 25,  12,  19],
         [ 27,  14,  21],
         ...,
         [ 58,  47,  43],
         [ 60,  49,  45],
         [ 64,  54,  50]],

        [[ 30,  17,  24],
         [ 26,  13,  20],
         [ 32,  19,  26],
         ...,
         [ 56,  46,  42],
         [ 58,  47,  43],
         [ 61,  49,  46]],

        [[ 29,  16,  23],
         [ 30,  17,  24],
         [ 29,  16,  24],
         ...,
         [ 48,  40,  37],
         [ 51,  39,  37],
         [ 66,  51,  50]]],


       ...,


       [[[164, 127, 119],
         [119,  96,  98],
         [ 35,  37,  44],
         ...,
         [210, 221, 228],
         [209, 220, 225],
         [207, 216, 229]],

        [[148, 110, 103],
         [120,  92,  91],
         [ 45,  43,  47],
         ...,
         [213, 223, 232],
         [211, 221, 229],
         [209, 219, 230]],

        [[ 98,  70,  68],
         [ 90,  67,  68],
         [ 46,  41,  42],
         ...,
         [218, 228, 237],
         [214, 224, 233],
         [213, 223, 232]],

        ...,

        [[ 10,  18,  20],
         [ 10,  18,  20],
         [ 12,  20,  22],
         ...,
         [ 18,  25,  35],
         [ 18,  25,  35],
         [ 18,  25,  35]],

        [[ 10,  18,  20],
         [ 11,  19,  21],
         [ 12,  20,  22],
         ...,
         [ 18,  25,  35],
         [ 18,  25,  35],
         [ 18,  25,  35]],

        [[ 10,  18,  20],
         [ 10,  18,  20],
         [ 10,  18,  20],
         ...,
         [ 18,  25,  35],
         [ 18,  25,  35],
         [ 18,  25,  35]]],


       [[[ 84,  66,  41],
         [ 92,  75,  49],
         [104,  86,  54],
         ...,
         [157,  76,  35],
         [159,  77,  36],
         [160,  82,  40]],

        [[ 88,  68,  43],
         [ 96,  76,  49],
         [108,  88,  56],
         ...,
         [154,  70,  33],
         [157,  73,  35],
         [161,  80,  38]],

        [[ 88,  68,  43],
         [ 97,  77,  48],
         [112,  89,  58],
         ...,
         [157,  72,  34],
         [158,  71,  32],
         [160,  74,  36]],

        ...,

        [[ 15,  10,   7],
         [ 15,  10,   7],
         [ 14,  10,   7],
         ...,
         [  8,   4,   3],
         [  8,   4,   3],
         [  7,   3,   2]],

        [[ 16,  11,   8],
         [ 15,  10,   7],
         [ 13,   9,   7],
         ...,
         [  8,   4,   3],
         [  8,   4,   3],
         [  8,   4,   3]],

        [[ 15,  10,   7],
         [ 14,   9,   7],
         [ 14,  10,   8],
         ...,
         [  8,   4,   3],
         [  8,   4,   3],
         [  8,   4,   3]]],


       [[[227, 222, 219],
         [227, 222, 219],
         [227, 222, 219],
         ...,
         [188, 179, 171],
         [192, 183, 174],
         [191, 182, 173]],

        [[227, 222, 219],
         [227, 222, 219],
         [227, 222, 219],
         ...,
         [186, 178, 167],
         [188, 180, 170],
         [189, 180, 171]],

        [[227, 222, 219],
         [227, 222, 219],
         [227, 222, 219],
         ...,
         [181, 173, 162],
         [183, 174, 164],
         [185, 176, 167]],

        ...,

        [[252, 250, 251],
         [250, 248, 249],
         [248, 246, 247],
         ...,
         [  2,   2,   2],
         [  2,   2,   2],
         [  3,   3,   3]],

        [[253, 252, 252],
         [252, 250, 251],
         [250, 248, 249],
         ...,
         [  3,   3,   3],
         [  4,   4,   4],
         [  3,   3,   3]],

        [[254, 254, 254],
         [253, 252, 253],
         [253, 251, 252],
         ...,
         [  4,   4,   4],
         [  6,   6,   6],
         [  5,   5,   5]]]], dtype=uint8)

g. Action

driver = webdriver.Chrome(executable_path='/Users/ohhyunkwon/Documents/2020 study/etc/chromedriver')
driver.get(login_url)
time.sleep(1)
driver.execute_script("document.getElementsByName('id')[0].value=\'" + id_key + "\'")
driver.execute_script("document.getElementsByName('pw')[0].value=\'" + pw_key + "\'")
driver.find_element_by_xpath('//*[@id="frmNIDLogin"]/fieldset/input').click()
time.sleep(1)
driver.find_element_by_xpath('//*[@id="new.dontsave"]').click()
action_df = pd.DataFrame(columns=['code', 'title_kor', 'title_eng', 'year',
                                  'rating', 'rank', 'link', 'genre'])
action_df
code title_kor title_eng year rating rank link genre
poster_action = np.memmap('images/poster_action', dtype=np.uint8, mode='w+', shape=(200, 256, 256, 3))
# Crawl 4 ranking pages (50 movies per page)
for l in range(4):
    print('page ' + str(l + 1) + ' crawling...')
    # Open a single ranking page
    driver.get(rank_url.format(genre=19, page=l + 1))
    html = driver.page_source
    time.sleep(1)
    soup = BeautifulSoup(html, 'lxml')
    
    for k in tqdm(range(len(soup.find('table', class_='list_ranking').find_all('div', class_='tit5')))):
        # link
        link = basic_url + soup.find('table', class_='list_ranking').find_all('div', class_='tit5')[k].a.get('href')
        
        # rating
        rating = soup.find_all('td', class_='point')[k].text
        
        # code
        code_start = re.search('code=', link).span()[1]
        code = link[code_start:]
        
        # Open the movie's detail page
        driver.get(link)
        link_html = driver.page_source
        time.sleep(1)
        link_soup = BeautifulSoup(link_html, 'lxml')
        
        # Movie info block
        movie_info = link_soup.find('div', class_='mv_info')
        
        # title_kor
        title_kor = movie_info.h3.a.text
        
        # title_eng
        title_eng = movie_info.strong.text.split(',')[0].strip()
        
        # year
        year = movie_info.strong.text.split(',')[-1].strip()
        
        # rank
        try:
            rank = movie_info.find('p').find_all('span')[4].find_all('a')[0].text
        except (AttributeError, IndexError):
            # No age rating listed on the detail page
            rank = np.nan

        # genre
        genre = [g.strip() for g in movie_info.find('p', class_='info_spec').span.text.split(',')]
        
        # poster
        poster_thumbnail = link_soup.find('div', class_='poster').a.img.get('src')
        poster_end = re.search(r'\?', poster_thumbnail).span()[0]
        poster_link = poster_thumbnail[:poster_end]
        poster_index = l * 50 + k
        # Download the poster once and reuse the decoded image
        im = Image.open(request.urlopen(poster_link))
        # A poster served as a GIF decodes to a 2-D array, so rebuild it with 3 channels
        if np.asarray(im).ndim == 2:
            im = im.resize((256, 256))
            background = Image.new('RGB', im.size, (255, 255, 255))
            background.paste(im)
            poster_array = np.asarray(background)
        else:
            poster_array = np.asarray(im.resize((256, 256)))[:, :, :3]
        poster_action[poster_index, :, :, :] = poster_array
        
        # Append the record to the DataFrame
        action_df = action_df.append({'code': code, 'title_kor': title_kor, 'title_eng': title_eng,
                                      'year': year, 'rating': rating, 'rank': rank, 'link': link,
                                      'genre': genre}, ignore_index=True)
print('Completed!')
page 1 crawling...

100%|██████████| 50/50 [03:05<00:00,  3.71s/it]

page 2 crawling...

100%|██████████| 50/50 [03:25<00:00,  4.11s/it]

page 3 crawling...

100%|██████████| 50/50 [03:28<00:00,  4.17s/it]

page 4 crawling...

100%|██████████| 50/50 [03:23<00:00,  4.08s/it]

Completed!
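
The loop above slices the movie code out of the link with a regex, cuts the poster URL at its first `?`, and maps each (page, row) pair to a row of the poster array. As a rough sketch, the same three steps can be written with the standard library only. The helper names and both URLs below are hypothetical and only there to exercise the functions; they are not taken from the site.

```python
from urllib.parse import urlsplit, parse_qs


def movie_code(link):
    """Return the value of the `code` query parameter of a detail-page link."""
    return parse_qs(urlsplit(link).query)['code'][0]


def strip_query(url):
    """Drop everything from the first '?' onward (no error if '?' is absent)."""
    return url.split('?', 1)[0]


def poster_index(page, row, per_page=50):
    """Map a zero-based (page, row) pair to a row index of the poster array."""
    return page * per_page + row


# Hypothetical URLs, used only to exercise the helpers
example_link = 'https://example.com/movie/basic.nhn?code=12345'
example_thumb = 'https://example.com/poster/12345.jpg?type=m203'

print(movie_code(example_link))
print(strip_query(example_thumb))
print(poster_index(3, 49))
```

Unlike `re.search(...).span()`, `str.split` does not raise when the URL has no query string, which may matter if some poster links come without one.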
driver.close()
poster_action.flush()
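
The crawl handles posters that decode to a 2-D array by pasting them onto an RGB canvas, and drops the fourth channel from everything else. A possible alternative, shown here only as a sketch and not as the method used in this notebook, is to let Pillow's `Image.convert('RGB')` handle every mode in one place. The helper name `to_rgb_array` and the synthetic test images are assumptions made for illustration.

```python
import numpy as np
from PIL import Image


def to_rgb_array(im, size=(256, 256)):
    """Convert a PIL image to RGB, resize it, and return a uint8 array."""
    return np.asarray(im.convert('RGB').resize(size))


# Synthetic stand-ins for a palette (GIF-like) poster and an RGBA poster
palette_im = Image.new('P', (120, 170))
rgba_im = Image.new('RGBA', (120, 170), (255, 0, 0, 128))

for im in (palette_im, rgba_im):
    arr = to_rgb_array(im)
    print(im.mode, arr.shape, arr.dtype)
```

Converting before resizing means the resize runs on RGB pixels rather than on palette indices, so the pixel values can differ slightly from what the loop above stores.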
action_df
code title_kor title_eng year rating rank link genre
0 181710 포드 V 페라리 FORD v FERRARI 2019 9.49 12세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [액션, 드라마]
1 29217 글래디에이터 Gladiator 2000 9.39 15세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [액션, 드라마]
2 136900 어벤져스: 엔드게임 Avengers: Endgame 2019 9.38 12세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [액션, SF]
3 92125 헌터 킬러 Hunter Killer 2018 9.37 15세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [액션, 스릴러]
4 37886 클레멘타인 Clementine 2004 9.35 15세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [액션, 드라마]
... ... ... ... ... ... ... ... ...
195 41334 옹박 - 두번째 미션 The Protector 2005 8.28 15세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [액션, 범죄, 드라마, 스릴러]
196 162249 램페이지 RAMPAGE 2018 8.28 12세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [액션, 모험]
197 82473 캐리비안의 해적: 죽은 자는 말이 없다 Pirates of the Caribbean: Dead Men Tell No Tales 2017 8.28 12세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [액션, 모험, 코미디, 판타지]
198 31606 킬러들의 수다 Guns & Talks 2001 8.28 15세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [액션, 드라마, 코미디]
199 51082 7급 공무원 7th Grade Civil Servant 2009 8.28 12세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [액션, 코미디]

200 rows × 8 columns

action_df.to_csv('data/action_df.csv', index=False)
np.save('images/poster_action.npy', poster_action)
poster_ac = np.load('images/poster_action.npy')
print(poster_ac.shape)
poster_ac
(200, 256, 256, 3)





array([[[[171, 192, 215],
         [171, 193, 215],
         [171, 193, 215],
         ...,
         [183, 179, 175],
         [175, 172, 167],
         [170, 167, 162]],

        [[172, 194, 217],
         [172, 193, 216],
         [171, 192, 216],
         ...,
         [180, 177, 172],
         [182, 179, 174],
         [185, 182, 177]],

        [[167, 191, 213],
         [169, 192, 215],
         [168, 192, 215],
         ...,
         [207, 203, 199],
         [209, 205, 201],
         [211, 207, 202]],

        ...,

        [[ 65,  56,  47],
         [ 65,  56,  47],
         [ 67,  58,  49],
         ...,
         [ 17,  18,  21],
         [ 16,  17,  19],
         [ 14,  15,  18]],

        [[ 65,  56,  47],
         [ 66,  57,  48],
         [ 65,  56,  47],
         ...,
         [ 17,  18,  21],
         [ 16,  17,  20],
         [ 15,  16,  19]],

        [[ 63,  55,  46],
         [ 64,  55,  46],
         [ 64,  55,  46],
         ...,
         [ 17,  18,  20],
         [ 15,  16,  19],
         [ 16,  17,  21]]],


       [[[249, 255, 252],
         [238, 242, 239],
         [231, 232, 230],
         ...,
         [228, 228, 228],
         [240, 240, 240],
         [255, 255, 255]],

        [[255, 255, 255],
         [145, 144, 142],
         [ 55,  50,  49],
         ...,
         [ 30,  30,  30],
         [135, 135, 135],
         [255, 255, 255]],

        [[255, 255, 255],
         [135, 129, 129],
         [ 31,  21,  22],
         ...,
         [  0,   0,   0],
         [119, 119, 119],
         [255, 255, 255]],

        ...,

        [[255, 255, 255],
         [117, 117, 117],
         [  0,   0,   0],
         ...,
         [  0,   0,   0],
         [117, 117, 117],
         [255, 255, 255]],

        [[255, 255, 255],
         [134, 134, 134],
         [ 29,  29,  29],
         ...,
         [ 29,  29,  29],
         [133, 133, 133],
         [255, 255, 255]],

        [[255, 255, 255],
         [238, 238, 238],
         [228, 228, 228],
         ...,
         [227, 227, 227],
         [239, 239, 239],
         [255, 255, 255]]],


       [[[  6,   4,  19],
         [  5,   4,  12],
         [  4,   3,  11],
         ...,
         [  1,   1,   1],
         [  1,   1,   1],
         [  1,   1,   1]],

        [[ 24,  18,  58],
         [  6,   4,  18],
         [  5,   4,  11],
         ...,
         [  1,   1,   1],
         [  1,   1,   1],
         [  1,   1,   1]],

        [[ 38,  30,  89],
         [ 15,  11,  42],
         [  5,   4,  14],
         ...,
         [  1,   1,   1],
         [  1,   1,   1],
         [  1,   1,   1]],

        ...,

        [[  1,   1,   2],
         [  1,   1,   2],
         [  1,   1,   1],
         ...,
         [  4,   3,   9],
         [  2,   1,   6],
         [  2,   2,   5]],

        [[  1,   1,   2],
         [  1,   1,   1],
         [  1,   1,   1],
         ...,
         [  4,   3,   9],
         [  2,   2,   7],
         [  2,   1,   5]],

        [[  2,   1,   4],
         [  1,   1,   1],
         [  1,   1,   1],
         ...,
         [  3,   2,   7],
         [  2,   1,   5],
         [  1,   1,   3]]],


       ...,


       [[[107, 147, 155],
         [106, 146, 153],
         [107, 146, 154],
         ...,
         [214, 222, 204],
         [214, 223, 206],
         [213, 222, 204]],

        [[106, 146, 153],
         [106, 147, 154],
         [105, 147, 153],
         ...,
         [215, 222, 204],
         [214, 222, 206],
         [216, 223, 204]],

        [[106, 146, 154],
         [107, 147, 154],
         [104, 146, 152],
         ...,
         [215, 222, 204],
         [215, 222, 205],
         [210, 220, 204]],

        ...,

        [[ 16,  24,  34],
         [ 16,  23,  32],
         [ 16,  23,  32],
         ...,
         [ 16,  21,  28],
         [ 16,  20,  29],
         [ 16,  20,  29]],

        [[ 18,  23,  34],
         [ 17,  23,  34],
         [ 17,  23,  32],
         ...,
         [ 17,  21,  30],
         [ 14,  20,  30],
         [ 15,  20,  30]],

        [[ 18,  22,  33],
         [ 18,  23,  35],
         [ 17,  22,  33],
         ...,
         [ 20,  22,  30],
         [ 45,  29,  30],
         [ 28,  22,  29]]],


       [[[237, 254, 248],
         [250, 250, 250],
         [255, 249, 253],
         ...,
         [252, 250, 250],
         [254, 250, 250],
         [254, 252, 254]],

        [[242, 254, 249],
         [252, 251, 251],
         [244, 243, 245],
         ...,
         [236, 244, 239],
         [253, 253, 253],
         [254, 252, 255]],

        [[248, 253, 251],
         [255, 253, 254],
         [226, 232, 232],
         ...,
         [163, 172, 164],
         [238, 239, 241],
         [254, 253, 255]],

        ...,

        [[252, 252, 252],
         [255, 255, 255],
         [188, 188, 188],
         ...,
         [113, 113, 113],
         [220, 220, 220],
         [255, 255, 255]],

        [[251, 251, 251],
         [251, 251, 251],
         [249, 249, 249],
         ...,
         [152, 152, 152],
         [225, 225, 225],
         [255, 255, 255]],

        [[252, 252, 252],
         [253, 253, 253],
         [254, 254, 254],
         ...,
         [252, 252, 252],
         [252, 252, 252],
         [255, 255, 255]]],


       [[[255, 255, 255],
         [255, 255, 255],
         [255, 255, 255],
         ...,
         [179,  47,  35],
         [179,  47,  35],
         [179,  47,  35]],

        [[255, 255, 255],
         [255, 255, 255],
         [255, 255, 255],
         ...,
         [180,  48,  33],
         [179,  47,  33],
         [179,  47,  33]],

        [[255, 255, 255],
         [255, 255, 255],
         [255, 255, 255],
         ...,
         [181,  49,  32],
         [181,  49,  35],
         [181,  49,  35]],

        ...,

        [[ 96,   4,   8],
         [ 58,   2,   1],
         [ 15,   1,   0],
         ...,
         [136,  27,  33],
         [138,  27,  33],
         [138,  27,  33]],

        [[ 82,   4,   4],
         [ 32,   1,   2],
         [ 12,   2,   3],
         ...,
         [136,  27,  32],
         [138,  27,  33],
         [138,  27,  33]],

        [[ 52,   1,   3],
         [ 10,   1,   2],
         [ 38,   4,   6],
         ...,
         [138,  27,  33],
         [137,  26,  32],
         [138,  27,  33]]]], dtype=uint8)
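
Each genre writes its posters into an `np.memmap`, flushes it, and then saves a `.npy` copy. The raw memmap file holds only bytes, so the shape and dtype must be supplied again to reopen it, whereas the `.npy` file records both. A minimal sketch of that round trip, using a small stand-in shape and a temporary directory (both assumptions, not the notebook's actual paths):

```python
import os
import tempfile

import numpy as np

shape = (4, 8, 8, 3)  # small stand-in for (200, 256, 256, 3)
tmp_dir = tempfile.mkdtemp()
raw_path = os.path.join(tmp_dir, 'poster_demo')
npy_path = os.path.join(tmp_dir, 'poster_demo.npy')

# A memmap created with mode='w+' starts zero-filled and is backed by a file on disk
buf = np.memmap(raw_path, dtype=np.uint8, mode='w+', shape=shape)
buf[0] = 255   # write one "poster"
buf.flush()    # push pending changes to disk

# np.save stores shape and dtype in the file header, so np.load needs no extra arguments
np.save(npy_path, buf)
loaded = np.load(npy_path)
print(loaded.shape, loaded.dtype)

# Reopening the raw file requires repeating dtype and shape
reopened = np.memmap(raw_path, dtype=np.uint8, mode='r', shape=shape)
print(bool((reopened == loaded).all()))
```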

Total Dataframe

Data containing every variable except the posters

total_df = pd.DataFrame(columns=['code', 'title_kor', 'title_eng', 'year',
                                 'rating', 'rank', 'link', 'genre'])
total_df
code title_kor title_eng year rating rank link genre
total_df = total_df.append(drama_df).append(horror_df).append(romance_df)\
                   .append(comedy_df).append(animation_df).append(crime_df).append(action_df)
total_df = total_df.reset_index(drop=True)
total_df
code title_kor title_eng year rating rank link genre
0 171539 그린 북 Green Book 2018 9.59 12세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [드라마]
1 174830 가버나움 Capharnaum 2018 9.58 15세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [드라마]
2 151196 원더 Wonder 2017 9.49 전체 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [드라마]
3 169240 아일라 Ayla: The Daughter of War 2017 9.48 15세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [드라마, 전쟁]
4 157243 당갈 Dangal 2016 9.47 12세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [드라마, 액션]
... ... ... ... ... ... ... ... ...
1395 41334 옹박 - 두번째 미션 The Protector 2005 8.28 15세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [액션, 범죄, 드라마, 스릴러]
1396 162249 램페이지 RAMPAGE 2018 8.28 12세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [액션, 모험]
1397 82473 캐리비안의 해적: 죽은 자는 말이 없다 Pirates of the Caribbean: Dead Men Tell No Tales 2017 8.28 12세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [액션, 모험, 코미디, 판타지]
1398 31606 킬러들의 수다 Guns & Talks 2001 8.28 15세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [액션, 드라마, 코미디]
1399 51082 7급 공무원 7th Grade Civil Servant 2009 8.28 12세 관람가 https://movie.naver.com/movie/bi/mi/basic.nhn?... [액션, 코미디]

1400 rows × 8 columns

total_df.to_csv('data/total_df.csv', index=False)
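
The chained `DataFrame.append` calls above work on the pandas version used here, but `append` was deprecated in pandas 1.4 and removed in 2.0. On newer versions the same stacking can be done with `pd.concat`. The sketch below uses small synthetic frames (the `demo_frame` helper and its values are made up for illustration) rather than the scraped data.

```python
import pandas as pd

columns = ['code', 'title_kor', 'title_eng', 'year',
           'rating', 'rank', 'link', 'genre']


def demo_frame(prefix, n=2):
    """Build a tiny placeholder frame with the same columns as the genre frames."""
    rows = [{c: f'{prefix}_{c}_{i}' for c in columns} for i in range(n)]
    return pd.DataFrame(rows, columns=columns)


frames = [demo_frame('drama'), demo_frame('horror'), demo_frame('action')]

# ignore_index=True renumbers the rows 0..N-1, like reset_index(drop=True)
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
print(combined.index.tolist())
```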

Poster data

poster_total = np.concatenate([poster_d, poster_h, poster_r, poster_c, poster_a, poster_cr, poster_ac], axis=0)
print(poster_total.shape)
poster_total
(1400, 256, 256, 3)





array([[[[  0, 100, 116],
         [  0, 100, 116],
         [  0, 100, 117],
         ...,
         [  1, 115, 141],
         [  0, 116, 141],
         [  1, 117, 142]],

        [[  0, 100, 116],
         [  0, 100, 115],
         [  0, 101, 118],
         ...,
         [  0, 115, 141],
         [  0, 116, 141],
         [  0, 116, 141]],

        [[  0, 101, 116],
         [  0, 101, 116],
         [  0, 101, 118],
         ...,
         [  1, 117, 142],
         [  1, 117, 142],
         [  1, 117, 142]],

        ...,

        [[  0, 106, 130],
         [  1, 107, 131],
         [  0, 108, 132],
         ...,
         [  0, 105, 127],
         [  0, 105, 127],
         [  0, 105, 127]],

        [[  0, 106, 130],
         [  0, 106, 130],
         [  1, 107, 131],
         ...,
         [  1, 104, 127],
         [  1, 104, 126],
         [  1, 104, 127]],

        [[  0, 106, 130],
         [  0, 106, 130],
         [  1, 107, 131],
         ...,
         [  1, 104, 127],
         [  1, 104, 127],
         [  2, 104, 127]]],


       [[[227, 221, 218],
         [231, 231, 227],
         [237, 238, 239],
         ...,
         [ 87,  56,  93],
         [ 89,  58,  94],
         [ 89,  57,  92]],

        [[231, 229, 226],
         [232, 231, 227],
         [237, 237, 236],
         ...,
         [ 89,  58,  95],
         [ 88,  58,  93],
         [ 89,  58,  94]],

        [[236, 235, 233],
         [236, 236, 235],
         [237, 237, 236],
         ...,
         [ 88,  58,  96],
         [ 88,  58,  95],
         [ 89,  58,  94]],

        ...,

        [[104, 129, 151],
         [105, 128, 150],
         [106, 130, 152],
         ...,
         [125, 125,  84],
         [123, 124,  94],
         [ 97, 104, 102]],

        [[103, 129, 151],
         [102, 128, 150],
         [104, 129, 152],
         ...,
         [115, 118,  87],
         [111, 115,  91],
         [ 87,  96,  99]],

        [[100, 129, 149],
         [100, 128, 149],
         [103, 128, 150],
         ...,
         [106, 112,  85],
         [ 97, 106,  86],
         [ 79,  85,  93]]],


       [[[242, 225, 217],
         [241, 223, 213],
         [241, 222, 212],
         ...,
         [250, 250, 231],
         [253, 252, 248],
         [253, 253, 252]],

        [[252, 252, 252],
         [255, 255, 255],
         [255, 255, 255],
         ...,
         [255, 255, 255],
         [255, 255, 255],
         [255, 255, 255]],

        [[247, 234, 229],
         [255, 255, 255],
         [255, 255, 255],
         ...,
         [255, 255, 255],
         [255, 255, 255],
         [255, 255, 255]],

        ...,

        [[100,  97,  86],
         [ 98,  95,  85],
         [ 96,  95,  85],
         ...,
         [114, 102,  86],
         [111,  98,  79],
         [114,  97,  77]],

        [[100,  99,  90],
         [ 89,  88,  79],
         [ 76,  73,  65],
         ...,
         [114, 100,  81],
         [118, 102,  83],
         [119, 100,  80]],

        [[ 90,  89,  83],
         [ 89,  86,  80],
         [ 88,  83,  76],
         ...,
         [102,  90,  73],
         [104,  92,  76],
         [103,  88,  74]]],


       ...,


       [[[107, 147, 155],
         [106, 146, 153],
         [107, 146, 154],
         ...,
         [214, 222, 204],
         [214, 223, 206],
         [213, 222, 204]],

        [[106, 146, 153],
         [106, 147, 154],
         [105, 147, 153],
         ...,
         [215, 222, 204],
         [214, 222, 206],
         [216, 223, 204]],

        [[106, 146, 154],
         [107, 147, 154],
         [104, 146, 152],
         ...,
         [215, 222, 204],
         [215, 222, 205],
         [210, 220, 204]],

        ...,

        [[ 16,  24,  34],
         [ 16,  23,  32],
         [ 16,  23,  32],
         ...,
         [ 16,  21,  28],
         [ 16,  20,  29],
         [ 16,  20,  29]],

        [[ 18,  23,  34],
         [ 17,  23,  34],
         [ 17,  23,  32],
         ...,
         [ 17,  21,  30],
         [ 14,  20,  30],
         [ 15,  20,  30]],

        [[ 18,  22,  33],
         [ 18,  23,  35],
         [ 17,  22,  33],
         ...,
         [ 20,  22,  30],
         [ 45,  29,  30],
         [ 28,  22,  29]]],


       [[[237, 254, 248],
         [250, 250, 250],
         [255, 249, 253],
         ...,
         [252, 250, 250],
         [254, 250, 250],
         [254, 252, 254]],

        [[242, 254, 249],
         [252, 251, 251],
         [244, 243, 245],
         ...,
         [236, 244, 239],
         [253, 253, 253],
         [254, 252, 255]],

        [[248, 253, 251],
         [255, 253, 254],
         [226, 232, 232],
         ...,
         [163, 172, 164],
         [238, 239, 241],
         [254, 253, 255]],

        ...,

        [[252, 252, 252],
         [255, 255, 255],
         [188, 188, 188],
         ...,
         [113, 113, 113],
         [220, 220, 220],
         [255, 255, 255]],

        [[251, 251, 251],
         [251, 251, 251],
         [249, 249, 249],
         ...,
         [152, 152, 152],
         [225, 225, 225],
         [255, 255, 255]],

        [[252, 252, 252],
         [253, 253, 253],
         [254, 254, 254],
         ...,
         [252, 252, 252],
         [252, 252, 252],
         [255, 255, 255]]],


       [[[255, 255, 255],
         [255, 255, 255],
         [255, 255, 255],
         ...,
         [179,  47,  35],
         [179,  47,  35],
         [179,  47,  35]],

        [[255, 255, 255],
         [255, 255, 255],
         [255, 255, 255],
         ...,
         [180,  48,  33],
         [179,  47,  33],
         [179,  47,  33]],

        [[255, 255, 255],
         [255, 255, 255],
         [255, 255, 255],
         ...,
         [181,  49,  32],
         [181,  49,  35],
         [181,  49,  35]],

        ...,

        [[ 96,   4,   8],
         [ 58,   2,   1],
         [ 15,   1,   0],
         ...,
         [136,  27,  33],
         [138,  27,  33],
         [138,  27,  33]],

        [[ 82,   4,   4],
         [ 32,   1,   2],
         [ 12,   2,   3],
         ...,
         [136,  27,  32],
         [138,  27,  33],
         [138,  27,  33]],

        [[ 52,   1,   3],
         [ 10,   1,   2],
         [ 38,   4,   6],
         ...,
         [138,  27,  33],
         [137,  26,  32],
         [138,  27,  33]]]], dtype=uint8)
np.save('images/poster_total.npy', poster_total)
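
For a sense of scale, the size of the saved arrays follows directly from their shapes: a `uint8` value takes one byte, so the element count equals the byte count. A quick worked calculation (the `.npy` file adds a small header on top of these figures):

```python
n_images, height, width, channels = 1400, 256, 256, 3
bytes_per_value = 1  # uint8

# Combined array: 1400 posters
total_bytes = n_images * height * width * channels * bytes_per_value
# One genre: 200 posters
per_genre_bytes = 200 * height * width * channels * bytes_per_value

print(total_bytes, total_bytes / 2**20)          # bytes, MiB
print(per_genre_bytes, per_genre_bytes / 2**20)  # bytes, MiB
```

This gives 275,251,200 bytes (262.5 MiB) for the combined array and 39,321,600 bytes (37.5 MiB) per genre.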