250314 Online lecture: visualization (plotly, folium) and data analysis practice (crawling, preprocessing, retention, RFM)

by minimin227 2025. 3. 15. 00:44

Part 2. Data Analysis with Python

Ch 02. Data Visualization

03. Understanding plotly

A library for drawing interactive graphs.

import plotly.express as px

Install the module with pip install plotly.

Basic syntax

fig = px.chart_type(data_frame=data, x=x_column, y=y_column, color=legend_column, title=title,
                 labels=dict(x_column=x_label, y_column=y_label),
                 width=figure_width, height=figure_height, text_auto=True/False)
fig.show()

fig = px.bar(data_frame=df_groupby1, x='island', y='body_mass_g', color='sex', barmode='group', text_auto='.0d',
    width=700, height=500, title='Mean body mass by island',
    labels=dict(body_mass_g='Body mass (g)', island='', sex='Sex'))
fig.show()

Styling

template=template_name
color_discrete_sequence=colormap_name  # categorical data
color_continuous_scale=colormap_name  # continuous data

Applying templates
- Use a for loop to draw the same graph with several templates, one after another

for temp in ['ggplot2', 'seaborn', 'simple_white', 'plotly', 'plotly_white', 'plotly_dark']:
    fig = px.bar(data_frame=df_groupby1, x='island', y='body_mass_g', color='sex', barmode='group', text_auto='.0d', width=700, height=500, title=f'Template: {temp}', labels=dict(body_mass_g='Body mass (g)', island='', sex='Sex'), template=temp)
    fig.show()

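Applying colormaps
Show the swatches of the sequential module as continuous color bars; dropping the _continuous suffix (swatches()) shows them as discrete swatches.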

fig = px.colors.sequential.swatches_continuous()
fig.show()

Swatches of the qualitative module (discrete)

fig = px.colors.qualitative.swatches()
fig.show()

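color_discrete_sequence=color_map : use when loading a discrete colormap
color_continuous_scale=color_map : use for a continuous colormap

Choose between them according to whether the variable that drives the color is continuous or discrete.

Saving as an HTML file

fig.write_html(file path and file name)
If only a file name is given, the file is written to the current working directory of the running Python process.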

04. Drawing each chart type with plotly

Scatter plot

px.scatter(data_frame=data, x=x_column, y=y_column, color=color_column, trendline='ols') # trendline adds a fitted trend line

fig = px.scatter(data_frame=penguins, x='bill_length_mm', y='bill_depth_mm', color='sex', facet_col='island'
, color_discrete_sequence=px.colors.qualitative.Set2, template='plotly_white')
fig.show()

fig = px.scatter(data_frame=penguins, x='bill_length_mm', y='bill_depth_mm', color='sex', facet_col='island', trendline='ols'
, color_discrete_sequence=px.colors.qualitative.Set2, template='plotly_white')
fig.show()

To add a trendline, install the module with pip install statsmodels.

Histogram

px.histogram(data_frame=data, x=x_column, color=color_column) # histogram

Box plot

px.box(data_frame=data, x=x_column, y=y_column, color=color_column) # box plot

fig = px.box(data_frame=penguins, x='body_mass_g', y='species', color='sex'
, color_discrete_sequence=px.colors.qualitative.Set2, template='plotly_white')
fig.show()

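Bar chart (bar plot)

px.bar(data_frame=data, x=x_column, y=y_column, color=color_column, barmode='group') # bars are stacked by default; add barmode='group' to place them side by side
#groupbygraph

# load the titanic data and take the mean of 'survived' grouped by 'sex' and 'class'

import seaborn as sns  # needed for sns.load_dataset

titanic = sns.load_dataset('titanic')
titanic_groupby = titanic.groupby(['sex','class'])[['survived']].mean().reset_index()

# for a frame built with groupby, .reset_index() has to turn 'sex' and 'class' back into columns before the graph can be drawn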

fig = px.bar(data_frame=titanic_groupby, x='class', y='survived', color='sex'
, color_discrete_sequence=px.colors.qualitative.Set2, template='plotly_white')
fig.show()
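
The note about reset_index can be checked on a tiny frame. The sketch below uses made-up rows in place of the titanic data and only shows where the group keys end up before and after reset_index; it does not draw anything.

```python
import pandas as pd

# Toy rows standing in for the titanic dataset (values are made up).
toy = pd.DataFrame({
    "sex": ["male", "male", "female", "female"],
    "class": ["First", "Third", "First", "Third"],
    "survived": [1, 0, 1, 1],
})

grouped = toy.groupby(["sex", "class"])[["survived"]].mean()
# The group keys sit in the index, so 'survived' is the only real column.
print(list(grouped.columns))

flat = grouped.reset_index()
# After reset_index the keys are ordinary columns that x= and color= can name.
print(list(flat.columns))
```

Passing flat (rather than grouped) to px.bar is what lets x='class' and color='sex' be resolved by column name.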

barmode='group'

![[IMG-130716.png]]

text_auto='.2f'

# add 'alone' to the grouping keys
titanic_groupby1 = titanic.groupby(['sex','class','alone'])[['survived']].mean().reset_index()
titanic_groupby1

facet_col='alone'

Line chart (line plot)

px.line(data_frame=data, x=x_column, y=y_column, color=color_column)

flights = sns.load_dataset("flights")

may_flights = flights.query('month == "May"')
fig = px.line(data_frame=may_flights, x="year", y="passengers"
, color_discrete_sequence=px.colors.qualitative.Set2, template='plotly_white')
fig.show()

fig = px.line(data_frame=flights, x="year", y="passengers", color='month'
, color_discrete_sequence=px.colors.qualitative.Set2, template='plotly_white')
fig.show()

Heatmap

px.imshow(data, text_auto=text_format, color_continuous_scale=colormap)

titanic_corr = titanic[['survived','age','fare','sibsp','pclass']].corr()

fig = px.imshow(titanic_corr, text_auto='.2f', color_continuous_scale='YlOrBr'
,width=500)
fig.show()

titanic_pivot = pd.pivot_table(data=titanic, index='sex', columns='class', values='survived', aggfunc='mean')

fig = px.imshow(titanic_pivot, text_auto='.2f', color_continuous_scale='Purples', width=500)
fig.show()

Pie chart

px.pie(data_frame=data, values=value_column, names=label_column)

# tips, a sample dataset bundled with plotly
df = px.data.tips()

fig = px.pie(df, values='tip', names='day', color_discrete_sequence=px.colors.qualitative.Pastel, width=500)
fig.show()

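05. Understanding folium

Add the module with pip install folium.
import folium

Visualizing a map of a specific place

Search for the location in Naver, Kakao, or Google Maps and copy its URL with the share button.
Paste the copied URL into the linked page and read off the latitude and longitude.

f = folium.Figure(width=width, height=height)
m = folium.Map(location=[latitude, longitude], zoom_start=zoom_level).add_to(f)
m.save('test.html')  # save the map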
m

f = folium.Figure(width=700, height=500)
m = folium.Map(location=[37.510781008592716, 127.09607026177875], zoom_start=16).add_to(f)
m

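Adding markers

# marker that pins a place
folium.Marker([latitude, longitude]
                , tooltip=text shown on mouse hover
                , popup=text shown on click
                , icon=folium.Icon(color=color, icon=icon_name)).add_to(map_object)
# circle-shaped marker
folium.CircleMarker([latitude, longitude]
                , radius=radius
                , color=color).add_to(map_object)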

folium.CircleMarker([37.510781008592716, 127.09607026177875]
              , color = 'red'
              , radius = 50).add_to(m)
m

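Visualizing geographic data with folium

Plot the locations of Ediya and A Twosome Place stores in Seoul and compare the geographic distribution of the two cafe chains.

import json
import folium
import pandas as pd

Loading the Seoul district boundary data

Seoul district (gu) boundary data

geo_path = 'data/seoul_municipalities_geo_simple.json'
with open(geo_path, encoding='utf-8') as fp:  # the file is closed once it has been read
    geo_json = json.load(fp)

Seoul store information data

df = pd.read_csv('data/소상공인시장진흥공단_상가(상권)정보_서울_202306.csv', low_memory=False)

Preprocessing and EDA per cafe chain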

[[250312 온라인강의#특정 조건을 충족하는 데이터 추출하기]]

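# keep only cafes from the full data and store them in cafe
cafe = df.query('상권업종소분류명 == "카페"')

# rows whose store name (상호명) contains '이디야' (Ediya) go into ediya
# na=False treats a missing store name as "no match" instead of raising an error
ediya = cafe.loc[cafe['상호명'].str.contains('이디야', na=False),]
# rows whose store name contains '투썸플레이스' (A Twosome Place) go into twosome
twosome = cafe.loc[cafe['상호명'].str.contains('투썸플레이스', na=False),]

# store the number of stores per district (시군구명) in ediya_count and twosome_count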

ediya_count = ediya.groupby('시군구명').size().to_frame().reset_index().rename({0:'count'}, axis=1).sort_values('count', ascending=False)
ediya_count

twosome_count = twosome.groupby('시군구명').size().to_frame().reset_index().rename({0:'count'}, axis=1).sort_values('count', ascending=False)

# .size(): returns a pandas Series holding the number of rows in each '시군구명' group
# .to_frame(): converts that Series to a DataFrame
# .reset_index(): turns the '시군구명' index into a column
# .rename({0:'count'}, axis=1): renames the default column 0 to count
# .sort_values('count', ascending=False): sorts by count in descending order
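
The chain described in the comments above can be traced on a handful of rows. This is a minimal sketch with invented store rows; the shortcut with reset_index(name=...) is an alternative that is not used in the lecture code.

```python
import pandas as pd

# Hypothetical store rows; '시군구명' mirrors the district column used above.
stores = pd.DataFrame({"시군구명": ["강남구", "강남구", "마포구", "강남구", "마포구", "종로구"]})

step_by_step = (
    stores.groupby("시군구명").size()   # Series: number of rows per district
    .to_frame()                          # DataFrame with a single column named 0
    .reset_index()                       # district name becomes a column
    .rename({0: "count"}, axis=1)        # give the count column a readable name
    .sort_values("count", ascending=False)
)

# Shortcut: name the count column while resetting the index.
shortcut = (
    stores.groupby("시군구명").size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
)
print(step_by_step)
```

Both versions produce the same two-column table; the longer chain just makes each intermediate shape visible.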

Visualizing on a map with folium

# create a map centered on Seoul with zoom 11 so that the whole city is visible

f = folium.Figure(width=700, height=500)
m = folium.Map(location=[37.566535, 126.9779692], zoom_start=11).add_to(f)
m

# fill the districts using the Seoul district boundary data

folium.Choropleth(geo_data = geo_json, fill_color = 'gray').add_to(m)
m

# color each district by its number of Ediya stores

f = folium.Figure(width=700, height=500)
m = folium.Map(location=[37.566535, 126.9779692], zoom_start=11).add_to(f)
folium.Choropleth(geo_data = geo_json
                  , data=ediya_count
                  , columns=['시군구명', 'count']
                  , key_on='properties.name' # in geo_json the district name is stored under properties.name (the data also has .code, .name_eng, .base_year)
                  , fill_color = 'YlGn'
                  , fill_opacity = 0.7
                  , line_opacity = 0.7
                  , legend_name = 'Number of Ediya stores by district in Seoul').add_to(m)
m

# color each district by its number of Twosome stores

f = folium.Figure(width=700, height=500)
m = folium.Map(location=[37.566535, 126.9779692], zoom_start=11).add_to(f)
folium.Choropleth(geo_data = geo_json
                  , data=twosome_count
                  , columns=['시군구명', 'count']
                  , key_on='properties.name'
                  , fill_color = 'BuPu'
                  , fill_opacity = 0.7
                  , line_opacity = 0.7
                  , legend_name = 'Number of A Twosome Place stores by district in Seoul').add_to(m)
m

# merge ediya_count and twosome_count on 시군구명 into ediya_twosome; the suffixes tell the two count columns apart
ediya_twosome = ediya_count.merge(twosome_count, on='시군구명', suffixes=('_이디야','_투썸'))

ediya_twosome['이디야_ratio'] = ediya_twosome['count_이디야'] / ediya_twosome['count_이디야'].sum()
ediya_twosome['투썸_ratio'] = ediya_twosome['count_투썸'] / ediya_twosome['count_투썸'].sum()
ediya_twosome['투썸 상대적 비율'] = ediya_twosome['투썸_ratio'] / ediya_twosome['이디야_ratio']
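
To see what the suffixes and the ratio columns produce, here is a small sketch with made-up counts (the variable names a, b, share_a and share_b are placeholders, not from the lecture). It also shows one thing worth keeping in mind: merge defaults to an inner join, so a district missing from either table disappears.

```python
import pandas as pd

# Made-up counts; the real ediya_count / twosome_count come from the steps above.
a = pd.DataFrame({"시군구명": ["강남구", "마포구", "종로구"], "count": [6, 3, 1]})
b = pd.DataFrame({"시군구명": ["강남구", "마포구"], "count": [3, 1]})

# Both frames have a 'count' column, so the suffixes keep them apart.
merged = a.merge(b, on="시군구명", suffixes=("_이디야", "_투썸"))
print(list(merged.columns))
# how='inner' is the default: 종로구 has no row in b, so it is dropped.
print(len(merged))

# Share of each chain's stores that falls in each district.
share_a = merged["count_이디야"] / merged["count_이디야"].sum()
share_b = merged["count_투썸"] / merged["count_투썸"].sum()
# Above 1: the district holds a larger share of Twosome stores than of Ediya stores.
merged["투썸 상대적 비율"] = share_b / share_a
print(merged)
```

Because the shares are computed after the merge, they are relative to the districts that survived the join, not to all stores of each chain.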

Plotting the relative Twosome ratio

f = folium.Figure(width=700, height=500)
m = folium.Map(location=[37.566535, 126.9779692], zoom_start=11).add_to(f)
folium.Choropleth(geo_data = geo_json
                  , data=ediya_twosome
                  , columns=['시군구명', '투썸 상대적 비율']
                  , key_on='properties.name'
                  , fill_color = 'RdPu'
                  , fill_opacity = 0.7
                  , line_opacity = 0.7
                  , legend_name = 'Relative Twosome ratio by district in Seoul').add_to(m)
m

# copy the store name, longitude, and latitude from ediya and twosome into '*_df', then add a 'kind' column holding the chain name ('이디야' or '투썸')

ediya_df = ediya[['상호명','경도','위도']].copy()
ediya_df['kind'] = '이디야'

twosome_df = twosome[['상호명','경도','위도']].copy()
twosome_df['kind'] = '투썸'

Concatenating the *_df frames

dff = pd.concat([ediya_df, twosome_df])
dff.head()

Rows are appended with pd.concat [[250312 온라인강의#04. 데이터 가공_인덱스, 행, 열#행]]

f = folium.Figure(width=700, height=500)
m = folium.Map(location=[37.566535, 126.9779692], zoom_start=11).add_to(f)

for idx in dff.index:
    lat = dff.loc[idx, '위도']
    long = dff.loc[idx, '경도']
    title = dff.loc[idx, '상호명']

    if dff.loc[idx, 'kind'] == "이디야":
        color = '#1d326c'
    else:
        color = '#D70035'
    folium.CircleMarker([lat, long]
                        , radius=3
                        , color = color
                        , tooltip = title).add_to(m)

Part 3. Data Analysis Project

Ch 01. Data Collection

01. A tour of public data platforms

Korean sites
International sites
- Kaggle
- awesomedata

02. Web crawling

Using pandas

pip install html5lib : installs the html5lib module

import pandas as pd
url = 'https://finance.naver.com/item/main.nhn?code=035720' # the value after code= changes with the stock
table_df_list = pd.read_html(url, encoding='euc-kr')
table_df = table_df_list[3]

pip install finance-datareader

import FinanceDataReader

# load the KOSPI listing
kospi = FinanceDataReader.StockListing("KOSPI")
kospi

kospi_info_list = []
for code in kospi['Code'][:10]:
    url = f'https://finance.naver.com/item/main.nhn?code={code}'
    table_df_list = pd.read_html(url, encoding='euc-kr')
    table_df = table_df_list[3]
    kospi_info_dic = {}
    kospi_info_dic['code'] = code
    kospi_info_dic['table'] = table_df
    kospi_info_list.append(kospi_info_dic)

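# <Store the financial statements of the first 10 listed stocks in kospi_info_list>

# create an empty kospi_info_list
# code takes, in order, the values in rows 0-9 of the 'Code' column of kospi
# put the code value into the {} placeholder of url
# load into table_df_list the list of tables found at url (the Naver Finance page of that stock)
# store the 4th table of the list (index 3) in table_df
# create an empty kospi_info_dic
# store the code value under the 'code' key of kospi_info_dic
# store table_df under the 'table' key of kospi_info_dic
# append kospi_info_dic to kospi_info_list, which was created outside the for loop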

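Using BeautifulSoup

The crawling process
- Request information from the web server with Python and fetch the HTML
- Parse the data (extract the content)
- Save the information you want

import requests                         # library for sending requests from Python
from bs4 import BeautifulSoup as bs     # library for parsing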

keyword = '제주도'
# url = f'https://search.naver.com/search.naver?query={keyword}&nso=&where=blog&sm=tab_opt'
url = f'https://search.naver.com/search.naver?where=nexearch&sm=top_hty&fbm=0&ie=utf8&query={keyword}'
res = requests.get(url)  # send the request
soup = bs(res.text, 'html.parser') # parse the response and store it in soup

Open the target web page and the browser's developer tools (F12)

Work out the structure of the part to crawl

title = [i.text for i in soup.find_all('a', class_='title_link')] 

# [~ for ~ in ~]: list comprehension
# i.text: returns the text inside i
# .find_all('a', class_='title_link'): finds every <a> tag whose class is 'title_link' and returns them as a list

date = [i.find('span').text for i in soup.find_all('div', class_='user_info')]

content = [i.text for i in soup.find_all('a', class_='dsc_link')]

df = pd.DataFrame({'title':title, 'date':date, 'content':content})
df
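
Since the live search page changes over time, the same find_all pattern is easier to study on a fixed snippet. The sketch below parses a hand-written HTML string; the class names imitate the ones used above and are an assumption, not a guarantee about the current Naver markup.

```python
from bs4 import BeautifulSoup as bs

# Hand-written HTML; class names imitate those used above and may not match
# the live page.
html = """
<div class="user_info"><span>2025.03.14.</span></div>
<a class="title_link" href="#">Jeju trip day 1</a>
<a class="dsc_link" href="#">We landed in Jeju and drove east.</a>
<div class="user_info"><span>2025.03.13.</span></div>
<a class="title_link" href="#">Jeju cafes</a>
<a class="dsc_link" href="#">Three cafes worth a visit.</a>
"""

soup = bs(html, "html.parser")
titles = [a.text for a in soup.find_all("a", class_="title_link")]
dates = [d.find("span").text for d in soup.find_all("div", class_="user_info")]
contents = [a.text for a in soup.find_all("a", class_="dsc_link")]

# pd.DataFrame needs columns of equal length, so check before building the table.
assert len(titles) == len(dates) == len(contents)
print(titles)
```

If one of the three selectors matches a different number of elements on the real page, building the DataFrame from the three lists fails, so a length check like the one above is a cheap safeguard.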

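Ch 02. Analyzing Box-Office Success Factors with Movie Data

01. Exploring the data & forming questions

movies data (tmdb_5000_movies.csv)

budget: production budget (in US dollars)
genres: all genres
homepage: official homepage
id: unique id of each movie
original_language: original language
original_title: original title
overview: short synopsis
popularity: popularity score provided by TMDB
production_companies: all production companies
production_countries: all production countries
release_date: release date
revenue: box-office revenue (in US dollars)
runtime: running time
spoken_languages: all languages spoken
status: release status
title: English title
vote_average: average rating on TMDB
vote_count: number of votes on TMDB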

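credits data (tmdb_5000_credits.csv)

movie_id: unique id of each movie
title: English title
cast: all cast members
crew: all crew members

Forming questions

What is the box-office revenue by year?
Which are the top 10 highest-grossing movies?
Which directors and actors were the most successful at the box office?
Genre and box-office revenue
- Which genres earn the most?
- Do popular genres change over time?
  - Are there genres that do well in particular months?
How does revenue correlate with budget, vote count, and rating?
What characterizes movies that have a high ROI (revenue relative to budget) and were also box-office hits?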

02. Data preprocessing

Keeping only the needed columns

movies_df = movies[['id','budget','genres','title','release_date','revenue','vote_average','vote_count']]
credits_df = credits[['movie_id','crew','cast']]

Merging the data
- 'id' in movies_df and 'movie_id' in credits_df are the key shared by the two tables

data = pd.merge(movies_df, credits_df, left_on = 'id', right_on = 'movie_id').drop('movie_id', axis=1)

Creating the roi column

data['roi'] = data['revenue'] / data['budget']

Creating the director column

data['crew']

import ast  # standard-library module used here to turn a string into the Python object it spells out

# data['crew'][0] : str type
# ast.literal_eval(data['crew'][0]) : list type

data['crew'] = data['crew'].apply(ast.literal_eval)  # convert the 'crew' column to lists and overwrite it

# define a function that returns the director's name

def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']

# create a director column in data by running the function above on the crew value of every row

data['director'] = data['crew'].apply(get_director)
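
The two steps above (string to list, then picking the director) can be tried on a single made-up value. The crew string below is invented for illustration and much shorter than a real row; the explicit return None is added only to make the no-director case visible.

```python
import ast

# A shortened, made-up crew string in the same shape as the 'crew' column.
crew_str = "[{'job': 'Editor', 'name': 'A. Editor'}, {'job': 'Director', 'name': 'B. Director'}]"

crew = ast.literal_eval(crew_str)   # str -> list of dicts
print(type(crew_str).__name__, type(crew).__name__)

def get_director(crew_list):
    """Return the name of the first crew member whose job is 'Director'."""
    for member in crew_list:
        if member["job"] == "Director":
            return member["name"]
    return None  # no director entry in this crew list

print(get_director(crew))
print(get_director([]))
```

Rows where the function returns None end up as missing values in the director column, so they are among the rows removed later by dropna.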

Creating the cast column
ast.literal_eval(data['cast'][0])

Each cast member is stored as one dictionary.

[i['name'] for i in ast.literal_eval(data['cast'][0])]
: a list comprehension in which the first movie's cast (data['cast'][0]) is converted to a list, i walks through that list in order, and the value under each 'name' key becomes an element of the new list

# use apply to run the list comprehension on every row

data['cast_name'] = data['cast'].apply(lambda x: [i['name'] for i in ast.literal_eval(x)])

# x is one value of data['cast'] (the string of a single row)

Creating the genre column
data['genres'] = data['genres'].apply(ast.literal_eval)
A single movie can have several genres.

def get_genres(x):  # return only the name of the first genre
    if len(x) > 0:
        return x[0]['name']

data['main_genre'] = data['genres'].apply(get_genres) # apply get_genres to data['genres'] and store only the first genre in the main_genre column
data.head()

Further preprocessing is needed to take all of a movie's genres into account.
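
One possible way to keep every genre, sketched on toy rows rather than the TMDB data, is to collect all genre names per movie and then explode them into one row per (movie, genre). The column name genre_names is a placeholder introduced here.

```python
import pandas as pd

# Toy rows in the shape 'genres' has after ast.literal_eval (values are made up).
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "revenue": [100, 50, 10],
    "genres": [
        [{"id": 1, "name": "Action"}, {"id": 2, "name": "Adventure"}],
        [{"id": 3, "name": "Drama"}],
        [],
    ],
})

# Keep every genre name instead of only the first one.
toy["genre_names"] = toy["genres"].apply(lambda gs: [g["name"] for g in gs])

# One row per (movie, genre); a movie with an empty list gets NaN and is dropped.
by_genre = (
    toy[["title", "revenue", "genre_names"]]
    .explode("genre_names")
    .dropna(subset=["genre_names"])
)
print(by_genre)
```

With this layout a movie's revenue is counted once for each of its genres, so per-genre sums overlap and should not be added up into a grand total.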

Changing data types

data['release_date'] = pd.to_datetime(data['release_date'], format='%Y-%m-%d')
data['id'] = data['id'].astype(str)

Creating year and month columns

data['year'] = data['release_date'].dt.year
data['month'] = data['release_date'].dt.month

Removing missing values
data.dropna(inplace=True)
inplace=True modifies data itself.

03. EDA, visualization, analysis (1)

Box-office revenue by year

revenue_by_year = data.groupby('year')[['revenue']].sum().reset_index()
fig = px.line(data_frame=revenue_by_year, x="year", y="revenue")

fig.update_layout(yaxis_type="log") # log scale on the y-axis

fig.show()

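On a log scale the curve is close to a straight line, which means revenue grew by a roughly constant percentage each year.
After 2000 the slope flattens slightly, so the growth rate of box-office revenue has slowed since then.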
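
Why a straight line on a log axis means a constant growth rate can be checked numerically. The numbers below are hypothetical (a series growing 10% per year) and are not taken from the TMDB data.

```python
import math

# Hypothetical revenue growing 10% per year.
start, rate, years = 100.0, 0.10, 5
revenue = [start * (1 + rate) ** t for t in range(years)]

# A log axis plots log(revenue); its year-to-year step is the same every year.
log_steps = [math.log10(revenue[t + 1]) - math.log10(revenue[t]) for t in range(years - 1)]
print([round(s, 6) for s in log_steps])

# That constant step equals log10(1 + rate), so the slope encodes the growth rate.
print(round(math.log10(1 + rate), 6))
```

A flatter slope after some year therefore corresponds to a smaller percentage growth per year, which is how the post-2000 bend is read above.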

Top 10 highest-grossing movies

top = data.groupby('title')['revenue'].sum().reset_index().sort_values('revenue', ascending=False).head(10)
fig = px.bar(data_frame=top, x='title', y='revenue', title="Top 10 movies by box-office revenue")
fig.show()

Top 10 by budget and by vote count

title_dic = {'budget':'budget', 'vote_count':'vote count'}
for y in ['budget','vote_count']:
    top = data.groupby('title')[[y]].sum().reset_index().sort_values(y, ascending=False).head(10)
    fig = px.bar(data_frame=top, x='title', y=y, title=f"Top 10 movies by {title_dic[y]}")
    fig.show()

The correlation between the two can be checked with data_corr = data[['budget', 'vote_count']].corr()

Most successful directors and actors at the box office

top_director = data.groupby(['director'])['revenue'].sum().reset_index().sort_values('revenue', ascending=False).head(10)
fig = px.bar(data_frame=top_director, x='director', y='revenue', title="Top 10 directors by box-office revenue")
fig.show()

revenue_cast = data[['revenue', 'cast_name']].explode('cast_name')

Select revenue and cast_name from data, then use explode to build a DataFrame with one row per element of cast_name.
Every new row keeps the revenue of the movie it came from.
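
A two-movie toy frame (invented values) makes the effect of explode concrete, including the side effect that a movie's full revenue is credited to every cast member.

```python
import pandas as pd

toy = pd.DataFrame({
    "revenue": [300, 100],
    "cast_name": [["Actor A", "Actor B", "Actor C"], ["Actor A"]],
})

# Each list element becomes its own row; revenue is copied onto every row.
exploded = toy.explode("cast_name")
print(exploded)

per_actor = exploded.groupby("cast_name")["revenue"].sum()
# Because of the copies, the per-actor total exceeds the real total revenue.
print(per_actor.sum(), toy["revenue"].sum())
```

This is worth remembering when reading a per-actor ranking: large ensemble films lift every name in their cast by the full amount.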

top_cast = revenue_cast.groupby('cast_name')[['revenue']].sum().reset_index().sort_values('revenue', ascending=False).head(10)
fig = px.bar(data_frame=top_cast, x='cast_name', y='revenue', title="Top 10 actors by box-office revenue")
fig.show()

# Assuming 'data' is your original DataFrame
# Calculate the number of cast members for each movie
data['num_cast'] = data['cast_name'].apply(len)

# Adjust the revenue by dividing by the number of cast members
data['adjusted_revenue'] = data['revenue'] / data['num_cast']

# Explode the 'cast_name' column
adjrevenue_cast = data[['adjusted_revenue', 'cast_name']].explode('cast_name')

top_cast = adjrevenue_cast.groupby('cast_name')[['adjusted_revenue']].sum().reset_index().sort_values('adjusted_revenue', ascending=False).head(10)

# Create the bar plot
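fig = px.bar(data_frame=top_cast, x='cast_name', y='adjusted_revenue', title="Top 10 actors by box-office revenue (split across cast)")
fig.show()

When each movie's revenue is divided by the number of participating cast members, the ranking differs from the result above.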

Genre and box-office revenue

fig = px.box(data_frame = data, y = 'main_genre', x = 'revenue', hover_name = 'title')
fig.show()

genre_avg_revenue = data.groupby('main_genre')[['revenue']].mean().reset_index()
fig = px.bar(data_frame = genre_avg_revenue, x = 'main_genre', y = 'revenue', title = 'Mean box-office revenue by genre')
fig.show()

genre_sum_revenue = data.groupby('main_genre')[['revenue']].sum().reset_index()
fig = px.bar(data_frame = genre_sum_revenue, x = 'main_genre', y = 'revenue', title = 'Total box-office revenue by genre')
fig.show()

revenue_by_year_genre = data.query('year >= 1999').groupby(['year','main_genre'])[['revenue']].sum().reset_index()
fig = px.bar(data_frame=revenue_by_year_genre, x="year", y="revenue", color='main_genre', color_discrete_sequence=px.colors.qualitative.Light24_r)
fig.show()

![[IMG-210819.png]]

Correlation of revenue with budget, vote count, and rating

data[['budget','revenue','vote_average','vote_count']].corr()

fig = px.imshow(data[['budget','revenue','vote_average','vote_count']].corr(), text_auto='.2f', color_continuous_scale='Purp')
fig.show()

for x in ['budget', 'vote_count', 'vote_average']:
    fig = px.scatter(data_frame = data, x = x, y = 'revenue', hover_name = 'title', size = 'revenue', color = 'revenue'
    , color_continuous_scale = px.colors.sequential.Sunsetdark, width = 700, height = 600, trendline = 'ols')
    fig.show()

Filtering to the top 100 by revenue

top100 = data.sort_values('revenue', ascending=False).head(100)

fig = px.imshow(top100[['budget','revenue','vote_average','vote_count']].corr(), text_auto='.2f', color_continuous_scale='Mint')
fig.show()

Characteristics of movies with a high ROI (revenue relative to budget) that were also box-office hits

top300 = data.sort_values('revenue', ascending=False).head(300)

fig = px.scatter(data_frame = top300, x = 'roi', y = 'revenue', hover_name = 'title', size = 'revenue', color = 'main_genre',
color_discrete_sequence=px.colors.qualitative.Light24, width = 700, height = 600)
fig.show()

fig = px.box(data_frame = top300, y = 'main_genre', x = 'roi', hover_name = 'title')
fig.show()

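Ch 03. Retail Data - Retention and RFM Analysis

01. Exploring the data & forming questions

data = pd.read_csv('ecommerce_data.csv')

InvoiceNo: invoice number
StockCode: product code
Description: product name
Quantity: quantity sold
InvoiceDate: payment date
UnitPrice: price per unit
CustomerID: customer number
Country: country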

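Forming questions

How do revenue, the number of ordering customers, and the average order value change over time?
Retention analysis: how many customers stayed and how many churned as time passed?
RFM analysis: segment customers by their behavior.
What is retention analysis?
- An analysis that tracks user retention and churn by checking whether users are still using the product some time after they first used it
- Example: retention is the share of users who visited the app on Day0 and came back on Day1
- High retention can generally be read as users using the service regularly, so it is an important metric when deciding product direction to raise engagement and loyalty.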
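
The Day0/Day1 definition above reduces to a set intersection. The sketch below uses hypothetical visitor IDs, not the e-commerce data.

```python
# Hypothetical visitor IDs per day.
day0_users = {"u1", "u2", "u3", "u4"}
day1_users = {"u2", "u4", "u9"}

# Retained users are those present on both days.
retained = day0_users & day1_users

# Retention is measured against the starting group, so u9 (new on Day1) is ignored.
retention_rate = len(retained) / len(day0_users)
print(f"{retention_rate:.0%}")
```

The denominator is always the starting group, which is why new users arriving later never raise the rate.
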
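What is RFM analysis?
- A method of segmenting customers based on Recency, Frequency, and Monetary value
  - Recency: how recently the customer made a purchase
  - Frequency: how often the customer purchases
  - Monetary: the total amount the customer has spent
- Segmenting customers into finer types makes it possible to plan tailored strategies.
  - Example: users with a low total spend who visit often vs users who recently spent a large amount but have not visited often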

02. Data preprocessing

Checking the data
Removing missing values

# this is a customer-level analysis, so rows without a CustomerID are dropped
data.dropna(subset=['CustomerID'], inplace=True)
data.info()

Changing data types

data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'], format='%m/%d/%Y %H:%M')
data['CustomerID'] = data['CustomerID'].astype(int).astype(str)
data.info()

Adding a revenue column

data['amount'] = data['Quantity'] * data['UnitPrice'] # revenue = quantity * unit price
data.head()

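03. Analysis

Trends in revenue, ordering customers, and average order value over time

Revenue

# date_ymd is not created anywhere earlier in these notes; it is assumed here to be
# InvoiceDate truncated to the day (time set to 00:00) and kept as a datetime column
data['date_ymd'] = data['InvoiceDate'].dt.normalize()

amount_by_date = data.groupby('date_ymd')[['amount']].sum().reset_index()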

fig = px.line(data_frame=amount_by_date, x='date_ymd', y='amount')
fig.show()

Number of ordering customers
[[250312 온라인강의#11. 데이터 집계_groupby#groupby]]

customer_count_by_date = data.groupby('date_ymd')[['CustomerID']].nunique().reset_index().rename({'CustomerID':'customer_count'}, axis=1)  # .nunique(): number of unique values

fig = px.line(data_frame=customer_count_by_date, x='date_ymd', y='customer_count')
fig.show()

Average order value

# number of orders per day
invoice_count_by_date = data.groupby('date_ymd')[['InvoiceNo']].nunique().reset_index().rename({'InvoiceNo':'invoice_count'}, axis=1)

# daily revenue / daily number of orders = average order value
invoice_amount = pd.merge(amount_by_date, invoice_count_by_date, on='date_ymd')
invoice_amount['amount_per_invoice'] = invoice_amount['amount'] / invoice_amount['invoice_count']

# plot
fig = px.line(data_frame=invoice_amount, x='date_ymd', y='amount_per_invoice')
fig.show()

04. Retention analysis

Preprocessing customer and invoice numbers at year-month granularity

# drop rows where ["CustomerID", "InvoiceNo", "date_ymd"] are all identical
retention_base = data[["CustomerID", "InvoiceNo", "date_ymd"]].drop_duplicates()

# add a monthly 'date_ym' column (period[M] dtype)
retention_base['date_ym'] = retention_base['date_ymd'].dt.to_period('M')
retention_base.head()

Adjusting the date range

# if December 2011 were included, its retention would inevitably look low, so that month is excluded
retention_base = retention_base.query('date_ymd <= "2011-11-30"')

Computing retention

# build the list of periods
date_ym_list = sorted(list(retention_base['date_ym'].unique()))

# example start and target periods
period_start = date_ym_list[2]
period_target = date_ym_list[3]

# .query keeps the rows of retention_base whose date_ym equals the given period (@variable); set() reduces them to the unique users
period_start_users = set(retention_base.query('date_ym == @period_start')['CustomerID'])
period_target_users = set(retention_base.query('date_ym == @period_target')['CustomerID'])

# retained_users is the intersection of period_start_users and period_target_users
retained_users = period_start_users.intersection(period_target_users)
retention_rate = len(retained_users) / len(period_start_users)
print(retention_rate)

date_ym_list = sorted(list(retention_base['date_ym'].unique()))

from tqdm.notebook import tqdm

# create an empty retention DataFrame
retention = pd.DataFrame()

for s in tqdm(date_ym_list):
    for t in date_ym_list:
        period_start = s
        period_target = t

        # compute retention_rate only when period_target is the same as or later than period_start
        if period_start <= period_target:
            period_start_users = set(retention_base.query('date_ym == @period_start')['CustomerID'])
            period_target_users = set(retention_base.query('date_ym == @period_target')['CustomerID'])

            retained_users = period_start_users.intersection(period_target_users)

            retention_rate = len(retained_users) / len(period_start_users)

            # store period_start, period_target, and retention_rate in temp under the keys below
            temp = pd.DataFrame({'cohort':[period_start], 'date_ym':[period_target], 'retention_rate':[retention_rate]})

            # append temp to retention with concat on every pass through the loop
            retention = pd.concat([retention, temp])

# add the cohort_size(month) column
retention['cohort_size(month)'] = retention.apply(lambda x: (x['date_ym'] - x['cohort']).n, axis=1) # (difference of two periods).n = number of periods between them; this works because the period values were defined with period[M] dtype earlier
retention.head()
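
The .n trick relies on how pandas subtracts Period values. A minimal sketch with two hand-picked months shows it, followed by a column-wise alternative (an assumption of mine, not part of the lecture code) that avoids apply by comparing year*12+month.

```python
import pandas as pd

cohort = pd.Period("2011-01", freq="M")
target = pd.Period("2011-04", freq="M")

diff = target - cohort      # a month offset object, not a plain integer
print(type(diff).__name__)
print(diff.n)               # number of monthly periods between the two

# Column-wise alternative without apply: compare year*12 + month.
df = pd.DataFrame({
    "cohort": pd.PeriodIndex(["2010-12", "2011-01"], freq="M"),
    "date_ym": pd.PeriodIndex(["2011-02", "2011-01"], freq="M"),
})
months = (df["date_ym"].dt.year * 12 + df["date_ym"].dt.month) - (
    df["cohort"].dt.year * 12 + df["cohort"].dt.month
)
print(months.tolist())
```

Either route has to run before the two columns are cast to strings, since string columns no longer support period arithmetic.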

# cohort and date_ym are used as row and column labels later, so convert them to strings (otherwise a type error occurs further down)
retention['cohort'] = retention['cohort'].astype(str)
retention['date_ym'] = retention['date_ym'].astype(str)

retention_final = pd.pivot_table(data=retention, index='cohort', columns='cohort_size(month)', values='retention_rate')

fig = px.imshow(retention_final, text_auto='.2%', color_continuous_scale='Burg')
fig.show()

Retention curve

retention_curve = retention.groupby('cohort_size(month)')[['retention_rate']].mean().reset_index()

fig = px.line(data_frame = retention_curve, x='cohort_size(month)', y='retention_rate', title='Retention curve')
fig.update_yaxes(tickformat='.2%')
fig.show()

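05. RFM analysis

What is RFM analysis?
- A method of segmenting customers based on Recency, Frequency, and Monetary value
  - Recency: how recently the customer made a purchase
  - Frequency: how often the customer purchases
  - Monetary: the total amount the customer has spent
- Segmenting customers into finer types makes it possible to plan tailored strategies.
  - Example: users with a low total spend who visit often vs users who recently spent a large amount but have not visited often
Computing R and M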

today_date = max(data['date_ymd']) # most recent date in the data, used as "today"

rfm = data.groupby('CustomerID').agg({'InvoiceDate': lambda x: (today_date - x.max()).days, # days elapsed between the last purchase and "today"
                                    'amount': lambda x: x.sum()}) # total order amount

rfm.columns = ['recency', 'monetary']
rfm.head()

Split each factor into several grades and assign a score
- pd.qcut(column, number of grades, labels)

pd.qcut(rfm["recency"], 5, labels=[5, 4, 3, 2, 1])
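
How pd.qcut assigns the scores is easiest to see on a short series. The recency values below are hypothetical; the labels run 3, 2, 1 so that the most recent buyers get the highest score, as in the notes.

```python
import pandas as pd

# Hypothetical recency values in days (smaller = purchased more recently).
recency = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Three equal-sized groups; the smallest values receive the label 3.
score = pd.qcut(recency, 3, labels=[3, 2, 1])
print(score.tolist())

# qcut cuts at quantiles, so each group holds the same number of customers here.
print(score.value_counts().sort_index().tolist())
```

On real data with many tied values the quantile edges can coincide, in which case qcut may raise an error about non-unique bin edges; that case is not handled in this sketch.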

rfm['recency_score'] = pd.qcut(rfm["recency"], 3, labels=[3, 2, 1])
rfm['monetary_score'] = pd.qcut(rfm["monetary"], 3, labels=[1, 2, 3])
rfm['rm_score'] = rfm['recency_score'].astype(str) + rfm['monetary_score'].astype(str)
rfm.reset_index(inplace=True)
rfm

rm_score = rfm.groupby('rm_score')[['CustomerID']].nunique().reset_index().rename({'CustomerID':'customer_count'}, axis=1)

rm_score

def categorize_customer(score):
    if score == '33':
        return 'Best' # recency and spend both very high
    elif score in ['32','23','22']:
        return 'Good' # recency and spend both high
    elif score =='11':
        return 'Dormant' # recency and spend both low
    elif score in ['12','13']:
        return 'Churn prevention' # spend is high but recency is low -> needs to be won back
    elif score in ['31','21']:
        return 'Purchase nudge' # recency is high but spend is low -> needs encouragement to buy

rm_score['category'] = rm_score['rm_score'].apply(categorize_customer)

fig = px.treemap(data_frame = rm_score, path=['category'], values='customer_count', color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.show()

