Correlation coefficient game: http://guessthecorrelation.com/

1. Linear regression analysis

  • Simple linear regression: $Y = \beta_0 + \beta_1 X$

  • Multiple linear regression: $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$

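It is a special case of the generalized linear model and the first model to consider when the dependent variable is numeric.

The term usually refers to multiple linear regression; when there is only one independent variable, it is called simple linear regression.

Examples: predicting a business's revenue, predicting energy consumption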
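
A minimal sketch of how the two coefficients of a simple linear regression can be estimated by least squares. The synthetic data, the seed, and the names `beta0_hat` and `beta1_hat` are assumptions made for illustration; they are not part of the marketing data used later.

```python
import numpy as np

# Synthetic data (assumed for illustration): Y = 2 + 3*X + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=200)
Y = 2.0 + 3.0 * X + rng.normal(0, 1, size=200)

# Closed-form least-squares estimates for simple linear regression
beta1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0_hat = Y.mean() - beta1_hat * X.mean()

print(beta0_hat, beta1_hat)  # should land near the true values 2 and 3
```

The slope is the covariance of X and Y divided by the variance of X, and the intercept forces the line through the point of means.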

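Standard assumptions of regression analysis

  1. Linearity between the independent variables and the dependent variable
  2. Homoscedasticity (constant variance) of the residuals
  3. Independence of the residuals
  4. Normality of the residuals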

https://rstudio-pubs-static.s3.amazonaws.com/190997_40fa09db8e344b19b14a687ea5de914b.html
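
These assumptions are usually checked on the residuals of a fitted model. As one rough example, the sketch below computes the Durbin-Watson statistic, which also appears in the statsmodels summaries later in this notebook; values near 2 suggest no first-order autocorrelation. The data are synthetic (an assumption), with independent errors by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=300)
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=300)  # independent errors by construction

# Fit a straight line and compute the residuals
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

# Durbin-Watson: sum of squared successive differences over sum of squared residuals
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(round(dw, 3))
```

The statistic depends on the order of the observations, so it is most meaningful when rows have a natural ordering such as time.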

  • Explanatory power ($R^2$)

$R^2$ is the proportion of the total variation (TSS, Total Sum of Squares) in the dependent variable that the model explains.
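
A small sketch of that definition, written as $R^2 = 1 - SSE/TSS$, where SSE is the sum of squared residuals. The helper name `r_squared` and the toy arrays are assumptions for illustration.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SSE/TSS, the share of total variation explained."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    sse = np.sum((y - y_hat) ** 2)       # unexplained variation
    tss = np.sum((y - y.mean()) ** 2)    # total variation around the mean
    return 1.0 - sse / tss

y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))                     # 1.0: a perfect fit
print(r_squared(y, np.full(4, y.mean())))  # 0.0: no better than predicting the mean
```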

In [1]:
import pandas as pd

from sklearn import datasets, linear_model

import statsmodels.formula.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.font_manager as fm

import seaborn as sns
from seaborn.linearmodels import corrplot,symmatplot

import numpy as np

import os

import math

from mpl_toolkits.mplot3d import Axes3D

%matplotlib inline
In [2]:
os.chdir('/Users/jinseokryu/Desktop/ML강의자료/대구대학교/linear_regression/')

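Hands-on data exercise

  • Data description

Marketing data from 172 bicycle rental businesses

  • Variable description

google_adwords, facebook, twitter : spending on Google AdWords, Facebook ads, and Twitter ads, respectively

marketing_total : total marketing budget

revenues : revenue

employees : number of employees

pop_density : population density level of the target market

Costs are in units of $1,000 (1 = $1,000).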

In [3]:
marketing = pd.read_csv("./marketing.csv")
In [4]:
marketing = marketing.drop(['employees'],axis=1)
In [5]:
marketing.head()
Out[5]:
google_adwords facebook twitter marketing_total revenues pop_density
0 65.66 47.86 52.46 165.98 39.26 High
1 39.10 55.20 77.40 171.70 38.90 Medium
2 174.81 52.01 68.01 294.83 49.51 Medium
3 34.36 61.96 86.86 183.18 40.56 High
4 78.21 40.91 30.41 149.53 40.21 Low
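
In the five rows displayed above, marketing_total equals the sum of the three channel columns. The sketch below checks this using values copied from that output; it says nothing about the remaining 167 rows, which would need the same check on the full DataFrame.

```python
# Rows copied from marketing.head(): (google_adwords, facebook, twitter, marketing_total)
rows = [
    (65.66, 47.86, 52.46, 165.98),
    (39.10, 55.20, 77.40, 171.70),
    (174.81, 52.01, 68.01, 294.83),
    (34.36, 61.96, 86.86, 183.18),
    (78.21, 40.91, 30.41, 149.53),
]

# Compare with a small tolerance to absorb floating-point rounding
matches = [abs((g + f + t) - total) < 1e-6 for g, f, t, total in rows]
print(all(matches))
```

If the relationship holds for every row, marketing_total is an exact linear combination of the channels and should not be used together with all three of them as predictors.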
In [6]:
marketing.describe()
Out[6]:
google_adwords facebook twitter marketing_total revenues
count 172.000000 172.000000 172.000000 172.000000 172.000000
mean 169.868488 33.869651 38.982442 242.720581 44.610930
std 87.472279 15.270010 21.962255 95.859483 5.835498
min 23.650000 8.000000 5.890000 53.650000 30.450000
25% 97.247500 19.367500 20.937500 158.415000 40.327500
50% 169.475000 33.655000 34.595000 245.565000 43.995000
75% 243.105000 47.805000 52.937500 322.615000 48.612500
max 321.000000 62.170000 122.190000 481.000000 58.380000
In [7]:
sns.pairplot(marketing)
Out[7]:
<seaborn.axisgrid.PairGrid at 0x1108abba8>
In [8]:
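# Restrict to numeric columns (pop_density is categorical) before computing correlations
marketing.select_dtypes(include='number').corr()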
Out[8]:
google_adwords facebook twitter marketing_total revenues
google_adwords 1.000000 0.076432 0.098975 0.947357 0.766246
facebook 0.076432 1.000000 0.354341 0.310223 0.577821
twitter 0.098975 0.354341 1.000000 0.375869 0.269685
marketing_total 0.947357 0.310223 0.375869 1.000000 0.853035
revenues 0.766246 0.577821 0.269685 0.853035 1.000000
In [9]:
# corrplot is deprecated in favor of heatmap, so plot the correlation matrix with heatmap
sns.heatmap(marketing.select_dtypes(include='number').corr(), annot=True)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1ddd1ba8>
In [10]:
model = sm.ols(formula = 'revenues ~ marketing_total', data = marketing)
result = model.fit()
# Print the summary results
print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               revenues   R-squared:                       0.728
Model:                            OLS   Adj. R-squared:                  0.726
Method:                 Least Squares   F-statistic:                     454.2
Date:                Fri, 29 Jun 2018   Prob (F-statistic):           6.88e-50
Time:                        15:44:14   Log-Likelihood:                -435.09
No. Observations:                 172   AIC:                             874.2
Df Residuals:                     170   BIC:                             880.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          32.0067      0.636     50.357      0.000      30.752      33.261
marketing_total     0.0519      0.002     21.313      0.000       0.047       0.057
==============================================================================
Omnibus:                        1.845   Durbin-Watson:                   1.993
Prob(Omnibus):                  0.397   Jarque-Bera (JB):                1.550
Skew:                          -0.226   Prob(JB):                        0.461
Kurtosis:                       3.105   Cond. No.                         712.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
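
To read the fitted line, the sketch below plugs the coefficients from the summary above into the regression equation. The helper name `predict_revenue` and the budget value 200 are assumptions for illustration; marketing_total is in units of $1,000.

```python
# Coefficients copied from the summary above (revenues ~ marketing_total)
intercept = 32.0067
slope = 0.0519

def predict_revenue(marketing_total):
    """Point prediction from the fitted line."""
    return intercept + slope * marketing_total

pred_200 = predict_revenue(200)
step = predict_revenue(201) - predict_revenue(200)
print(round(pred_200, 4))  # 42.3867
print(round(step, 4))      # 0.0519, the slope
```

Each additional unit of marketing_total is associated with the same increase in predicted revenues, which is exactly what the slope coefficient states.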
In [11]:
x, y = marketing["marketing_total"], marketing["revenues"]

# Scatter plot with marginal histograms; x and y are passed as keywords
sns.jointplot(x=x, y=y, kind='scatter', joint_kws={'alpha':0.5})
Out[11]:
<seaborn.axisgrid.JointGrid at 0x1a1ed0f0f0>
In [12]:
plt.scatter(x,y)
# Overlay the fitted regression line
plt.plot(x,result.predict(marketing[["marketing_total"]]),'r')
plt.xlabel('marketing_total')
plt.ylabel('revenues')
Out[12]:
<matplotlib.text.Text at 0x1a1ee28240>
In [13]:
model = sm.ols(formula = 'revenues ~ google_adwords', data = marketing)
result = model.fit()
# Print the summary results
print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               revenues   R-squared:                       0.587
Model:                            OLS   Adj. R-squared:                  0.585
Method:                 Least Squares   F-statistic:                     241.8
Date:                Fri, 29 Jun 2018   Prob (F-statistic):           1.75e-34
Time:                        15:44:15   Log-Likelihood:                -470.88
No. Observations:                 172   AIC:                             945.8
Df Residuals:                     170   BIC:                             952.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         35.9276      0.628     57.229      0.000      34.688      37.167
google_adwords     0.0511      0.003     15.548      0.000       0.045       0.058
==============================================================================
Omnibus:                        1.725   Durbin-Watson:                   1.958
Prob(Omnibus):                  0.422   Jarque-Bera (JB):                1.555
Skew:                          -0.112   Prob(JB):                        0.460
Kurtosis:                       2.592   Cond. No.                         418.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [14]:
model = sm.ols(formula = 'revenues ~ facebook', data = marketing)
result = model.fit()
# Print the summary results
print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               revenues   R-squared:                       0.334
Model:                            OLS   Adj. R-squared:                  0.330
Method:                 Least Squares   F-statistic:                     85.21
Date:                Fri, 29 Jun 2018   Prob (F-statistic):           1.05e-16
Time:                        15:44:15   Log-Likelihood:                -512.02
No. Observations:                 172   AIC:                             1028.
Df Residuals:                     170   BIC:                             1034.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     37.1319      0.888     41.800      0.000      35.378      38.885
facebook       0.2208      0.024      9.231      0.000       0.174       0.268
==============================================================================
Omnibus:                       12.339   Durbin-Watson:                   1.809
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               13.530
Skew:                          -0.687   Prob(JB):                      0.00115
Kurtosis:                       2.973   Cond. No.                         90.6
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [15]:
model = sm.ols(formula = 'revenues ~ twitter', data = marketing)
result = model.fit()
# Print the summary results
print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               revenues   R-squared:                       0.073
Model:                            OLS   Adj. R-squared:                  0.067
Method:                 Least Squares   F-statistic:                     13.33
Date:                Fri, 29 Jun 2018   Prob (F-statistic):           0.000347
Time:                        15:44:15   Log-Likelihood:                -540.46
No. Observations:                 172   AIC:                             1085.
Df Residuals:                     170   BIC:                             1091.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     41.8176      0.877     47.660      0.000      40.086      43.550
twitter        0.0717      0.020      3.652      0.000       0.033       0.110
==============================================================================
Omnibus:                        4.051   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.132   Jarque-Bera (JB):                3.473
Skew:                           0.256   Prob(JB):                        0.176
Kurtosis:                       2.528   Cond. No.                         91.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
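
With a single predictor, $R^2$ equals the square of the Pearson correlation between the predictor and the response. The sketch below checks this against numbers copied from the correlation table and from the four single-predictor summaries in this notebook.

```python
# Pearson correlations with revenues, copied from the correlation table
corr_with_revenues = {
    "marketing_total": 0.853035,
    "google_adwords": 0.766246,
    "facebook": 0.577821,
    "twitter": 0.269685,
}

# R-squared values reported in the four single-predictor summaries
reported_r2 = {
    "marketing_total": 0.728,
    "google_adwords": 0.587,
    "facebook": 0.334,
    "twitter": 0.073,
}

for name, r in corr_with_revenues.items():
    # Squared correlation, rounded to the precision of the summary
    print(name, round(r ** 2, 3), reported_r2[name])
```

The two columns agree for every predictor, so the ranking of the single-predictor models by $R^2$ is the same as the ranking by absolute correlation with revenues.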
In [16]:
model1 = sm.ols(formula = 'revenues ~ google_adwords + facebook', data = marketing)
result1 = model1.fit()
# Print the summary results
print(result1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               revenues   R-squared:                       0.858
Model:                            OLS   Adj. R-squared:                  0.857
Method:                 Least Squares   F-statistic:                     512.0
Date:                Fri, 29 Jun 2018   Prob (F-statistic):           1.90e-72
Time:                        15:44:16   Log-Likelihood:                -378.88
No. Observations:                 172   AIC:                             763.8
Df Residuals:                     169   BIC:                             773.2
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         29.6195      0.509     58.201      0.000      28.615      30.624
google_adwords     0.0485      0.002     25.014      0.000       0.045       0.052
facebook           0.1996      0.011     17.988      0.000       0.178       0.222
==============================================================================
Omnibus:                        5.844   Durbin-Watson:                   2.023
Prob(Omnibus):                  0.054   Jarque-Bera (JB):                5.718
Skew:                          -0.402   Prob(JB):                       0.0573
Kurtosis:                       2.611   Cond. No.                         584.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [17]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

dfX0 = marketing[["google_adwords","facebook"]]
vif = pd.DataFrame()
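# Note: variance_inflation_factor does not add an intercept; dfX0 has no constant column, so these are uncentered VIFs
vif["VIF Factor"] = [variance_inflation_factor(dfX0.values, i) for i in range(dfX0.shape[1])]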
vif["features"] = dfX0.columns
vif
Out[17]:
VIF Factor features
0 3.142389 google_adwords
1 3.142389 facebook
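
The VIF values above are computed on raw columns without an intercept, which can inflate them when the columns have large non-zero means. For two predictors, the usual centered VIF is $1/(1 - r^2)$; with the correlation 0.076432 between google_adwords and facebook this is about 1.006. The sketch below is a NumPy-only version that includes an intercept in each auxiliary regression. The function name `vif_with_intercept` and the synthetic data are assumptions; this is not the statsmodels implementation.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on the remaining columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic, independently drawn predictors with large means (assumed data)
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(170, 87, 500), rng.normal(34, 15, 500)])
vifs = vif_with_intercept(X)
print(vifs)  # both values close to 1 because the columns are nearly uncorrelated
```

With statsmodels, a comparable result can be obtained by adding a constant column to the design matrix before calling `variance_inflation_factor` and ignoring the VIF of the constant itself.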
In [18]:
model = sm.ols(formula = 'revenues ~ google_adwords + facebook + twitter', data = marketing)
result = model.fit()
# Print the summary results
print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               revenues   R-squared:                       0.859
Model:                            OLS   Adj. R-squared:                  0.856
Method:                 Least Squares   F-statistic:                     339.8
Date:                Fri, 29 Jun 2018   Prob (F-statistic):           4.36e-71
Time:                        15:44:16   Log-Likelihood:                -378.77
No. Observations:                 172   AIC:                             765.5
Df Residuals:                     168   BIC:                             778.1
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         29.5460      0.534     55.379      0.000      28.493      30.599
google_adwords     0.0484      0.002     24.846      0.000       0.045       0.052
facebook           0.1977      0.012     16.650      0.000       0.174       0.221
twitter            0.0039      0.008      0.470      0.639      -0.012       0.020
==============================================================================
Omnibus:                        5.684   Durbin-Watson:                   2.015
Prob(Omnibus):                  0.058   Jarque-Bera (JB):                5.471
Skew:                          -0.386   Prob(JB):                       0.0649
Kurtosis:                       2.592   Cond. No.                         622.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
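
Adding twitter barely changes $R^2$ (0.858 to 0.859), lowers the adjusted $R^2$ (0.857 to 0.856), and its coefficient is not significant (p = 0.639). The information criteria point the same way. The sketch below recomputes AIC and BIC from the log-likelihoods reported in the two summaries, counting the intercept as an estimated coefficient; the function names are assumptions for illustration.

```python
import math

n = 172  # No. Observations in both summaries

def aic(log_lik, k):
    """Akaike information criterion for k estimated coefficients."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion for k coefficients and n observations."""
    return k * math.log(n) - 2 * log_lik

# (log-likelihood, number of coefficients including the intercept)
models = {
    "google_adwords + facebook": (-378.88, 3),
    "google_adwords + facebook + twitter": (-378.77, 4),
}

for name, (log_lik, k) in models.items():
    print(name, round(aic(log_lik, k), 1), round(bic(log_lik, k, n), 1))
```

Lower values are better for both criteria, so the two-channel model is preferred: the small gain in log-likelihood does not pay for the extra parameter.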
In [19]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

dfX0 = marketing[["google_adwords","facebook","twitter"]]
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(dfX0.values, i) for i in range(dfX0.shape[1])]
vif["features"] = dfX0.columns
vif
Out[19]:
VIF Factor features
0 3.436268 google_adwords
1 5.009430 facebook
2 4.384159 twitter
In [20]:
train = marketing[:int(len(marketing)*0.7)]
test = marketing[int(len(marketing)*0.7):]
In [21]:
y_fit = result1.predict(test[['google_adwords','facebook']])
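
Note that result1 was fitted on the full data set, so the rows in test were also used for fitting and the predictions above are not a true out-of-sample check. The sketch below shows one way to fit on the training rows only and score on the held-out rows. It uses NumPy and synthetic data; the coefficients, noise level, and variable names are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 172
# Two synthetic spend columns and a synthetic response (assumed data)
X = np.column_stack([rng.uniform(20, 320, n), rng.uniform(8, 62, n)])
y = 30.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 2, n)

# Same 70/30 positional split as in the notebook
cut = int(n * 0.7)
X_train, X_test = X[:cut], X[cut:]
y_train, y_test = y[:cut], y[cut:]

# Fit on the training rows only (intercept column added by hand)
A_train = np.column_stack([np.ones(len(X_train)), X_train])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

# Evaluate on rows the model has not seen
A_test = np.column_stack([np.ones(len(X_test)), X_test])
rmse = np.sqrt(np.mean((y_test - A_test @ coef) ** 2))
print(coef, rmse)
```

A positional split is only safe when the row order carries no structure; otherwise shuffling the rows before splitting is the more common choice.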
In [22]:
x_surf, y_surf = np.meshgrid(
    np.linspace(marketing.google_adwords.min(), marketing.google_adwords.max(), 100),
    np.linspace(marketing.facebook.min(), marketing.facebook.max(), 100))
onlyX = pd.DataFrame({'google_adwords': x_surf.ravel(), 'facebook': y_surf.ravel()})
fittedY=result1.predict(exog=onlyX)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# .values converts the predicted Series to an ndarray before reshaping
ax.plot_surface(x_surf,y_surf,fittedY.values.reshape(x_surf.shape), color='r', alpha=0.7)
ax.scatter(train['google_adwords'],train['facebook'],train['revenues'],c='blue', marker='o', alpha=0.5)
ax.scatter(test['google_adwords'],test['facebook'],test['revenues'],c='green', marker='o', alpha=0.5)
ax.set_xlabel('google_adwords')
ax.set_ylabel('facebook')
ax.set_zlabel('revenues')
ax.view_init(azim=50)
In [23]:
sns.pairplot(marketing, hue='pop_density',size=3)
Out[23]:
<seaborn.axisgrid.PairGrid at 0x1a1ef4fba8>
In [24]:
sns.factorplot(kind='box', y='value', x='variable', hue='pop_density',
               data=pd.melt(marketing, id_vars=['pop_density'], value_vars=['google_adwords', 'facebook']), size=8, aspect=1.5, legend_out=False) 
Out[24]:
<seaborn.axisgrid.FacetGrid at 0x1a1f072550>

<Summary of the data analysis results>

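Bicycle rental businesses targeting markets with high population density are better off using facebook ads,

while those targeting markets with low population density are better off using google_adwords ads.

Note that, in the model with google_adwords and facebook as predictors, each additional 1 (that is, $1,000) of facebook spending is associated with about 0.20 more revenue, versus about 0.05 for google_adwords, so facebook ads are the more cost-effective of the two.

However, google_adwords is more strongly correlated with revenues (0.77 versus 0.58 for facebook), so investing in google_adwords offers more assurance that revenue will actually increase.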
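
The cost-effectiveness comparison can be made concrete with the coefficients from the model revenues ~ google_adwords + facebook. In the sketch below, the baseline budget (100 and 30) and the extra spend (10) are hypothetical values chosen for illustration, all in units of $1,000.

```python
# Coefficients copied from the summary of revenues ~ google_adwords + facebook
coef = {"Intercept": 29.6195, "google_adwords": 0.0485, "facebook": 0.1996}

def predict(google_adwords, facebook):
    """Predicted revenues for a given spend on each channel."""
    return (coef["Intercept"]
            + coef["google_adwords"] * google_adwords
            + coef["facebook"] * facebook)

base = predict(100, 30)  # hypothetical baseline budget
extra = 10               # hypothetical additional spend
gain_google = predict(100 + extra, 30) - base
gain_facebook = predict(100, 30 + extra) - base
print(round(gain_google, 3), round(gain_facebook, 3))  # 0.485 1.996
```

The same extra spend yields roughly four times the predicted gain on facebook. This holds only inside the observed range of the data (facebook spending in the sample never exceeds 62.17), and a regression coefficient describes an association, not a guaranteed causal return.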