Correlation coefficient guessing game: http://guessthecorrelation.com/

1. Linear Regression Analysis

Simple linear regression: $Y = \beta_0 + \beta_1 X$
Multiple linear regression: $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$

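A special case of the generalized linear model, and the first model to consider when the dependent variable is numeric.

The term usually refers to multiple linear regression; with a single independent variable it is called simple linear regression.

Examples: predicting a business's sales, predicting energy consumption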
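The coefficients in the equations above are usually estimated by least squares. The following is a minimal sketch on synthetic data (the data, seed, and variable names are assumptions for illustration, not part of the marketing dataset) that recovers $\beta_0$ and $\beta_1$ with NumPy.

```python
import numpy as np

# Synthetic data (hypothetical): y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, size=200)

# Design matrix: a column of ones for the intercept beta_0, then x for beta_1
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution minimizing ||y - X @ beta||^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
beta0_hat, beta1_hat = beta
print(beta0_hat, beta1_hat)
```

For multiple linear regression the only change is that `X` gets one extra column per additional predictor.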

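Standard assumptions of regression analysis

Linearity between the independent variables and the dependent variable
Homoscedasticity (constant variance) of the residuals
Independence of the residuals
Normality of the residuals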

https://rstudio-pubs-static.s3.amazonaws.com/190997_40fa09db8e344b19b14a687ea5de914b.html
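These assumptions can be checked roughly from the residuals of a fitted model. The sketch below uses synthetic data (an assumption for illustration) and plain NumPy to compute the same kind of diagnostics that appear in a statsmodels summary: the Durbin-Watson statistic for independence, skewness and kurtosis for normality, and a crude spread comparison for homoscedasticity. It is a quick sanity check, not a formal test.

```python
import numpy as np

# Synthetic data (hypothetical) that satisfies the assumptions by construction
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=300)
y = 1.0 + 0.5 * x + rng.normal(0, 1.0, size=300)

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# Independence: Durbin-Watson is close to 2 when consecutive residuals are uncorrelated
durbin_watson = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Normality: skewness near 0 and kurtosis near 3 for normally distributed residuals
z = (resid - resid.mean()) / resid.std()
skew = np.mean(z ** 3)
kurtosis = np.mean(z ** 4)

# Homoscedasticity: residual spread should be similar for low and high fitted values
median_fit = np.median(fitted)
spread_ratio = resid[fitted > median_fit].std() / resid[fitted <= median_fit].std()

print(durbin_watson, skew, kurtosis, spread_ratio)
```

Linearity is usually judged visually, by plotting `resid` against `fitted` and looking for a curved pattern.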

Explanatory power ($R^2$)

$R^2$ is the proportion of the total variation (TSS, Total Sum of Squares) that is explained by the model
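Written out, $R^2 = 1 - RSS/TSS$, where RSS is the residual sum of squares. A minimal sketch of that formula follows; the `r_squared` helper and the toy arrays are illustrative assumptions.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS for observed values y and predictions y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    tss = np.sum((y - y.mean()) ** 2)   # total variation around the mean
    rss = np.sum((y - y_hat) ** 2)      # variation left unexplained by the model
    return 1.0 - rss / tss

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r_squared(y, y)                            # predictions match exactly
baseline = r_squared(y, np.full_like(y, y.mean()))   # always predicting the mean
print(perfect, baseline)
```

A model that always predicts the mean explains none of the variation, which is why it serves as the zero point of the scale.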

# Standard library
import os
import math

# Data handling and numerics
import numpy as np
import pandas as pd

# Modeling
from sklearn import datasets, linear_model
import statsmodels.formula.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

# Plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
# corrplot and symmatplot are deprecated in favor of sns.heatmap and slated for removal from seaborn
from seaborn.linearmodels import corrplot,symmatplot

%matplotlib inline

os.chdir('/Users/jinseokryu/Desktop/ML강의자료/대구대학교/linear_regression/')

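Hands-on with data

Data description

Marketing data from 172 bicycle rental businesses

Variable description

google_adwords : spending on Google AdWords ads

facebook : spending on Facebook ads

twitter : spending on Twitter ads

marketing_total : total marketing budget

revenues : revenue

employees : number of employees

pop_density : population density level of the target market

Costs are in thousands of dollars (1 = 1,000 USD).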

marketing = pd.read_csv("./marketing.csv")

marketing = marketing.drop(['employees'],axis=1)

marketing.head()

marketing.describe()

sns.pairplot(marketing)


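# Correlations among the numeric columns only (pop_density is categorical)
marketing.select_dtypes(include='number').corr()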

# corrplot is deprecated in seaborn in favor of heatmap
sns.heatmap(marketing.select_dtypes(include='number').corr(), annot=True)

model = sm.ols(formula = 'revenues ~ marketing_total', data = marketing)
result = model.fit()
# Print the summary of the fit
print(result.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               revenues   R-squared:                       0.728
Model:                            OLS   Adj. R-squared:                  0.726
Method:                 Least Squares   F-statistic:                     454.2
Date:                Fri, 29 Jun 2018   Prob (F-statistic):           6.88e-50
Time:                        15:44:14   Log-Likelihood:                -435.09
No. Observations:                 172   AIC:                             874.2
Df Residuals:                     170   BIC:                             880.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          32.0067      0.636     50.357      0.000      30.752      33.261
marketing_total     0.0519      0.002     21.313      0.000       0.047       0.057
==============================================================================
Omnibus:                        1.845   Durbin-Watson:                   1.993
Prob(Omnibus):                  0.397   Jarque-Bera (JB):                1.550
Skew:                          -0.226   Prob(JB):                        0.461
Kurtosis:                       3.105   Cond. No.                         712.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
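Several entries in the summary above follow directly from others. As a rough check, the sketch below recomputes the adjusted $R^2$, AIC, and BIC from the printed values for `revenues ~ marketing_total`; the inputs are rounded as printed, so the results should agree with the table only up to rounding.

```python
import math

# Values copied from the printed summary of revenues ~ marketing_total
n = 172            # No. Observations
p = 1              # Df Model (predictors, excluding the intercept)
r2 = 0.728         # R-squared
log_lik = -435.09  # Log-Likelihood

k = p + 1  # estimated coefficients: intercept plus slopes

# Adjusted R^2 penalizes R^2 for the number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Information criteria: lower is better when comparing models on the same data
aic = 2 * k - 2 * log_lik
bic = k * math.log(n) - 2 * log_lik

print(round(adj_r2, 3), round(aic, 1), round(bic, 1))
```

Because AIC and BIC add a penalty per coefficient, they are handy for comparing the single-predictor models here with the multi-predictor models fitted later.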

x, y = marketing["marketing_total"],marketing["revenues"]

# or via jointplot (with histograms aside):
sns.jointplot(x=x, y=y, kind='scatter', joint_kws={'alpha':0.5})

plt.scatter(x,y)
# Fitted regression line
plt.plot(x,result.predict(marketing[["marketing_total"]]),'r')
plt.xlabel('marketing_total')
plt.ylabel('revenues')

model = sm.ols(formula = 'revenues ~ google_adwords', data = marketing)
result = model.fit()
# Print the summary of the fit
print(result.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               revenues   R-squared:                       0.587
Model:                            OLS   Adj. R-squared:                  0.585
Method:                 Least Squares   F-statistic:                     241.8
Date:                Fri, 29 Jun 2018   Prob (F-statistic):           1.75e-34
Time:                        15:44:15   Log-Likelihood:                -470.88
No. Observations:                 172   AIC:                             945.8
Df Residuals:                     170   BIC:                             952.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         35.9276      0.628     57.229      0.000      34.688      37.167
google_adwords     0.0511      0.003     15.548      0.000       0.045       0.058
==============================================================================
Omnibus:                        1.725   Durbin-Watson:                   1.958
Prob(Omnibus):                  0.422   Jarque-Bera (JB):                1.555
Skew:                          -0.112   Prob(JB):                        0.460
Kurtosis:                       2.592   Cond. No.                         418.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

model = sm.ols(formula = 'revenues ~ facebook', data = marketing)
result = model.fit()
# Print the summary of the fit
print(result.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               revenues   R-squared:                       0.334
Model:                            OLS   Adj. R-squared:                  0.330
Method:                 Least Squares   F-statistic:                     85.21
Date:                Fri, 29 Jun 2018   Prob (F-statistic):           1.05e-16
Time:                        15:44:15   Log-Likelihood:                -512.02
No. Observations:                 172   AIC:                             1028.
Df Residuals:                     170   BIC:                             1034.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     37.1319      0.888     41.800      0.000      35.378      38.885
facebook       0.2208      0.024      9.231      0.000       0.174       0.268
==============================================================================
Omnibus:                       12.339   Durbin-Watson:                   1.809
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               13.530
Skew:                          -0.687   Prob(JB):                      0.00115
Kurtosis:                       2.973   Cond. No.                         90.6
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

model = sm.ols(formula = 'revenues ~ twitter', data = marketing)
result = model.fit()
# Print the summary of the fit
print(result.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               revenues   R-squared:                       0.073
Model:                            OLS   Adj. R-squared:                  0.067
Method:                 Least Squares   F-statistic:                     13.33
Date:                Fri, 29 Jun 2018   Prob (F-statistic):           0.000347
Time:                        15:44:15   Log-Likelihood:                -540.46
No. Observations:                 172   AIC:                             1085.
Df Residuals:                     170   BIC:                             1091.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     41.8176      0.877     47.660      0.000      40.086      43.550
twitter        0.0717      0.020      3.652      0.000       0.033       0.110
==============================================================================
Omnibus:                        4.051   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.132   Jarque-Bera (JB):                3.473
Skew:                           0.256   Prob(JB):                        0.176
Kurtosis:                       2.528   Cond. No.                         91.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

model1 = sm.ols(formula = 'revenues ~ google_adwords + facebook', data = marketing)
result1 = model1.fit()
# Print the summary of the fit
print(result1.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               revenues   R-squared:                       0.858
Model:                            OLS   Adj. R-squared:                  0.857
Method:                 Least Squares   F-statistic:                     512.0
Date:                Fri, 29 Jun 2018   Prob (F-statistic):           1.90e-72
Time:                        15:44:16   Log-Likelihood:                -378.88
No. Observations:                 172   AIC:                             763.8
Df Residuals:                     169   BIC:                             773.2
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         29.6195      0.509     58.201      0.000      28.615      30.624
google_adwords     0.0485      0.002     25.014      0.000       0.045       0.052
facebook           0.1996      0.011     17.988      0.000       0.178       0.222
==============================================================================
Omnibus:                        5.844   Durbin-Watson:                   2.023
Prob(Omnibus):                  0.054   Jarque-Bera (JB):                5.718
Skew:                          -0.402   Prob(JB):                       0.0573
Kurtosis:                       2.611   Cond. No.                         584.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

from statsmodels.stats.outliers_influence import variance_inflation_factor

dfX0 = marketing[["google_adwords","facebook"]]
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(dfX0.values, i) for i in range(dfX0.shape[1])]
vif["features"] = dfX0.columns
vif
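One caveat on the VIF cell above: `variance_inflation_factor` is documented as regressing one column of `exog` on the remaining columns, so when no constant column is passed the auxiliary regressions run without an intercept, and predictors with large means can show inflated values even when they are nearly uncorrelated. The sketch below reimplements VIF as $1/(1-R^2)$ in plain NumPy on synthetic data to show the difference; the `vif` helper and the data are illustrative assumptions, not the marketing data or the statsmodels implementation.

```python
import numpy as np

def vif(X, j, add_intercept=True):
    """VIF of column j, i.e. 1 / (1 - R^2) from regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    if add_intercept:
        others = np.column_stack([np.ones(len(X)), others])
        tss = np.sum((target - target.mean()) ** 2)  # centered total sum of squares
    else:
        tss = np.sum(target ** 2)                    # uncentered total sum of squares
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    rss = np.sum((target - others @ coef) ** 2)
    return tss / rss  # equals 1 / (1 - R^2)

# Two independent, strictly positive predictors (hypothetical spending-like columns)
rng = np.random.default_rng(2)
X = np.column_stack([rng.uniform(20, 320, 500), rng.uniform(8, 62, 500)])

vif_centered = vif(X, 0, add_intercept=True)
vif_uncentered = vif(X, 0, add_intercept=False)
print(vif_centered, vif_uncentered)
```

If that reading of the function is right, adding a constant column to the design matrix before computing VIFs (and ignoring the constant's own VIF) gives values that reflect only the correlation among predictors.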

model = sm.ols(formula = 'revenues ~ google_adwords + facebook + twitter', data = marketing)
result = model.fit()
# Print the summary of the fit
print(result.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               revenues   R-squared:                       0.859
Model:                            OLS   Adj. R-squared:                  0.856
Method:                 Least Squares   F-statistic:                     339.8
Date:                Fri, 29 Jun 2018   Prob (F-statistic):           4.36e-71
Time:                        15:44:16   Log-Likelihood:                -378.77
No. Observations:                 172   AIC:                             765.5
Df Residuals:                     168   BIC:                             778.1
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         29.5460      0.534     55.379      0.000      28.493      30.599
google_adwords     0.0484      0.002     24.846      0.000       0.045       0.052
facebook           0.1977      0.012     16.650      0.000       0.174       0.221
twitter            0.0039      0.008      0.470      0.639      -0.012       0.020
==============================================================================
Omnibus:                        5.684   Durbin-Watson:                   2.015
Prob(Omnibus):                  0.058   Jarque-Bera (JB):                5.471
Skew:                          -0.386   Prob(JB):                       0.0649
Kurtosis:                       2.592   Cond. No.                         622.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

from statsmodels.stats.outliers_influence import variance_inflation_factor

dfX0 = marketing[["google_adwords","facebook","twitter"]]
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(dfX0.values, i) for i in range(dfX0.shape[1])]
vif["features"] = dfX0.columns
vif

train = marketing[:int(len(marketing)*0.7)]
test = marketing[int(len(marketing)*0.7):]

y_fit = result1.predict(test[['google_adwords','facebook']])
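Note that `result1` was fitted on the full dataset, so predictions on `test` are not truly out of sample. A minimal sketch of a held-out evaluation follows, using synthetic data shaped like the marketing columns (the data, coefficients, and seed are assumptions): shuffle, fit on 70% only, then measure RMSE on the remaining 30%.

```python
import numpy as np

# Synthetic stand-in for the marketing data (hypothetical values)
rng = np.random.default_rng(3)
n = 172
google = rng.uniform(20, 320, n)
facebook = rng.uniform(8, 62, n)
revenues = 30 + 0.05 * google + 0.2 * facebook + rng.normal(0, 2, n)

X = np.column_stack([np.ones(n), google, facebook])

# Shuffle before splitting so the split does not depend on row order
idx = rng.permutation(n)
cut = int(n * 0.7)
train_idx, test_idx = idx[:cut], idx[cut:]

# Fit on the training rows only
beta, *_ = np.linalg.lstsq(X[train_idx], revenues[train_idx], rcond=None)

# Evaluate on rows the model has not seen
pred = X[test_idx] @ beta
rmse = np.sqrt(np.mean((revenues[test_idx] - pred) ** 2))
print(rmse)
```

With the statsmodels formula API used in this notebook, the equivalent would be to pass `data = train` when building the model and then call `predict` on `test`.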

# Grid over the observed range of both predictors
x_surf, y_surf = np.meshgrid(
    np.linspace(marketing.google_adwords.min(), marketing.google_adwords.max(), 100),
    np.linspace(marketing.facebook.min(), marketing.facebook.max(), 100))
onlyX = pd.DataFrame({'google_adwords': x_surf.ravel(), 'facebook': y_surf.ravel()})
fittedY = result1.predict(exog=onlyX)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# Fitted regression plane; .values avoids the deprecated Series.reshape
ax.plot_surface(x_surf, y_surf, fittedY.values.reshape(x_surf.shape), color='r', alpha=0.7)
# Training points in blue, test points in green
ax.scatter(train['google_adwords'], train['facebook'], train['revenues'], c='blue', marker='o', alpha=0.5)
ax.scatter(test['google_adwords'], test['facebook'], test['revenues'], c='green', marker='o', alpha=0.5)
ax.set_xlabel('google_adwords')
ax.set_ylabel('facebook')
ax.set_zlabel('revenues')
ax.view_init(azim=50)

# `size` was renamed to `height` in seaborn 0.9
sns.pairplot(marketing, hue='pop_density', height=3)

# `factorplot` was renamed to `catplot` (and `size` to `height`) in seaborn 0.9
sns.catplot(kind='box', y='value', x='variable', hue='pop_density',
            data=pd.melt(marketing, id_vars=['pop_density'], value_vars=['google_adwords', 'facebook']),
            height=8, aspect=1.5, legend_out=False)

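<Summary of the analysis results>

For bicycle rental businesses targeting high-population-density markets, facebook ads are the better choice,

and for businesses targeting low-population-density markets, google_adwords ads are the better choice.

In the revenues ~ google_adwords + facebook model, each additional unit (1 = 1,000 USD) of facebook spending is associated with an increase of about 0.20 in revenues, versus about 0.05 for google_adwords, so facebook ads are the more cost-effective of the two.

However, google_adwords is more strongly correlated with revenues (0.77, versus 0.58 for facebook), so investing in google_adwords is the more dependable way to raise revenues.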
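To make the per-unit effects concrete, the sketch below plugs the coefficients printed for `revenues ~ google_adwords + facebook` into the fitted equation and applies it to the first row of the data preview. The `predict_revenue` helper is an illustrative assumption; the coefficients are rounded as printed in the summary.

```python
# Coefficients as printed in the summary for revenues ~ google_adwords + facebook
INTERCEPT = 29.6195
B_GOOGLE = 0.0485
B_FACEBOOK = 0.1996

def predict_revenue(google_adwords, facebook):
    """Fitted equation: intercept plus coefficient times spending for each channel."""
    return INTERCEPT + B_GOOGLE * google_adwords + B_FACEBOOK * facebook

# First row of the data preview: google_adwords=65.66, facebook=47.86, revenues=39.26
pred = predict_revenue(65.66, 47.86)
residual = 39.26 - pred

# One extra unit of spending on a channel changes the prediction by that channel's coefficient
gain_facebook = predict_revenue(65.66, 48.86) - pred
gain_google = predict_revenue(66.66, 47.86) - pred
print(pred, residual, gain_facebook, gain_google)
```

The two gains are simply the coefficients, which is what the cost-effectiveness comparison in the summary rests on.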

	google_adwords	facebook	twitter	marketing_total	revenues	pop_density
0	65.66	47.86	52.46	165.98	39.26	High
1	39.10	55.20	77.40	171.70	38.90	Medium
2	174.81	52.01	68.01	294.83	49.51	Medium
3	34.36	61.96	86.86	183.18	40.56	High
4	78.21	40.91	30.41	149.53	40.21	Low

	google_adwords	facebook	twitter	marketing_total	revenues
count	172.000000	172.000000	172.000000	172.000000	172.000000
mean	169.868488	33.869651	38.982442	242.720581	44.610930
std	87.472279	15.270010	21.962255	95.859483	5.835498
min	23.650000	8.000000	5.890000	53.650000	30.450000
25%	97.247500	19.367500	20.937500	158.415000	40.327500
50%	169.475000	33.655000	34.595000	245.565000	43.995000
75%	243.105000	47.805000	52.937500	322.615000	48.612500
max	321.000000	62.170000	122.190000	481.000000	58.380000

	google_adwords	facebook	twitter	marketing_total	revenues
google_adwords	1.000000	0.076432	0.098975	0.947357	0.766246
facebook	0.076432	1.000000	0.354341	0.310223	0.577821
twitter	0.098975	0.354341	1.000000	0.375869	0.269685
marketing_total	0.947357	0.310223	0.375869	1.000000	0.853035
revenues	0.766246	0.577821	0.269685	0.853035	1.000000

	VIF Factor	features
0	3.142389	google_adwords
1	3.142389	facebook

	VIF Factor	features
0	3.436268	google_adwords
1	5.009430	facebook
2	4.384159	twitter