[Machine Learning and Deep Learning] 7. Logistic Regression

ITselfhiam

|2024. 1. 2. 14:36

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

hr_df = pd.read_csv('/content/drive/MyDrive/KDT v2/머신러닝과 딥러닝/ data/hr.csv')
hr_df.head()

hr_df.info()

# Result => 

# 0   employee_id           54808 non-null  int64  
# employee_id: arbitrary employee ID

# 1   department            54808 non-null  object 
# department: department

# 2   region                54808 non-null  object 
# region: region

# 3   education             52399 non-null  object 
# education: education level

# 4   gender                54808 non-null  object 
# gender: gender

# 5   recruitment_channel   54808 non-null  object 
# recruitment_channel: how the employee was recruited

# 6   no_of_trainings       54808 non-null  int64  
# no_of_trainings: number of trainings received

# 7   age                   54808 non-null  int64  
# age: age

# 8   previous_year_rating  50684 non-null  float64
# previous_year_rating: previous year's performance rating

# 9   length_of_service     54808 non-null  int64  
# length_of_service: length of service in years

# 10  awards_won            54808 non-null  int64  
# awards_won: awards won

# 11  avg_training_score    54808 non-null  int64  
# avg_training_score: average training score

# 12  is_promoted           54808 non-null  int64
# is_promoted: whether the employee was promoted (the target)

hr_df.describe()

sns.barplot(x='previous_year_rating', y='is_promoted', data=hr_df)

sns.lineplot(x='previous_year_rating', y='is_promoted', data=hr_df)

sns.lineplot(x='avg_training_score', y='is_promoted', data=hr_df)

sns.barplot(x='recruitment_channel', y='is_promoted', data=hr_df)
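# sourcing : hired through outsourcing or partner companies
# referred : hired through a referral
# At first glance, referred hires appear to have a higher promotion rate, but the plot alone does not support that conclusion.
# The reason is in the value counts below.
# The longer the error bar (the black line on top of each bar), the larger the uncertainty of the estimate, which typically happens when a category has few samples.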

hr_df["recruitment_channel"].value_counts()

sns.barplot(x='gender', y='is_promoted', data=hr_df)

hr_df["gender"].value_counts()

sns.barplot(x='department', y='is_promoted', data=hr_df)
plt.xticks(rotation=45)

hr_df["department"].value_counts()

plt.figure(figsize=(14, 10))
sns.barplot(x='region', y='is_promoted', data=hr_df)
plt.xticks(rotation=45)

hr_df.isna().mean()

hr_df['education'].value_counts()
# Bachelor's : bachelor's degree
# Master's : master's degree
# Below Secondary : below secondary-school level

hr_df['previous_year_rating'].value_counts()

# Drop rows that contain NaN values
hr_df = hr_df.dropna()

for i in ['department', 'region', 'education', 'gender', 'recruitment_channel'] :
    print(i, hr_df[i].nunique())

# One-hot encoding
hr_df = pd.get_dummies(hr_df, columns = ['department', 'region', 'education', 'gender', 'recruitment_channel'])
hr_df.head()

2. Logistic Regression

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regression#sklearn.linear_model.LogisticRegression


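1. A representative algorithm for problems that choose one of two outcomes (binary classification)
2. Builds a linear equation from a linear combination of the input data and the weights -> converts the result of the linear equation into a probability between 0 and 1 (sigmoid function)
3. Can also discriminate among three or more classes
	3-1. OvR(One-vs-Rest)
    	Builds one binary classifier per class; each one separates its class from all the remaining classes -> the class with the highest probability is selected
        
	3-2. OvO(One-vs-One)
    	With N classes, builds N(N-1)/2 binary classifiers -> each binary classifier separates only one pair of classes -> the input is passed to every binary classifier and the class chosen most often is the final prediction
        
OvR is preferred in most cases. Consider OvO when the classes are not clearly separable or the data is skewed toward one side.
OvO trains many more classifiers as N grows, so it uses more memory and runs slower.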
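To make the difference in cost concrete, the following is a minimal sketch that only counts the binary classifiers each strategy would need. It does not train anything; the helper name count_binary_classifiers and the class labels are hypothetical, chosen for illustration.

```python
from itertools import combinations


def count_binary_classifiers(classes):
    """Return (number of OvR classifiers, list of OvO class pairs)."""
    ovr_count = len(classes)                     # OvR: one classifier per class
    ovo_pairs = list(combinations(classes, 2))   # OvO: one classifier per pair
    return ovr_count, ovo_pairs


classes = ["A", "B", "C", "D"]  # hypothetical class labels
ovr_count, ovo_pairs = count_binary_classifiers(classes)

print(ovr_count)        # 4
print(len(ovo_pairs))   # 6, which equals 4 * 3 / 2
print(ovo_pairs)
```

With 10 classes the same function gives 10 classifiers for OvR and 45 for OvO, which is why OvO gets expensive as the number of classes grows.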

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(hr_df.drop('is_promoted', axis=1),hr_df['is_promoted'],test_size=0.2, random_state=10)

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_test)

from sklearn.metrics import accuracy_score, confusion_matrix
accuracy_score(y_test, pred)
# Result => 0.9114262227702425

hr_df['is_promoted'].value_counts()

confusion_matrix(y_test, pred)

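3. Confusion Matrix

A table of prediction outcomes from which precision and recall (sensitivity) are computed

TN(8924) FP(0)

FN(808) TP(0)
TN : not promoted, and predicted as not promoted
FP : not promoted, but predicted as promoted
FN : promoted, but predicted as not promoted
TP : promoted, and predicted as promoted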
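The four cells can be counted by hand from the actual and predicted labels. The sketch below is a plain-Python illustration, not scikit-learn's implementation; the helper name confusion_counts and the two label lists are made up for the example, and labels are assumed to be 0 or 1 with 1 as the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels (assumes 0/1 labels, 1 = positive)."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0 and predicted == 0:
            tn += 1   # actual negative, predicted negative
        elif actual == 0 and predicted == 1:
            fp += 1   # actual negative, predicted positive
        elif actual == 1 and predicted == 0:
            fn += 1   # actual positive, predicted negative
        else:
            tp += 1   # actual positive, predicted positive
    return tn, fp, fn, tp


y_true = [0, 0, 0, 1, 1, 0, 1, 0]  # hypothetical actual labels
y_pred = [0, 1, 0, 1, 0, 0, 1, 0]  # hypothetical predictions

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
# Same layout as the matrix above: [[TN, FP], [FN, TP]]
print([[tn, fp], [fn, tp]])  # [[4, 1], [1, 2]]
```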

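3-1. Precision

TP / (TP + FP)
Computed over the samples the model predicted as positive
Of everything predicted as 1, how much was actually 1?

3-2. Recall

TP / (TP + FN)
The proportion of actual positive samples that were correctly detected
Of everything that is actually 1, how much did the model predict as 1?
Also called sensitivity or TPR (True Positive Rate)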
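As a small worked example, the two formulas can be applied directly to the four counts. The counts below are hypothetical and are not taken from the hr data; returning 0.0 when a denominator is zero is an assumption of this sketch, not a rule stated in the post.

```python
def precision_from_counts(tp, fp):
    """TP / (TP + FP); returns 0.0 when nothing was predicted positive (sketch convention)."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall_from_counts(tp, fn):
    """TP / (TP + FN); returns 0.0 when there are no actual positives (sketch convention)."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Hypothetical counts for illustration only
tn, fp, fn, tp = 940, 10, 20, 30

precision = precision_from_counts(tp, fp)    # 30 / 40
recall = recall_from_counts(tp, fn)          # 30 / 50
accuracy = (tp + tn) / (tn + fp + fn + tp)   # 970 / 1000

print(precision, recall, accuracy)  # 0.75 0.6 0.97
```

Here accuracy is 0.97 even though the model finds only 60% of the actual positives, which is the kind of gap precision and recall are meant to expose.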

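3-3. f1 score

An indicator that expresses the harmonic mean of precision and recall
2 * (Precision * Recall) / (Precision + Recall)

Precision   Recall   Arithmetic mean   Harmonic mean
0.4         0.6      0.5               0.48
0.3         0.7      0.5               0.42
0.5         0.5      0.5               0.5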
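The rows of the table above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the helper name harmonic_mean is introduced here for illustration.

```python
def harmonic_mean(precision, recall):
    """Harmonic mean of two values, i.e. the f1 score formula."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero (sketch convention)
    return 2 * precision * recall / (precision + recall)


pairs = [(0.4, 0.6), (0.3, 0.7), (0.5, 0.5)]  # (precision, recall) rows from the table

for p, r in pairs:
    arithmetic = (p + r) / 2
    harmonic = harmonic_mean(p, r)
    print(p, r, round(arithmetic, 2), round(harmonic, 2))
```

The arithmetic mean is 0.5 in every row, while the harmonic mean drops as precision and recall move apart, so the f1 score rewards models that keep the two balanced.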

from sklearn.metrics import precision_score, recall_score, f1_score
precision_score(y_test, pred)
# Result => 1.0

recall_score(y_test, pred)
# Result => 0.0011587485515643105

f1_score(y_test, pred)
# Result => 0.0023148148148148147

# Independent variables (features)
TempX = hr_df[['avg_training_score', 'previous_year_rating']]
# Dependent variable (target)
tempY = hr_df['is_promoted']
temp_lr = LogisticRegression()
temp_lr.fit(TempX, tempY)

temp_df = pd.DataFrame({'avg_training_score':[60, 80, 100], 'previous_year_rating':[5.0, 4.5, 5.0]})
temp_df

pred = temp_lr.predict(temp_df)
pred
# Result => array([0, 0, 0])

temp_lr.coef_
# Result => array([[0.04565839, 0.51245263]])

temp_lr.intercept_
# Result => array([-7.28583474])

# Probabilities: column 0 = P(not promoted), column 1 = P(promoted)
proba = temp_lr.predict_proba(temp_df)
proba

# Result => 
# array([[0.87911419, 0.12088581],
#        [0.790365  , 0.209635  ],
#        [0.53935167, 0.46064833]])

# Keep only the probability of promotion (class 1)
proba = temp_lr.predict_proba(temp_df)[:,1]
proba
# Result => array([0.12088581, 0.209635  , 0.46064833])
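These probabilities can be reproduced by hand, which shows how the linear equation and the sigmoid function fit together. The sketch below uses only the standard library; the coefficient and intercept values are copied from the coef_ and intercept_ outputs above, the rows are the same three rows as temp_df, and the helper name sigmoid is introduced here.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Values copied from temp_lr.coef_ and temp_lr.intercept_
coef = [0.04565839, 0.51245263]  # avg_training_score, previous_year_rating
intercept = -7.28583474

# Same rows as temp_df: (avg_training_score, previous_year_rating)
rows = [(60, 5.0), (80, 4.5), (100, 5.0)]

manual_proba = []
for score, rating in rows:
    z = coef[0] * score + coef[1] * rating + intercept  # linear combination
    manual_proba.append(sigmoid(z))                     # probability of class 1

print([round(p, 4) for p in manual_proba])  # [0.1209, 0.2096, 0.4606]
```

All three values stay below 0.5, which is why predict returned 0 for every row.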

# Change the decision threshold
threshold = 0.4
pred = (proba > threshold).astype(int)
pred
# Result => array([0, 0, 1])

# A lower threshold classifies more samples as positive, so recall increases and precision tends to decrease
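The trade-off can be seen on a toy example. The labels and probabilities below are invented for illustration and have nothing to do with the hr data; the helper name precision_recall_at is hypothetical, and returning 0.0 for an empty denominator is a convention of this sketch.

```python
def precision_recall_at(y_true, proba, threshold):
    """Return (precision, recall) after thresholding the predicted probabilities."""
    pred = [1 if p > threshold else 0 for p in proba]
    tp = sum(1 for a, p in zip(y_true, pred) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(y_true, pred) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(y_true, pred) if a == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical labels and predicted probabilities of class 1
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
proba = [0.05, 0.10, 0.20, 0.35, 0.45, 0.55, 0.60, 0.90]

results = {}
for threshold in (0.7, 0.5, 0.3):
    results[threshold] = precision_recall_at(y_true, proba, threshold)
    precision, recall = results[threshold]
    print(threshold, round(precision, 3), round(recall, 3))
```

On this toy data, lowering the threshold from 0.7 to 0.3 raises recall from about 0.333 to 1.0 while precision falls from 1.0 to 0.6.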

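4. Cross Validation

A technique for addressing the problem that the measured performance depends on how train_test_split happens to shuffle and split the data
K-Fold cross validation is the most commonly used form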

from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
kf
# Result => KFold(n_splits=5, random_state=None, shuffle=False)

hr_df

for train_index, test_index in kf.split(range(len(hr_df))) :
    print(train_index, test_index)
    print(len(train_index), len(test_index))

kf = KFold(n_splits=5, random_state=2023, shuffle=True)
kf
# Result => KFold(n_splits=5, random_state=2023, shuffle=True)

for train_index, test_index in kf.split(range(len(hr_df))) :
    print(train_index, test_index)
    print(len(train_index), len(test_index))

acc_lists = []

for train_index, test_index in kf.split(range(len(hr_df))) :
    X = hr_df.drop('is_promoted', axis=1)  # independent variables (features)
    y = hr_df['is_promoted']  # dependent variable (target)

    X_train = X.iloc[train_index]
    X_test = X.iloc[test_index]
    y_train = y.iloc[train_index]
    y_test = y.iloc[test_index]

    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    pred = lr.predict(X_test)
    acc_lists.append(accuracy_score(y_test, pred))
    
acc_lists

# Mean accuracy of about 91.3%
np.array(acc_lists).mean()
# Result => 0.9130291820797372
# Cross validation is used not to make the result better, but to get an evaluation that can be trusted
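To see what an unshuffled K-Fold split does conceptually, here is a pure-Python sketch that produces contiguous folds. It is an illustration, not scikit-learn's implementation; the helper name kfold_indices is hypothetical, and giving the first folds one extra sample when the data does not divide evenly is the convention assumed here.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)  # first `extra` folds get one more sample
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


folds = list(kfold_indices(10, 5))
for train_idx, test_idx in folds:
    print(train_idx, test_idx)
```

Every sample lands in the test set exactly once, so averaging the five scores uses all of the data for evaluation instead of relying on a single random split.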
