
project:eve

23.01.31 KNN (K-Nearest Neighbors)

Daily

eveee 2023. 1. 31. 00:38

Studying here 🏠 Terarosa, Gwanghwamun branch

 

 

์˜ค๋Š˜์€ K-nearest neighbor(K-์ตœ๊ทผ์ ‘ ์ด์›ƒ)์„ ๊ณต๋ถ€ํ•˜๋ ค๊ณ  ํ•œ๋‹ค. ์ง์ „์— SVM ๊ณต๋ถ€ํ•  ๋•Œ ๋„ˆ๋ฌด ์–ด๋ ค์› ์–ด์„œ ์ด๋ฒˆ์—๋Š” ์ข€ ๋‚˜์•„์กŒ๊ธธ ๋ฐ”๋ž€๋‹คใ… 

 

 


 

1) Definition

KNN stores the training data as it is, and when it has to decide the class of a validation data point, it goes by the classes of the k training points closest to that point. It is said to predict well on nonlinear data.

The predicted class can change depending on the value of k you pass in.

 

2) Types

 

1-Classification: classification works the way the usual KNN scatter plot shows. From one point on the plot, look at the classes of the k data points at the 'nearest distance' and predict the class of that point from them.

 

*) How to measure the nearest distance

 - When the independent variables are categorical: use the Hamming distance

 - When the independent variables are continuous: use the Euclidean or Manhattan distance.

(Left) Euclidean distance, (right) Manhattan distance
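
The three distances can be sketched in a few lines of NumPy. The helper names and sample vectors below are illustrative only. The Hamming version here counts the positions that differ; some libraries report the fraction of differing positions instead.

```python
import numpy as np

def euclidean(a, b):
    # straight-line distance: square root of the summed squared differences
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    # city-block distance: sum of the absolute differences
    return float(np.sum(np.abs(a - b)))

def hamming(a, b):
    # number of positions where the categorical values differ
    return int(np.sum(a != b))

p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0

# hypothetical categorical records: only the second attribute differs
c1 = np.array(['red', 'small', 'round'])
c2 = np.array(['red', 'large', 'round'])
print(hamming(c1, c2))  # 1
```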

 

*) Commonly used parameters

n_neighbors : the number of neighbors
weights : the weight given to each neighbor. uniform counts every neighbor equally regardless of distance, distance weights each neighbor according to its distance
metric & p : the distance formula and its exponent. The defaults are minkowski and 2; the Minkowski distance with p=2 is the same as the Euclidean distance, so the default is effectively the Euclidean distance.
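
A small sketch of how weights can change the answer, using made-up one-dimensional data. With uniform, the two far neighbors of class 1 outvote the single near neighbor of class 0. With distance, each vote is weighted by the inverse of its distance, so the near neighbor wins. metric and p are passed explicitly only to show the defaults.

```python
from sklearn.neighbors import KNeighborsClassifier

# toy data (hypothetical): class 0 sits at x=0, class 1 at x=3 and x=4
X = [[0.0], [3.0], [4.0]]
y = [0, 1, 1]
query = [[1.0]]  # distances to the three points: 1, 2, 3

uniform_clf = KNeighborsClassifier(n_neighbors=3, weights='uniform',
                                   metric='minkowski', p=2).fit(X, y)
distance_clf = KNeighborsClassifier(n_neighbors=3, weights='distance',
                                    metric='minkowski', p=2).fit(X, y)

# uniform: one vote for class 0, two votes for class 1
uniform_pred = int(uniform_clf.predict(query)[0])
# distance: class 0 gets weight 1/1, class 1 gets 1/2 + 1/3
distance_pred = int(distance_clf.predict(query)[0])
print(uniform_pred, distance_pred)
```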

 

 

2-Regression: finds the k training points closest to the independent variables (x) of the validation point and predicts a value from them (by default the mean of their target values) rather than a class. Unlike linear regression, there is no fixed regression equation and there are no regression coefficients!

With k=3, the three nearest data points are selected like this
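
A minimal sketch of that idea, assuming a single feature and an unweighted mean over the neighbors. The function name knn_regress and the numbers are made up for illustration.

```python
import numpy as np

def knn_regress(train_x, train_y, query, k):
    """Predict the mean target of the k training points closest to query."""
    distances = np.abs(train_x - query)   # distance along the single feature
    nearest = np.argsort(distances)[:k]   # indices of the k closest points
    return float(np.mean(train_y[nearest]))

# toy data (hypothetical)
train_x = np.array([1.0, 2.0, 3.0, 10.0])
train_y = np.array([10.0, 20.0, 30.0, 100.0])

# the 3 nearest points to x=2.2 are x=2, x=3 and x=1, so the prediction
# is the mean of their targets 20, 30 and 10
pred = knn_regress(train_x, train_y, query=2.2, k=3)
print(pred)  # 20.0
```

The distant point at x=10 never enters the average, which is why no global equation or coefficient is needed.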

 

 

 

 

3) Code

 

1-Classification: I took a classification dataset on Indian liver patients and analyzed it. The dependent variable is Dataset.

(Data source: https://www.kaggle.com/datasets/uciml/indian-liver-patient-records)

 

First, let's check whether the data has any missing values.

import pandas as pd
import numpy as np

data = pd.read_csv('/Users/eve/Downloads/jupyter notebook/files/indian_liver_patient.csv')

data.info()  # info() prints its summary itself and returns None, so no print() is needed
print(data.head(5))

You can see that Albumin_and_Globulin_Ratio (the ratio of albumin to globulin) is missing 4 values.

 

That is a small share of the whole dataset, so dropping those rows would also work, but I replaced them with the mean of the remaining values.

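# mean of the non-missing values, rounded to 2 decimals
col_mean = round(data['Albumin_and_Globulin_Ratio'].mean(), 2)

# another way (select rows and column in a single .loc call so the assignment hits the original frame)
#data.loc[data['Albumin_and_Globulin_Ratio'].isna(), 'Albumin_and_Globulin_Ratio'] = col_mean


data['Albumin_and_Globulin_Ratio']=data['Albumin_and_Globulin_Ratio'].fillna(col_mean)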

 

Also, among the independent variables, Gender has dtype object, so it needs to be encoded as numbers.

(* This is because calling fit with string data raises an error. I searched Google for a long time for why the error happens and found nothing. Is nobody else curious..? Or is it something so basic that I just can't find it..??)

Anyway, for a KNN classifier the independent variables all have to be numeric. The dependent variable can be strings, though, so keep that in mind.

 

 

 

- could not convert string to float: 'Male'
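
A likely reason for this error: KNN has to compute numeric distances between rows, and scikit-learn estimators generally convert the feature table to a float array before fitting. The string 'Male' has no float value, so the conversion fails. The sketch below reproduces the same message with NumPy alone, as an illustration of that conversion step rather than scikit-learn's exact internal code path.

```python
import numpy as np

# converting a column of strings to floats fails on the first value
try:
    np.asarray(['Male', 'Female'], dtype=float)
    message = None
except ValueError as err:
    message = str(err)

print(message)

# after mapping the categories to numbers, the same conversion works
gender = np.array(['Male', 'Female'])
encoded = np.where(gender == 'Female', 1, 0)
as_float = np.asarray(encoded, dtype=float)
print(as_float)
```

The target column is handled differently: classifiers treat y as a set of labels rather than as coordinates, which is why a string dependent variable is fine.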

⬇️ So convert Gender to numbers, like this

data['Gender'] = np.where(data['Gender']=='Female', 1, 0)

 

Or let's one-hot encode with a library. It would also be good to compare this against the np.where data in the analysis that follows.

# one-hot encode with OneHotEncoder (ohe)

from sklearn.preprocessing import OneHotEncoder

# fit_transform returns a sparse matrix by default; .toarray() makes it dense
# (this avoids the sparse / sparse_output keyword, whose name depends on the scikit-learn version)
ohe = OneHotEncoder()

data_cat = ohe.fit_transform(data[['Gender']]).toarray()

# one new column per category; the f-string also works if Gender was already converted to 0/1
cat_Gender = pd.DataFrame(data_cat, columns=[f'cat_{cat}' for cat in ohe.categories_[0]], index=data.index)

data_concated = pd.concat([data, cat_Gender], axis=1).drop(columns=['Gender'])

data_concated

The new columns were added after Dataset, which used to be the last column, and the original Gender column was dropped.

 

I prepared two versions of the feature data. X is the data where the string column was converted with np.where, and X_2 is the data converted with ohe.

from sklearn.model_selection import train_test_split

X = data.drop(columns=['Dataset'])
y = data['Dataset']

X_2 = data_concated.drop(columns=['Dataset'])



train_x, test_x, train_y, test_y = train_test_split(X, y, train_size=0.7, random_state=1,
                                                   stratify=y)

train_x_ohe, test_x_ohe, train_y_ohe, test_y_ohe = train_test_split(X_2, y, train_size=0.7, random_state=1,
                                                   stratify=y)

 

๋ฐ์ดํ„ฐ ํ›ˆ๋ จ๊ณผ ์˜ˆ์ธก๋„ ๊ฐ๊ฐ. n_neighbors ๊ฐ’์€ 3์œผ๋กœ ์„ค์ •ํ–ˆ๋‹ค.

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=3)

clf.fit(train_x, train_y)

pred = clf.predict(test_x)


clf = KNeighborsClassifier(n_neighbors=3)

clf.fit(train_x_ohe, train_y_ohe)

pred_ohe = clf.predict(test_x_ohe)

 

Then the performance evaluation. It would have been better to write a for loop than to grind through each metric by hand like this..

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

# scores for the np.where version (every score rounded to 3 decimals for consistency)
liver_cm = confusion_matrix(test_y, pred)
liver_acc = round(accuracy_score(test_y, pred), 3)
liver_prc = round(precision_score(test_y, pred), 3)
liver_rc = round(recall_score(test_y, pred), 3)
liver_f1 = round(f1_score(test_y, pred), 3)

# scores for the ohe version (test_y and test_y_ohe hold the same labels)
liver_cm_ohe = confusion_matrix(test_y, pred_ohe)
liver_acc_ohe = round(accuracy_score(test_y, pred_ohe), 3)
liver_prc_ohe = round(precision_score(test_y, pred_ohe), 3)
liver_rc_ohe = round(recall_score(test_y, pred_ohe), 3)
liver_f1_ohe = round(f1_score(test_y, pred_ohe), 3)

print('confusion matrix : ')
print(liver_cm, liver_cm_ohe, '\n')
print('acc score : ', liver_acc, liver_acc_ohe)
print('prc score : ', liver_prc, liver_prc_ohe)
print('rc score : ', liver_rc, liver_rc_ohe)
print('f1 score : ', liver_f1, liver_f1_ohe)
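
For reference, the same evaluation could be written as a loop over a dictionary of metric functions. This is only a sketch: the labels and predictions below are made-up stand-ins for test_y, pred and pred_ohe, and the dictionary names are hypothetical.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# metric name -> scoring function
metrics = {
    'acc': accuracy_score,
    'prc': precision_score,
    'rc': recall_score,
    'f1': f1_score,
}

# toy stand-ins (hypothetical) using the same 1/2 labels as the Dataset column
y_true = [1, 1, 2, 2, 1, 2]
predictions = {
    'np.where': [1, 1, 2, 1, 1, 2],
    'ohe': [1, 2, 2, 2, 1, 2],
}

# compute every metric for every set of predictions
scores = {
    name: {metric: round(fn(y_true, y_pred), 3) for metric, fn in metrics.items()}
    for name, y_pred in predictions.items()
}

for name, result in scores.items():
    print(name, result)
```

Adding another model or metric then only means adding one dictionary entry.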

 

After all that work I computed both sets of scores and compared them, and they are identical.. If so, np.where is much simpler, so why use ohe at all? There must be some advantage, but I'll look into that later. (It may pay off for categorical variables with more than two categories.)
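
One commonly cited difference shows up once a variable has more than two categories. Integer codes put the categories on a number line, so some pairs end up "farther apart" than others, while one-hot columns keep every pair at the same distance. The sketch below uses a made-up color column to show this; for a two-category column like Gender both encodings carry the same information, which fits the identical scores above.

```python
import numpy as np
import pandas as pd

# hypothetical three-category column
colors = pd.Series(['blue', 'green', 'red'], name='color')

# integer coding: blue=0, green=1, red=2
codes = colors.astype('category').cat.codes.to_numpy().astype(float)

# one-hot coding: one 0/1 column per category
onehot = pd.get_dummies(colors).to_numpy(dtype=float)

def code_dist(i, j):
    # distance between rows i and j under integer coding
    return float(abs(codes[i] - codes[j]))

def onehot_dist(i, j):
    # Euclidean distance between rows i and j under one-hot coding
    return float(np.linalg.norm(onehot[i] - onehot[j]))

# integer coding makes blue-red twice as far as blue-green
print(code_dist(0, 1), code_dist(0, 2))
# one-hot coding keeps every pair equally far apart
print(onehot_dist(0, 1), onehot_dist(0, 2), onehot_dist(1, 2))
```

Since KNN decides everything by distance, the artificial ordering from integer codes can change which neighbors get picked.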

 

 

2-Regression

 

This time let's do regression with KNN. The data is an HR dataset. It holds all kinds of information about employees, from names to salary, marital status, department, days off, project experience and so on, and I'm going to set salary as the dependent variable and run a regression.

hr = pd.read_csv('/Users/eve/Downloads/jupyter notebook/files/regression/HRdataset.csv')

hr.info()

๋…๋ฆฝ๋ณ€์ˆ˜๋ฅผ ์–ด๋А ์ •๋„ ํ•„ํ„ฐ๋งํ•ด์„œ ๋ฐ์ดํ„ฐ ์“ฐ๊ธฐ

 

๋…๋ฆฝ๋ณ€์ˆ˜๋Š” ์–ด๋А ์ •๋„ ๊ด€๋ จ์ด ์žˆ์–ด ๋ณด์ด๋Š” ๊ฒƒ์„ ๊ฐ€์ ธ์™”๋‹ค. (๋ฒ•์ )๊ฒฐํ˜ผ ์—ฌ๋ถ€, (์‚ฌ์‹ค)ํ˜ผ์ธ ์—ฌ๋ถ€, ์„ฑ๋ณ„ ๋“ฑ๋“ฑ. ๋ฐ์ดํ„ฐ๋ฅผ ์„ ์ •ํ•œ ๋‹ค์Œ์—๋Š” ํ›ˆ๋ จ/๊ฒ€์ฆ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„๋ฆฌํ•œ๋‹ค.

hr_1 = hr[['MarriedID', 'MaritalStatusID', 'GenderID', 'EmpStatusID', 'DeptID', 'PerfScoreID', 'FromDiversityJobFairID', 'EngagementSurvey', 'EmpSatisfaction', 'SpecialProjectsCount', 'Absences', 'Salary']]

X = hr_1.drop(columns=['Salary'])
y = hr_1['Salary']

from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(X, y, train_size=0.7, random_state=1)

 

ํšŒ๊ท€ํ•จ์ˆ˜๋ฅผ ์„ ์–ธํ•˜๊ณ  k๊ฐ’์„ 3, 5 ๋‘ ๊ฐœ๋กœ ๋‚˜๋ˆ„์–ด ํ…Œ์ŠคํŠธํ•ด๋ณด๊ธฐ๋กœ ํ•œ๋‹ค. for๋ฌธ ์•ˆ์—์„œ ๊ฐ€๊นŒ์šด ๊ฑฐ๋ฆฌ์˜ ์ด์›ƒ์ด 3์ธ ํ•จ์ˆ˜์™€ 5์ธ ํ•จ์ˆ˜์— ์˜ํ•ด ์˜ˆ์ธก๋œ ๊ฐ’์„ ์‹ค์ œ ๊ฐ’๊ณผ ์˜ค์ฐจ๋ฅผ ๊ณ„์‚ฐํ•œ ๋’ค ๋น„๊ตํ•ด ๋ณด์ž.

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

preds = ['reg_3', 'reg_5']
knum = [3, 5]

result = pd.DataFrame(columns=['preds', 'mae', 'mse', 'rmse'])

result['preds'] = preds


for k, name in zip(knum, preds):
    reg = KNeighborsRegressor(n_neighbors=k)
    
    reg.fit(train_x, train_y)
    pred = reg.predict(test_x)
    
    mae = round(mean_absolute_error(test_y, pred), 2)
    mse = mean_squared_error(test_y, pred)
    rmse = round(np.sqrt(mse), 2)  # take the root of the unrounded MSE, then round
    mse = round(mse)
    
    result.loc[result['preds']==name, 'mae'] = mae
    result.loc[result['preds']==name, 'mse'] = mse
    result.loc[result['preds']==name, 'rmse'] = rmse

The errors look kind of large..? I put salary in as it is without normalizing it, which seems to make the numbers hard to read. I think the difference would be clearer if I normalized first and then ran the analysis. Anyway, going by the current results, k=5 seems to give the better result.
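
Two hedged notes on the scale question. MAE and RMSE come out in the same unit as the target, so an error in the thousands is not necessarily bad for salaries; dividing RMSE by the mean salary gives a more readable relative figure. Separately, KNN distances are sensitive to feature scale, so standardizing the independent variables can change which neighbors get picked. The sketch below uses synthetic data (not the HR dataset) where a small-scale feature drives the target and a large-scale feature is pure noise. The variable names and numbers are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300

# synthetic data: 'informative' (range 0-1) drives the target,
# 'noise' (range 0-1000) is unrelated but dominates raw distances
informative = rng.uniform(0, 1, n)
noise = rng.uniform(0, 1000, n)
X = np.column_stack([informative, noise])
y = 50000 + 30000 * informative + rng.normal(0, 1000, n)

train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=0.7, random_state=1)

# KNN on the raw features
raw_model = KNeighborsRegressor(n_neighbors=5).fit(train_X, train_y)

# KNN after standardizing each feature to mean 0 and variance 1;
# the pipeline fits the scaler on the training data only
scaled_model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
scaled_model.fit(train_X, train_y)

rmse_raw = float(np.sqrt(mean_squared_error(test_y, raw_model.predict(test_X))))
rmse_scaled = float(np.sqrt(mean_squared_error(test_y, scaled_model.predict(test_X))))

# RMSE relative to the average target, easier to read than the raw number
relative_rmse = rmse_scaled / float(np.mean(test_y))

print(round(rmse_raw, 2), round(rmse_scaled, 2), round(relative_rmse, 4))
```

On real data the gain from scaling depends on the features, so this shows the mechanism rather than a guaranteed improvement.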