
project:eve

23.02.04 Decision Tree

Daily


eveee 2023. 2. 4. 21:11

Studying here: Conhas, Yeonhui branch

 

This is the seventh analysis model, the decision tree. Going by the book, I think I have now covered about half of the machine learning part... Let's keep pushing a little harder!!

 

 


 

 

1) Definition: a model that learns rules for splitting the data and then predicts by following those rules

 - Characteristics: easy to visualize, and it needs little preprocessing (feature scaling, for example, is unnecessary) / predictive performance tends to be poor....

(Figure: the yellow boxes are the split rules)
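Since the figure is easy to lose, here is a minimal sketch of how the split rules of a fitted tree can be printed as text. It uses scikit-learn's built-in iris data as a stand-in (an assumption, this post does not use iris elsewhere), and `max_depth=2` is chosen only to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data: the iris set that ships with scikit-learn (not the data used in this post)
iris = load_iris()

# A shallow tree so the printed rules stay readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each "|---" line is one split rule or one leaf with its predicted class
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line corresponds to one box in the usual tree diagram: a condition on a single feature, or a leaf with the predicted class.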

 

 

2) Types and split criteria

1- Classification: p-value of the chi-square statistic, Gini index, entropy index

 

2- Regression: F-statistic from ANOVA, reduction in variance
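To make the criteria above concrete, here is a small sketch of the Gini index, the entropy index, and the reduction in variance computed by hand. The labels and target values are made up for illustration, and the helper names are my own. The chi-square and F-statistic criteria are not covered here.

```python
import math
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 - sum(p_k^2) over the classes present."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Entropy of a node in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def variance(values):
    """Population variance, used as the impurity for regression splits."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def impurity_decrease(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Classification: a made-up node with four samples of each class
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [1, 0, 0, 0]
print(gini(parent))                                   # 0.5
print(entropy(parent))                                # 1.0
print(impurity_decrease(parent, left, right, gini))   # 0.125

# Regression: made-up target values, split into a low group and a high group
y_parent = [10.0, 12.0, 30.0, 32.0]
y_left, y_right = [10.0, 12.0], [30.0, 32.0]
print(impurity_decrease(y_parent, y_left, y_right, variance))  # 100.0
```

A tree picks, at each node, the split with the largest decrease. A perfectly mixed two-class node has Gini 0.5 and entropy 1.0, and both drop to 0 for a pure node.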

 

 

 

3) Code

 

1- Classification: I'll analyze the German credit data.

Source: (https://archive-beta.ics.uci.edu/dataset/144/statlog+german+credit+data)

 

Load the data and check whether there are any missing values.

import pandas as pd

credit = pd.read_csv('/Users/eve/Downloads/jupyter notebook/files/germancredit.csv')

credit.info()

No problems with the data: there are no missing values and every column is numeric.

(* Looking back, I don't think I have ever done thorough preprocessing in these analyses. At most I fill missing values with the mean or just drop the rows. It would be better to also check normality, linearity, and the distributions. I should write a separate post that covers nothing but preprocessing.)

 

 

Separate the independent variables from the dependent variable, then split the data into training and validation sets.

X = credit.drop(columns=['OBS', 'RESPONSE'])
y = credit['RESPONSE']

from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(X, y, train_size = 0.7, random_state=1, stratify=y)

 

Create the decision trees, fit them on the training data, and predict. Here I set a few classifier parameters to see which setting performs better. As mentioned above, one tree uses the Gini index as its split criterion and the other uses the entropy index, and I compare the two.

from sklearn.tree import DecisionTreeClassifier

clf_gn = DecisionTreeClassifier(criterion='gini', min_samples_split=50, max_depth=5)
clf_et = DecisionTreeClassifier(criterion='entropy', min_samples_split=50, max_depth=5)

clf_gn.fit(train_x, train_y)
clf_et.fit(train_x, train_y)

pred_gn = clf_gn.predict(test_x)
pred_et = clf_et.predict(test_x)

 

 

Now evaluate the performance. I compute each metric into its own variable.

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay is its replacement
from sklearn.metrics import RocCurveDisplay, roc_auc_score

clf_gn_cm = confusion_matrix(test_y, pred_gn)
clf_gn_acc = round(accuracy_score(test_y, pred_gn), 2)
clf_gn_prc = round(precision_score(test_y, pred_gn), 2)
clf_gn_rc = round(recall_score(test_y, pred_gn), 2)
clf_gn_f1 = round(f1_score(test_y, pred_gn), 2)

clf_et_cm = confusion_matrix(test_y, pred_et)
clf_et_acc = round(accuracy_score(test_y, pred_et), 2)
clf_et_prc = round(precision_score(test_y, pred_et), 2)
clf_et_rc = round(recall_score(test_y, pred_et), 2)
clf_et_f1 = round(f1_score(test_y, pred_et), 2)
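For reference, here is a rough sketch of what those four scores mean in terms of confusion-matrix counts. The labels are made up and `binary_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 from the four confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # of predicted positives, how many are right
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # of actual positives, how many are found
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"acc": accuracy, "prc": precision, "rc": recall, "f1": f1}


# Made-up labels: 6 actual positives and 4 actual negatives
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

scores = binary_metrics(y_true, y_pred)
print({k: round(v, 2) for k, v in scores.items()})
# {'acc': 0.7, 'prc': 0.8, 'rc': 0.67, 'f1': 0.73}
```

Here the model misses two of the six positives, so recall (0.67) is lower than precision (0.8) even though accuracy looks acceptable.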

 

 

Time to compare. Normally I would run a for loop that handles everything from creating the classifiers to evaluating them, but studying is hard today for some reason.. so I just filled in the table by hand.

result = pd.DataFrame(columns=['criterion', 'acc', 'prc', 'rc', 'f1'])

result['criterion'] = ['gini', 'entropy']
result.loc[result['criterion']=='gini','acc'] = clf_gn_acc
result.loc[result['criterion']=='gini','prc'] = clf_gn_prc
result.loc[result['criterion']=='gini','rc'] = clf_gn_rc
result.loc[result['criterion']=='gini','f1'] = clf_gn_f1

result.loc[result['criterion']=='entropy','acc'] = clf_et_acc
result.loc[result['criterion']=='entropy','prc'] = clf_et_prc
result.loc[result['criterion']=='entropy','rc'] = clf_et_rc
result.loc[result['criterion']=='entropy','f1'] = clf_et_f1

result
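The loop version mentioned above could look roughly like this. It is a sketch on synthetic data from `make_classification` (an assumption, so it runs without the credit CSV), and the `random_state=1` on the tree is my addition for reproducibility.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the credit data (assumption: 1000 rows, binary target)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=1)
tr_x, te_x, tr_y, te_y = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=1, stratify=y_demo
)

rows = []
for criterion in ["gini", "entropy"]:
    # Same hyperparameters as in the post, only the criterion changes
    clf = DecisionTreeClassifier(
        criterion=criterion, min_samples_split=50, max_depth=5, random_state=1
    )
    clf.fit(tr_x, tr_y)
    pred = clf.predict(te_x)
    rows.append({
        "criterion": criterion,
        "acc": round(accuracy_score(te_y, pred), 2),
        "prc": round(precision_score(te_y, pred), 2),
        "rc": round(recall_score(te_y, pred), 2),
        "f1": round(f1_score(te_y, pred), 2),
    })

# One row per criterion, built in a single step instead of cell-by-cell assignment
result_loop = pd.DataFrame(rows)
print(result_loop)
```

Collecting one dict per model and building the DataFrame at the end avoids the repeated `result.loc[...]` assignments, and adding a third setting only means extending the list.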

Accuracy is slightly higher with the Gini index, but the two trees differ in recall. If I had to pick one of the two, I would go with the classifier that has the higher recall.

This time I also tried some other evaluation functions from the library: the ROC plot, the AUC score, and the classification report function. Given the actual and predicted values, the classification report computes accuracy, precision, recall, and the F1 score all at once! From now on I'll just use this function.

In the code, I show the plot and the metrics separately for the Gini classifier and the entropy classifier.

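import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay, classification_report

# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
# ROC curve (the legend shows the AUC) and report for the Gini tree
fig, ax = plt.subplots(figsize=(10, 8))
RocCurveDisplay.from_estimator(clf_gn, test_x, test_y, ax=ax)
plt.show()
clf_report_gn = classification_report(test_y, pred_gn)
print(clf_report_gn)

# Same for the entropy tree
fig, ax = plt.subplots(figsize=(10, 8))
RocCurveDisplay.from_estimator(clf_et, test_x, test_y, ax=ax)
plt.show()
clf_report_et = classification_report(test_y, pred_et)
print(clf_report_et)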

 

 

Going by the AUC score, the Gini classifier seems to perform very slightly better. The classification report is so convenient that I'll probably use nothing else when I need performance metrics.
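As a side note on what the AUC number means: it can be read as the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one, with ties counted as half. Below is a minimal sketch of that definition with made-up scores; `auc_from_scores` is a hypothetical helper. With a fitted classifier, the scores would be something like `clf_gn.predict_proba(test_x)[:, 1]`.

```python
def auc_from_scores(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up predicted probabilities for three positives and three negatives
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.4, 0.1]

auc = auc_from_scores(y_true, scores)
print(round(auc, 3))  # 0.833
```

A shallow decision tree only outputs a handful of distinct probabilities (one per leaf), so ties are common and small AUC differences between two trees should be read with some caution.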

 


One more thing) after fitting, you can get the importance of each column.

ft_imp = clf_gn.feature_importances_

print(ft_imp)

Printing it gives a NumPy array.

 

The importance array follows the column order of the original data, so to make it readable I take the column names from the data and combine them with the array. This includes turning both into DataFrames so that pd.concat can be used.

df_ft_imp = pd.DataFrame(ft_imp)

col = pd.DataFrame(X.columns)

imp = pd.concat([col, df_ft_imp], axis=1)

imp.columns = ['col', 'importance']

imp.sort_values(by = 'importance', axis=0, ascending=False)

I pulled only the top five columns in descending order (for example by appending .head(5) to the sort above). Of the nearly 30 columns, only about 10 have any influence, and the rest are almost 0.
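A shorter way to get the same table is to put the importances in a pandas Series indexed by column name. This is a sketch with hypothetical column names and importance values. With the fitted tree it would be `pd.Series(clf_gn.feature_importances_, index=X.columns)`.

```python
import numpy as np
import pandas as pd

# Hypothetical names and values, for illustration only (not the real credit columns)
feature_names = ["feat_a", "feat_b", "feat_c", "feat_d", "feat_e", "feat_f"]
importances = np.array([0.05, 0.40, 0.0, 0.25, 0.30, 0.0])

# Index the importances by column name, then sort and keep the top rows
imp_series = pd.Series(importances, index=feature_names, name="importance")
top = imp_series.sort_values(ascending=False).head(3)
print(top)

# Count how many columns the tree actually used (importance above zero)
n_used = int((imp_series > 0).sum())
print(n_used)  # 4
```

Columns with importance 0 were never chosen for a split, which is expected when `max_depth` limits the tree to a few splits.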

 

 

 

 

2- Regression: I analyzed insurance premium data, where the premium depends on age and several other factors.

 

Load the data and do some preprocessing.

import pandas as pd

ins = pd.read_csv('/Users/eve/Downloads/jupyter notebook/files/insurance.csv')
ins.info()

 

There are no missing values, but a few columns hold strings. Let's convert them to numeric.

ins[['sex', 'smoker']]

import numpy as np

# Encode the two binary string columns as 0/1
ins['sex'] = np.where(ins['sex']=='female', 1, 0)
ins['smoker'] = np.where(ins['smoker']=='yes', 1, 0)

 

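Once preprocessing is done, split the data.

# ins_1 is never defined in this post; assuming it is the preprocessed ins from above
# (any remaining string columns would need to be encoded or dropped first)
ins_1 = ins.copy()
X = ins_1.drop(columns=['charges'])
y = ins_1['charges']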

from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(X, y, train_size=0.7, random_state=1)

 

I called the decision tree regressor and decided to fit it with max_depth, one of its main parameters, set to 3 and to 5, then compare the difference. This time I used a loop instead of writing everything out one by one.

import numpy as np  # np.sqrt is used below
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

preds = ['reg_5', 'reg_3']

result = pd.DataFrame(columns=['preds', 'mae', 'mse', 'rmse'])
result['preds'] = preds

depth = [5, 3]

for i, d in enumerate(depth):
    
    reg = DecisionTreeRegressor(max_depth=d)
    reg.fit(train_x, train_y)
    pred = reg.predict(test_x)
    
    mae = round(mean_absolute_error(test_y, pred), 2)
    mse = round(mean_squared_error(test_y, pred))
    rmse = round(np.sqrt(mse), 2)
    
    result.loc[i, 'mae'] = mae
    result.loc[i, 'mse'] = mse
    result.loc[i, 'rmse'] = rmse
    
print(result)

 

The smaller the error, the better the model, so the tree with max_depth 3 looks slightly better. The numbers come out this large probably because the dependent variable spans a wide range of values, and MSE in particular is in squared units.. If I had normalized the target before fitting, the results would have been easier to read.
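To see why the numbers get so large, here is a small sketch that computes the three errors by hand and then divides RMSE by the mean of the target to get a scale-free figure. The values are made up and `regression_errors` is a hypothetical helper. Dividing by the mean is only one possible way to normalize, not something taken from the book.

```python
import math


def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equally long sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n   # same unit as the target
    mse = sum(e * e for e in errors) / n    # squared unit, so it grows quickly
    rmse = math.sqrt(mse)                   # back to the unit of the target
    return mae, mse, rmse


# Made-up charges and predictions
y_true = [2000.0, 5000.0, 12000.0, 40000.0]
y_pred = [3000.0, 4000.0, 10000.0, 44000.0]

mae, mse, rmse = regression_errors(y_true, y_pred)
print(mae)              # 2000.0
print(mse)              # 5500000.0
print(round(rmse, 2))   # 2345.21

# RMSE relative to the mean target value: a unit-free number that is easier to read
relative_rmse = rmse / (sum(y_true) / len(y_true))
print(round(relative_rmse, 3))  # 0.159
```

Read this way, an RMSE in the thousands is not alarming by itself: what matters is how large it is compared with typical values of the target.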