
project:eve

23.02.11 Naive Bayes

Daily


eveee 2023. 2. 6. 23:31

 

Studying here: Starbucks The Jongno R store

 

 

Naive Bayes (naive-bayes)


๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ๋Š” ๋ฒ ์ด์ฆˆ ์ •๋ฆฌ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋Š” ๋ถ„์„ ๋ชจ๋ธ. ์ „์ฒด์˜ ํ™•๋ฅ  ๋ถ„ํฌ ๋Œ€๋น„ ํŠน์ • ํด๋ž˜์Šค์— ์†ํ•  ํ™•๋ฅ ์„ ์ •๋ฆฌํ•จ.
์˜ˆ๋ฅผ ๋“ค์–ด ๋ฉ”์ผ์˜ ์ŠคํŒธ ๊ธฐ์ค€์„ ํŒ์ •ํ•œ๋‹ค๋ฉด, ์ŠคํŒธ๋ฉ”์ผ์— ๋ณต๊ถŒ์ด๋ผ๋Š” ๋‹จ์–ด๊ฐ€ ๋“ค์–ด์žˆ์„ ํ™•๋ฅ ์„ ์•ˆ๋‹ค๋ฉด ์ŠคํŒธ๋ฉ”์ผ์— ๋ณต๊ถŒ์ด๋ผ๋Š” ๋‹จ์–ด๊ฐ€ ๋“ค์–ด์žˆ์„ ๋•Œ ์ŠคํŒธ๋ฉ”์ผ์ผ ํ™•๋ฅ ์„ ๊ตฌํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์ด๋‹ค.
๊ฐ„๋‹จํ•˜๊ณ  ๋น ๋ฅด๊ธฐ ๋•Œ๋ฌธ์— ์ถ”์ • ํ™•๋ฅ ์„ ์‰ฝ๊ฒŒ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค.
๋‹จ์ ์€ ๋ชจ๋“  ๋…๋ฆฝ๋ณ€์ˆ˜์˜ ์ƒ๊ด€์„ฑ์„ ๋ฌด์‹œํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๊ฒฐ๊ณผ๊ฐ€ ์™œ๊ณก๋  ์ˆ˜ ์žˆ๋‹ค.

๋ฐ์ดํ„ฐ๊ฐ€ ์ ์„ ๋•Œ๋‚˜ ๋ฏธ๋ž˜๋ฅผ ์˜ˆ์ธกํ•  ๋•Œ ์ž์ฃผ ์‚ฌ์šฉํ•œ๋‹ค.

 

1) ๋ฒ ์ด์ฆˆ ์ •๋ฆฌ

์‚ฌ๊ฑด A, B๊ฐ€ ์žˆ์„ ๋•Œ, B๊ฐ€ ์ผ์–ด๋‚œ ๋’ค์— A๊ฐ€ ์ผ์–ด๋‚œ ํ™•๋ฅ ์„ ๊ตฌํ•œ๋‹ค.

๊ทธ๋Ÿฐ๋ฐ ์ง€๊ธˆ ์•Œ๊ณ  ์žˆ๋Š” ๊ฒƒ์€ A, B๊ฐ€ ๊ฐ๊ฐ ์ผ์–ด๋‚  ํ™•๋ฅ , A๊ฐ€ ์ผ์–ด๋‚ฌ์„ ๋•Œ B๊ฐ€ ์ผ์–ด๋‚  ํ™•๋ฅ ์ด๋‹ค. ์ด ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ๊ตฌํ•˜๊ณ ์ž ํ•˜๋Š” ๊ฒƒ์„ ๊ณ„์‚ฐํ•˜๋Š” ๊ณต์‹์„ ๋งํ•œ๋‹ค.

 

Suppose someone makes the following claim.

It sounds right at first, but the error becomes obvious once you notice that the two groups come from different populations. Say 100 people died in accidents: 40 of them died without a seat belt on, so 60 died even though they were wearing one. But how many of all drivers do not wear a seat belt, and how many do? Once we know that ratio, we can pin down the mistake precisely.

 

์œ„์˜ ๋ฒ ์ด์ฆˆ ์ •๋ฆฌ์™€ ์—ฐ๊ฒฐ์ง€์–ด ์ƒ๊ฐํ–ˆ์„ ๋•Œ ์•ˆ์ „๋ ๋ฅผ ๋งค๋Š” ์•ˆ์ „๋ ๋ฅผ ๋งค๋Š” ์‚ฌ๊ฑด์„ P(A)๋ผ๊ณ  ํ•˜๊ณ  ์‚ฌ๊ณ ๋ฅผ ๋‹นํ•  ํ™•๋ฅ ์„ P(B)๋ผ๊ณ  ์ƒ๊ฐํ•ด ๋ณด์ž.

๊ทธ๋Ÿฌ๋ฉด P(A|B)๋Š” 0.6์ด ๋œ๋‹ค. ์ด๋•Œ ์งˆ๋ฌธ๋Œ€๋กœ ์šฐ๋ฆฌ๊ฐ€ ๊ถ๊ธˆํ•œ ๊ฒƒ์€ ์•ˆ์ „๋ ๋ฅผ ๋งธ์„ ๋•Œ ์ฃฝ์„ ํ™•๋ฅ  P(B|A) ์ด๋‹ค. ์ด๊ฒƒ์„ ๋ฒ ์ด์ฆˆ ์ •๋ฆฌ๋กœ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค. ๋˜ ์ „์ฒด ์šด์ „์ž ์ค‘ 95%๊ฐ€ ์•ˆ์ „๋ ๋ฅผ ๋งธ๊ณ  5%๊ฐ€ ์•ˆ์ „๋ ๋ฅผ ๋งค์ง€ ์•Š์•˜๋‹ค๊ณ  ๊ฐ€์ •ํ•˜๋ฉด, 40๋ช…์€ 5%์˜ 40๋ช…์ด๊ณ  60๋ช…์€ 95%์˜ 60๋ช…์ธ ๊ฒƒ์ด๋‹ค. ๋˜ ์ „์ฒด ์šด์ „์ž 1๋งŒ๋ช… ์ค‘ 1๋ช… ๊ผด๋กœ ์‚ฌ๊ณ ๋ฅผ ๋‹นํ•œ๋‹ค๊ณ  ๊ฐ€์ •ํ•˜๋ฉด ์ „์ž๋Š” 5000๋ช… ์ค‘ 40๋ช…์ด ์ฃฝ๋Š” ๊ฒƒ์ด๊ณ  ํ›„์ž๋Š” 95000๋ช… ์ค‘ 60๋ช…์ด ์ฃฝ๋Š” ๊ฒƒ์ด๋‹ค. ๊ทธ๋Ÿผ ํ™•๋ฅ ๋„ ์—„์ฒญ ์ฐจ์ด๊ฐ€ ๋‚˜๊ฒ ์ง€???

๊ทธ๋ฆฌ๊ณ  P(A)๋Š” 0.95, P(B)๋Š” 0.0001์ด ๋œ๋‹ค.

 

Now let's plug the numbers into Bayes' theorem.

The probability of dying in an accident given a seat belt, P(B|A), is (probability of having worn a seat belt given a fatal accident × probability of a fatal accident) / probability of wearing a seat belt. That is, 0.6 × 0.0001 / 0.95 ≈ 0.000063 (about 1 in 16,000).

Conversely, for drivers who did not wear a seat belt, use P(not A|B) = 0.4 and P(not A) = 0.05: 0.4 × 0.0001 / 0.05 = 0.0008 (1 in 1,250).

So the risk of dying is more than 10 times lower with a seat belt than without one.
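The seat-belt arithmetic can be checked with a few lines of Python. This is only a minimal sketch: the variable names and the bayes helper are made up for illustration, and every input is one of the assumed figures from the example (95% belt usage, a 1 in 10,000 fatality rate, and a 60/40 split among the deaths).

```python
# Seat-belt example: all inputs are the assumed figures from the text.
p_belt = 0.95                 # P(A): driver wears a seat belt
p_no_belt = 1 - p_belt        # P(not A)
p_death = 0.0001              # P(B): driver dies in an accident
p_belt_given_death = 0.6      # P(A|B)
p_no_belt_given_death = 0.4   # P(not A|B)


def bayes(likelihood, prior, evidence):
    """Return P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / evidence


p_death_given_belt = bayes(p_belt_given_death, p_death, p_belt)
p_death_given_no_belt = bayes(p_no_belt_given_death, p_death, p_no_belt)

print(f"P(death | belt)    = {p_death_given_belt:.6f}")
print(f"P(death | no belt) = {p_death_given_no_belt:.6f}")
print(f"risk ratio         = {p_death_given_no_belt / p_death_given_belt:.1f}")
```

The ratio of the two results is the "more than 10 times" gap between the groups.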

 

์ด ์›๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•ด์„œ, ๋‘ ์‚ฌ๊ฑด์ด ๋…๋ฆฝ์ ์œผ๋กœ ๋ฐœ์ƒํ•œ๋‹ค๊ณ  ๊ฐ€์ •ํ–ˆ์„ ๋•Œ ์‚ฌ์ „ ํ™•๋ฅ ๊ณผ ๊ฐ ์‚ฌ๊ฑด์˜ ํ™•๋ฅ ๋กœ ์‚ฌํ›„ ํ™•๋ฅ ์„ ์ถ”์ •ํ•ด ๋ถ„๋ฅ˜ํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

 

 

2) Laplace smoothing

A technique for correcting naive Bayes classification. If a particular event never appears in the training data, its estimated probability is 0, which zeroes out the whole product of probabilities. To avoid this, a default value is added to the occurrence count of every event.

 

 

 

 

 

3) Types: divided according to the form or value distribution of the independent variables (features). The dependent variable is a categorical class in all of them.

 

Hold on, do Bernoulli/multinomial/Gaussian describe the independent variable or the dependent variable?

When the book covers Gaussian, it uses continuous independent variables and a categorical dependent variable, because Gaussian is for continuous data.

By the same logic, Bernoulli means the independent variables are discrete and the dependent variable is categorical.

But the dataset used for the Bernoulli example seems to put categorical data in the independent variable and discrete data in the dependent variable. What is going on?

According to one site, when Bernoulli is used with movie review data, x holds the review text and y holds the discrete positive/negative label. So y is the class label in every case, and the name of the variant refers to how x is modeled.

 

1- Bernoulli

For binary features: each independent variable is 0 or 1 (absent or present), and the dependent variable is a categorical class.

Spam filtering: classify an email as spam or normal according to whether particular words appear.

 

 

2- Multinomial

For count features: the dependent variable is categorical, and each independent variable takes a value according to how many times an item appears.

Movie reviews: classify a review as positive or negative according to how many times particular words appear.

 

 

3- Gaussian

For continuous independent variables. The analysis assumes that each independent variable follows a normal distribution within each class.

 

 

 

4) Code

1- Bernoulli

The data is a spam mail classification dataset.

import pandas as pd

spam = pd.read_csv('/Users/eve/Downloads/jupyter notebook/files/spam.csv', encoding='latin1')

spam.head()

spam.info()

v1 looks like the dependent variable and v2 the independent variable.

Looking at the missing values, columns 2, 3, and 4 contain almost no data, so I drop them. I will analyze with v2 as the independent variable and v1 as the dependent variable.

sklearn's CountVectorizer splits text containing many words into individual words and stores each message as a vector.

It is used like an encoder: fit it on the training data, then call transform on the training and test data separately to split them into words. After this, each message becomes a numeric vector of word counts: 0 if a word does not appear, 1 if it appears once, 2 if it appears twice, and so on. With binary=True, as used here, every nonzero count is stored as 1.

Calling inverse_transform afterwards shows which words each encoded row contains.
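A small sketch of what CountVectorizer produces, on two made-up messages rather than the spam data. It contrasts binary=False (counts) with binary=True (presence flags) and shows what inverse_transform returns.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["win win lottery now", "see you at lunch"]   # made-up messages

counts = CountVectorizer(binary=False).fit(docs)   # raw occurrence counts
flags = CountVectorizer(binary=True).fit(docs)     # presence/absence only

count_matrix = counts.transform(docs).toarray()
flag_matrix = flags.transform(docs).toarray()

print(sorted(counts.vocabulary_))   # column order of both matrices
print(count_matrix)                 # "win" is counted twice in the first row
print(flag_matrix)                  # every word that is present becomes 1

# inverse_transform lists the words present in each row (not the original sentence)
recovered = flags.inverse_transform(flag_matrix)
print(recovered[0])
```

The binary form matches what a Bernoulli model expects, while the count form is the natural input for a multinomial model.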

spam_1 = spam.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])

X = spam_1['v2']
y = spam_1['v1']

from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(X, y, train_size=0.7, random_state=1, stratify=y)



# Split each message in the column into words and build one vector per message
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(binary=True)

# Fit the vocabulary on the message text (train_x), not on the labels
train_xcv = cv.fit_transform(train_x)

encoded_input = train_xcv.toarray()


inv = cv.inverse_transform(encoded_input)

print(inv[0])
# Output -> ['couple' 'down' 'give' 'me' 'minutes' 'my' 'sure' 'to' 'track' 'wallet' 'yeah']

 

๋ฒ ๋ฅด๋ˆ„์ด ๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ ํ•จ์ˆ˜๋ฅผ ์„ ์–ธํ•˜๊ณ  ํ•™์Šต/์˜ˆ์ธกํ•˜๊ธฐ

from sklearn.naive_bayes import BernoulliNB

bnb = BernoulliNB()

bnb.fit(train_xcv, train_y)

test_xcv = cv.transform(test_x)
pred = bnb.predict(test_xcv)

# Scale to a percentage before rounding to avoid floating-point noise in the output
print("acc score : {0}%".format(round(bnb.score(test_xcv, test_y) * 100, 2)))

from sklearn.metrics import classification_report

cr = classification_report(test_y, pred)
print(cr)

The evaluation metrics look excellent.

 

 

2- Multinomial

import pandas as pd

imdb = pd.read_csv('/Users/eve/Downloads/jupyter notebook/files/자연어/IMDB Dataset.csv')

print(imdb.head())

 

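The independent variable is the review text and the dependent variable is the positive/negative sentiment.

import numpy as np

# Encode the labels (positive -> 1, negative -> 0) before taking y from the column
imdb['sentiment'] = np.where(imdb['sentiment']=='positive', 1, 0)

X = imdb['review']
y = imdb['sentiment']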


from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(X, y, train_size=0.5, random_state=1,
                                                   stratify=y)


from sklearn.feature_extraction.text import CountVectorizer


cv = CountVectorizer(binary=False)

train_xcv = cv.fit_transform(train_x)

 

Natural language processing has to look at every single word, so the text always needs to be split into word vectors first. After that, compare the predictions with the actual values and evaluate the performance.

from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
mnb.fit(train_xcv, train_y)

test_xcv = cv.transform(test_x)
pred = mnb.predict(test_xcv)

from sklearn.metrics import classification_report, confusion_matrix

cm = confusion_matrix(test_y, pred)
cr = classification_report(test_y, pred)

print(cr)

print(cm)

(*๊ทผ๋ฐ ์ด ๋ฐ์ดํ„ฐ๊ฐ€ ๋‹คํ•ญ๋ถ„ํฌ๋กœ ๋ถ„์„ํ•˜๋Š”๊ฒŒ ๋งž๋Š”์ง€ ๋ชจ๋ฅด๊ฒ ๋‹ค. ์ŠคํŒธ์ธ์ง€ ์•„๋‹Œ์ง€๋Š” ์˜ˆ๋ฅผ ๋“ค์–ด ๋ณต๊ถŒ์ด๋ผ๋Š” ๋‹จ์–ด๊ฐ€ ์žˆ๋Š”์ง€ ์•„๋‹Œ์ง€๋กœ ํŒ๋‹จํ•œ๋‹ค ์น˜๋ฉด ๋ชจ๋ธ์— ๋ฐ์ดํ„ฐ๋ฅผ ๋„ฃ๊ธฐ๋งŒ ํ•˜๋ฉด ๋ณต๊ถŒ์ด๋ผ๋Š” ๋‹จ์–ด๊ฐ€ ์ง€ํ‘œ๋ผ๊ณ  ํžŒํŠธ๋ฅผ ์•ˆ ์ค˜๋„ ๊ทธ๋ ‡๊ฒŒ ๋ถ„์„์„ ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฑด๊ฐ€? ๋‹คํ•ญ๋ถ„ํฌ๋Š” ๋‹จ์–ด์˜ ํšŸ์ˆ˜๋ฅผ ์ธก์ •ํ•ด์„œ ํ•œ๋‹ค๋ฉด... ์ŠคํŒธ ์ฒ˜๋ฆฌํ•˜๋Š”๊ฑด ๋ชจ๋ฅด๊ฒ ๊ณ  ๋ฐ˜๋Œ€๋กœ ๋ฒ ๋ฅด๋ˆ„์ด๋กœ ๊ฐ์ƒํ‰ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ์ง€ ์•Š๋‚˜? ๊ธ์ •์ธ ํ‰์—๋Š” ๊ฒน์น˜๋Š” ๋‹จ์–ด๋“ค์ด ๋งŽ์„ ํ…Œ๋‹ˆ๊นŒ. ๋‹คํ•ญ๋ถ„ํฌ์— ๋Œ€ํ•ด์„œ ์กฐ๊ธˆ ๋” ๊ณต๋ถ€ํ•  ํ•„์š”์„ฑ์ด ์žˆ๊ฒ ๋‹ค. ๋‹จ์–ด ๋ฐ˜๋ณต ํšŸ์ˆ˜ ์ธก์ •ํ•ด์„œ ์˜ˆ์ธกํ•œ๋‹ค๋Š”๊ฑฐ ๋ง๊ณค ์›๋ฆฌ๋ฅผ ๋ชจ๋ฅด๋‹ˆ๊นŒ ์ดํ•ด๊ฐ€ ์•ˆ ๋˜๋Š” ๊ฒƒ ๊ฐ™๋‹ค)

 

 

 

3- Gaussian

I used a dataset of astronomical observations in which each object is labeled as a galaxy or a star. During training the model learns which characteristics stars and galaxies have, so that at prediction time it can tell from those characteristics whether an object is a star or a galaxy.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sky = pd.read_csv('/Users/eve/Downloads/jupyter notebook/files/classification/Skyserver.csv')

sky.head()

์ข…์†๋ณ€์ˆ˜๊ฐ€ class์ด๊ณ  ๋ฒ”์ฃผ๋Š” STAR, GALAXY, Q50์ด๋‹ค. ๋‚˜๋จธ์ง€๋Š” ๋…๋ฆฝ๋ณ€์ˆ˜์ด๊ณ  ๋ชจ๋‘ ์—ฐ์†ํ˜•์ด๋ฏ€๋กœ ๊ฐ€์šฐ์‹œ์•ˆ ๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ๋ฅผ ์ ์šฉํ•ด์•ผ ํ•œ๋‹ค.

from sklearn.model_selection import train_test_split

X = sky.drop(columns=['class'])
y = sky['class']

train_x, test_x, train_y, test_y = train_test_split(X, y, train_size=0.7,
                                                   random_state=1)

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()

pred = gnb.fit(train_x, train_y).predict(test_x)

# score() returns a fraction between 0 and 1, so scale it before printing with '%'
print('acc score :\t{0}%'.format(round(gnb.score(test_x, test_y) * 100, 2)))

 

์ •ํ™•๋„๋Š” 0.8์ด๋‹ค.