💡
EDA (Exploratory Data Analysis): the work of computing the statistics an analysis needs from the data and checking them through visualization.
  • Exploratory data analysis (EDA) is the process of getting familiar with the data.
  • It is where you check the information you want to confirm in the data as the analysis proceeds.
  • There is no fixed process that works like a correct answer or a rule; each analyst's methodology differs a little.
  • The appropriate methodology also varies with the data being used.
  • The more you know about the data, the better your EDA will be. (Domain Knowledge)
  • The more you build your own EDA process, the stronger your skills as a Data Scientist become.

For ease of explanation, assume the Iris dataset has already gone through Module 2 and is defined as follows.

Checking the data size

  • The given Iris dataset has 150 rows and 6 columns. (150 x 6)
  • Measured with pandas, its memory usage is roughly 7.2KB.
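
The size check above can be sketched in pandas as follows. This is a minimal sketch, not the original notebook: the nine-row excerpt and the column names (taken from the Kaggle `Iris.csv` layout) are assumptions, so the printed numbers describe the excerpt and not the full 150 x 6 table. For the full file, 150 x 6 cells at 8 bytes each comes to 7,200 bytes, which is in line with the 7.2KB figure.

```python
import io

import pandas as pd

# Nine-row excerpt (three rows per species) standing in for the full Iris.csv.
# Column names follow the Kaggle file layout and are an assumption here.
SAMPLE_CSV = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
51,7.0,3.2,4.7,1.4,Iris-versicolor
52,6.4,3.2,4.5,1.5,Iris-versicolor
53,6.9,3.1,4.9,1.5,Iris-versicolor
101,6.3,3.3,6.0,2.5,Iris-virginica
102,5.8,2.7,5.1,1.9,Iris-virginica
103,7.1,3.0,5.9,2.1,Iris-virginica
"""

# Swap in pd.read_csv("Iris.csv") to work on the full dataset.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape

# memory_usage() returns bytes per column (plus the index); summing gives the
# total. Pass deep=True to include the real size of Python string objects.
mem_bytes = int(df.memory_usage().sum())

print(f"{n_rows} rows x {n_cols} columns, about {mem_bytes / 1000:.1f} KB")
```

`df.info()` prints the same shape and memory information in one call, together with each column's dtype and non-null count.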

Checking the data distribution

  • target distribution
  • Petal Length VS Petal Width
  • Feature Histogram
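
The three checks listed above could be drawn with pandas and matplotlib roughly like this. It is a sketch under assumptions: the nine-row excerpt and the Kaggle column names are stand-ins for the full dataset, and the figure layout is only one possible choice.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

# Nine-row excerpt standing in for the full Iris.csv (assumed column names).
SAMPLE_CSV = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
51,7.0,3.2,4.7,1.4,Iris-versicolor
52,6.4,3.2,4.5,1.5,Iris-versicolor
53,6.9,3.1,4.9,1.5,Iris-versicolor
101,6.3,3.3,6.0,2.5,Iris-virginica
102,5.8,2.7,5.1,1.9,Iris-virginica
103,7.1,3.0,5.9,2.1,Iris-virginica
"""
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
feature_cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

fig, axes = plt.subplots(1, 3, figsize=(16, 4))

# 1) Target distribution: number of rows per species.
counts = df["Species"].value_counts()
counts.plot.bar(ax=axes[0], rot=0, title="Target distribution")

# 2) Petal length vs. petal width, one colour per species.
for species, group in df.groupby("Species"):
    axes[1].scatter(group["PetalLengthCm"], group["PetalWidthCm"], label=species)
axes[1].set_xlabel("PetalLengthCm")
axes[1].set_ylabel("PetalWidthCm")
axes[1].set_title("Petal Length VS Petal Width")
axes[1].legend()

# 3) Feature histogram: the four measurements overlaid on one axis.
df[feature_cols].plot.hist(ax=axes[2], bins=10, alpha=0.5, title="Feature Histogram")

fig.tight_layout()
# Call plt.show() in a script, or fig.savefig("iris_distribution.png") to keep the image.
```

A balanced target (equal counts per species) is worth confirming early, because it affects which evaluation metrics make sense later.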

Data visualization

  • Pairplot
  • Boxplot
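
A pairplot is usually drawn with seaborn's `pairplot`; to keep dependencies down to pandas and matplotlib, the sketch below uses `pandas.plotting.scatter_matrix` as a stand-in, followed by a per-species boxplot. The nine-row excerpt and the column names are assumptions, as before.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix

# Nine-row excerpt standing in for the full Iris.csv (assumed column names).
SAMPLE_CSV = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
51,7.0,3.2,4.7,1.4,Iris-versicolor
52,6.4,3.2,4.5,1.5,Iris-versicolor
53,6.9,3.1,4.9,1.5,Iris-versicolor
101,6.3,3.3,6.0,2.5,Iris-virginica
102,5.8,2.7,5.1,1.9,Iris-virginica
103,7.1,3.0,5.9,2.1,Iris-virginica
"""
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
feature_cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

# Integer code per species, used only to colour the scatter points.
species_codes = df["Species"].astype("category").cat.codes

# Pairplot stand-in: every feature against every other feature,
# with a histogram of each feature on the diagonal.
pair_axes = scatter_matrix(
    df[feature_cols], c=species_codes, diagonal="hist", figsize=(8, 8)
)

# Boxplot: one panel per feature, one box per species.
box_axes = df.boxplot(column=feature_cols, by="Species", figsize=(10, 8))

# Call plt.show() in a script to display both figures.
```

The Id column is left out on purpose: it is a row label, not a measurement, and would only add noise to the grid.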

Statistical analysis

  • Correlation Matrix
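
A correlation matrix for the four measurements can be computed with `DataFrame.corr`, which defaults to the Pearson coefficient. The sketch below again assumes the nine-row excerpt and the Kaggle column names, so the values it prints are for the excerpt only.

```python
import io

import pandas as pd

# Nine-row excerpt standing in for the full Iris.csv (assumed column names).
SAMPLE_CSV = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
51,7.0,3.2,4.7,1.4,Iris-versicolor
52,6.4,3.2,4.5,1.5,Iris-versicolor
53,6.9,3.1,4.9,1.5,Iris-versicolor
101,6.3,3.3,6.0,2.5,Iris-virginica
102,5.8,2.7,5.1,1.9,Iris-virginica
103,7.1,3.0,5.9,2.1,Iris-virginica
"""
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
feature_cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

# Pairwise Pearson correlation between the numeric features.
# Id (a row label) and Species (a string) are excluded.
corr = df[feature_cols].corr()

print(corr.round(2))
```

The matrix is symmetric with ones on the diagonal, so only the upper or lower triangle needs to be read. It pairs naturally with the Petal Length VS Petal Width scatter plot, which shows the same relationship visually.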

Reference

  1. https://www.kaggle.com/benhamner/python-data-visualizations
  2. https://www.kaggle.com/ash316/ml-from-scratch-with-iris
  3. https://www.kaggle.com/upadorprofzs/basic-visualization-techniques


Hands-on

  • Data Source : Iris Species (https://www.kaggle.com/uciml/iris), a classic dataset for classifying iris plants into three species.
  1. Use Excel to compute the mean and standard deviation of each column.
  2. Use Excel to draw the target distribution as a bar chart.
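
To cross-check the Excel results, the same numbers can be computed in pandas. This is an optional sketch that assumes the nine-row excerpt and Kaggle column names below; load the full `Iris.csv` to compare against a real spreadsheet. Note that pandas' `std` uses the sample formula (ddof=1), which corresponds to Excel's `STDEV.S`, not `STDEV.P`.

```python
import io

import pandas as pd

# Nine-row excerpt standing in for the full Iris.csv (assumed column names).
SAMPLE_CSV = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
51,7.0,3.2,4.7,1.4,Iris-versicolor
52,6.4,3.2,4.5,1.5,Iris-versicolor
53,6.9,3.1,4.9,1.5,Iris-versicolor
101,6.3,3.3,6.0,2.5,Iris-virginica
102,5.8,2.7,5.1,1.9,Iris-virginica
103,7.1,3.0,5.9,2.1,Iris-virginica
"""
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
feature_cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

# Exercise 1: mean and sample standard deviation of each numeric column.
stats = df[feature_cols].agg(["mean", "std"])
print(stats.round(3))

# Exercise 2: the counts behind the target-distribution bar chart.
target_counts = df["Species"].value_counts()
print(target_counts)
```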