Big Data Analysis Engineer (빅데이터분석기사) Practical Exam Guide

Key Claims

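Type 1: Data Manipulation

Main task types: checking and converting data types, descriptive statistics, indexing/filtering/sorting, handling missing values and outliers, scaling (standardization/normalization), combining datasets, and date/time handling.¹
Core pandas commands: .dtypes, .astype(), .mean()/.median()/.mode(), .quantile(), .isnull().sum(), .fillna(), .dropna(), .drop_duplicates().¹
IQR calculation: Q1 = df['col'].quantile(0.25); Q3 = df['col'].quantile(0.75); IQR = Q3 - Q1.¹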
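
The IQR formula above is usually applied to flag outliers. The sketch below is one way to do that on toy data; the 1.5 × IQR fences, the median fill, and the sample values are assumptions for illustration and are not taken from the notes.

```python
import pandas as pd

# Hypothetical toy data; 'col' mirrors the column name used in the notes.
df = pd.DataFrame({'col': [10, 12, 11, 13, 12, 11, 100, None]})

# Inspect and fill missing values (median fill is an assumed choice).
n_missing = df['col'].isnull().sum()
df['col'] = df['col'].fillna(df['col'].median())

# Quartiles and IQR, exactly as in the notes.
q1 = df['col'].quantile(0.25)
q3 = df['col'].quantile(0.75)
iqr = q3 - q1

# Assumed convention: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = df[(df['col'] < lower) | (df['col'] > upper)]
cleaned = df[df['col'].between(lower, upper)]

# Scaling on the cleaned column: min-max normalization and z-score standardization.
col = cleaned['col']
minmax = (col - col.min()) / (col.max() - col.min())
zscore = (col - col.mean()) / col.std()  # pandas .std() uses ddof=1
```

Note that pandas computes the sample standard deviation (ddof=1), whereas sklearn's StandardScaler uses the population version (ddof=0), so the two standardizations differ slightly on small samples.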

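Type 2: Machine Learning Pipeline

Analysis order: load libraries and inspect the data → EDA (data types, missing values, descriptive statistics) → preprocessing and train/validation split → modeling and performance evaluation → submit predictions.²
Classification model: RandomForestClassifier. Metrics: Accuracy, F1 score (macro), AUC.²
Regression model: RandomForestRegressor. Metrics: R², MSE, RMSE.²
Handling train/test column mismatch after one-hot encoding: x_train = x_train.reindex(columns=x_test.columns, fill_value=0), applied before fitting so that both frames share the same columns. Reindexing x_test to x_train.columns is the more common direction; either way the result must be assigned back.²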
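
To make the column-mismatch problem concrete, here is a minimal sketch on made-up frames. It aligns the test set to the training columns, which is the more common variant of the reindex idea in the notes; the column names and category values are hypothetical.

```python
import pandas as pd

# Hypothetical frames: 'yellow' appears only in test, 'blue'/'green' only in train.
x_train = pd.DataFrame({'color': ['red', 'blue', 'green'], 'size': [1, 2, 3]})
x_test = pd.DataFrame({'color': ['red', 'yellow'], 'size': [4, 5]})

# Encoding each frame separately yields different dummy columns.
x_train_enc = pd.get_dummies(x_train)
x_test_enc = pd.get_dummies(x_test)
mismatch = set(x_train_enc.columns) ^ set(x_test_enc.columns)

# Align test to train: missing dummies are added as 0,
# and dummies unseen during training (color_yellow) are dropped.
x_test_enc = x_test_enc.reindex(columns=x_train_enc.columns, fill_value=0)
```

Another option with the same effect is to concatenate train and test, call pd.get_dummies once, and split the result again; reindexing simply avoids the concat/split bookkeeping.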

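Type 3: Statistical Tests and Regression

Hypothesis testing order: state the hypotheses (null/alternative) → set the significance level (typically α = 0.05) → compute the test statistic → compute the p-value → decide whether to reject the null hypothesis.³
p-value > 0.05: fail to reject the null hypothesis (often loosely called "accepting" it). p-value ≤ 0.05: reject the null hypothesis.³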
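
The testing steps above can be sketched with a one-sample t-test. The choice of test, the sample values, and the hypothesized mean of 50 are all assumptions for illustration; the notes do not fix a particular test.

```python
from scipy import stats

# Hypothetical measurements.
sample = [51.2, 49.8, 52.5, 50.9, 53.1, 51.7, 50.4, 52.0, 51.5, 52.8]

# Step 1: hypotheses. H0: population mean == 50, H1: population mean != 50.
popmean = 50.0

# Step 2: significance level.
alpha = 0.05

# Steps 3-4: test statistic and two-sided p-value.
t_stat, p_value = stats.ttest_1samp(sample, popmean=popmean)

# Step 5: decision rule from the notes.
decision = 'reject H0' if p_value <= alpha else 'fail to reject H0'
print(round(float(t_stat), 3), decision)
```

Other tests in scipy.stats (for example ttest_ind or ttest_rel) return a statistic and p-value in the same way, so the decision step stays unchanged.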
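Regression performance: R² (coefficient of determination) = SSR/SST = 1 - SSE/SST, where SSR, SSE, and SST are the regression, error, and total sums of squares. For least squares with an intercept it lies between 0 and 1 on the training data, and higher is better. R² never decreases as independent variables are added, so use adjusted R² when comparing models with different numbers of variables.⁴
Regularized regression: Ridge (L2 penalty; coefficients shrink toward 0), Lasso (L1 penalty; coefficients can become exactly 0, which gives a variable-selection effect), Elastic Net (Ridge + Lasso combined).⁴
Multicollinearity: VIF ≥ 10 is the usual rule of thumb for its presence. The usual remedy is to remove the offending independent variable.⁴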
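
As a rough check of the R², adjusted R², and VIF claims, the sketch below computes all three by hand on synthetic data. The data-generating setup is an assumption, and VIF is obtained from the diagonal of the inverse correlation matrix of the predictors, which is equivalent to 1 / (1 - R_j²) for each predictor j.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed): x2 is nearly a copy of x1, x3 is independent.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + 2 * x3 + rng.normal(scale=0.5, size=n)

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

# Sums of squares: total, error (residual), and regression (explained).
sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)

# R² = 1 - SSE/SST, which equals SSR/SST for OLS with an intercept.
r2 = 1 - sse / sst

# Adjusted R² penalizes the number of predictors p.
p = X.shape[1]
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# VIF for each predictor from the inverse correlation matrix.
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))
print(vif.round(1))
```

Here the VIFs of x1 and x2 should come out far above 10 while x3 stays near 1, so dropping either x1 or x2 would be the remedy suggested by the rule of thumb.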

Examples / Code

Full Type 2 pipeline:

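import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# Assumes df (labelled training data with a 'target' column) and
# x_test (unlabelled test features) are already loaded as DataFrames.

# Preprocessing: separate features and target, then one-hot encode
x = df.drop(columns=['target'])
y = df['target']
x = pd.get_dummies(x)
# Assumed step: encode the test set the same way and align its columns
x_test = pd.get_dummies(x_test).reindex(columns=x.columns, fill_value=0)

# Train/validation split, stratified on the target
x_train, x_val, y_train, y_val = train_test_split(
    x, y, stratify=y, random_state=2023, test_size=0.2
)

# Modeling (random_state added for reproducible results)
model = RandomForestClassifier(random_state=2023)
model.fit(x_train, y_train)

# Evaluation on the validation set
y_pred = model.predict(x_val)
print(accuracy_score(y_val, y_pred))
print(f1_score(y_val, y_pred, average='macro'))

# Final prediction and submission file
y_result = model.predict(x_test)
pd.DataFrame({'pred': y_result}).to_csv('result.csv', index=False)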
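
The pipeline above covers classification. For the regression case listed under Type 2 (RandomForestRegressor with R², MSE, RMSE), a parallel sketch might look like the following; the synthetic data and the column names x1, x2, and group are assumptions made so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic training data (assumed): two numeric features, one categorical.
rng = np.random.default_rng(2023)
df = pd.DataFrame({
    'x1': rng.normal(size=300),
    'x2': rng.normal(size=300),
    'group': rng.choice(['a', 'b', 'c'], size=300),
})
df['target'] = 3 * df['x1'] - 2 * df['x2'] + rng.normal(scale=0.5, size=300)

# Preprocessing: one-hot encode features, keep the numeric target as is.
x = pd.get_dummies(df.drop(columns=['target']))
y = df['target']

# No stratify here: the target is continuous.
x_train, x_val, y_train, y_val = train_test_split(
    x, y, random_state=2023, test_size=0.2
)

model = RandomForestRegressor(random_state=2023)
model.fit(x_train, y_train)

# Regression metrics from the notes: MSE, RMSE, R².
y_pred = model.predict(x_val)
mse = mean_squared_error(y_val, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_val, y_pred)
print(mse, rmse, r2)
```

RMSE is taken as the square root of MSE with NumPy, which works across scikit-learn versions without relying on version-specific keyword arguments.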

Footnotes

¹ content/Study/빅분기/2023-11-27-빅분기-실기-1유형.md
² content/Study/빅분기/2023-11-30-빅분기-실기-2유형.md
³ content/Study/빅분기/2023-12-02-빅분기-실기-3유형.md
⁴ content/Study/빅분기/2023-09-22-빅분기.md
