PM

Model: KPFBERT (SNU NLP · KR-FinBERT SC pretrained weights → financial-domain fine-tuning)

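| Item | Description |
| --- | --- |
| Pretraining corpus | 11 GB of Korean financial disclosures, news, and reports + general Korean BERT corpus |
| Fine-tuning data | • KoSELF 2.0 ± Lexicon (15 K) • KNU Korean sentiment lexicon (7 K) • merged into `merged_lexicon` (positive 7,474 / negative 6,984 / neutral 1,642) |
| Advantages | • Token distribution specialized for the financial domain • Reuses BERT-family weights → fast convergence • The P(pos) - P(neg) score is read directly from the class probabilities |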

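Data pipeline

| Stage | Implementation details | Notes |
| --- | --- | --- |
| Collection | 30 K ETF/securities news articles as CSV (brokerage API · RSS) + additional Google News RSS crawling | Offsets class imbalance |
| Filtering | Regex removal of HTML tags, URLs, emails, and special characters → collapse repeated whitespace | `clean(x)` |
| Deduplication | Compare hashes of title + url | 3.2 % duplicates |
| Lexicon labeling | KoSELF (±1), KNU (+1/-1/0) → weighted average `polarity` | If the other lexicon also has a score, weights are 0.7 : 0.3 |
| Label encoding | neg 0, neu 1, pos 2 | `label2id` |
| Train/Val split | `sklearn.model_selection.train_test_split`, stratify=label, val 5 % | seed 42 |
| HF Dataset | `Dataset.from_pandas({"text","label"})` → map(tokenize) | WordPiece, max_len 128 |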
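The filtering, deduplication, and lexicon-labeling stages can be sketched in plain Python. This is a minimal illustration, not the repository's code: the regex patterns, the SHA-256 hash, the assignment of 0.7 to KoSELF and 0.3 to KNU, and the sign-based `to_label` rule are all assumptions, since the table only names `clean(x)`, a title + url hash, a 0.7 : 0.3 weighting, and the `label2id` mapping.

```python
import hashlib
import re

# Hypothetical patterns; the repository's actual clean() may differ.
_TAG = re.compile(r"<[^>]+>")
_URL = re.compile(r"https?://\S+|www\.\S+")
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
_SPECIAL = re.compile(r"[^0-9A-Za-z가-힣\s.,%+\-]")
_SPACES = re.compile(r"\s+")

# Mapping stated in the table above.
label2id = {"neg": 0, "neu": 1, "pos": 2}


def clean(text: str) -> str:
    """Remove HTML tags, URLs, emails, special characters; collapse whitespace."""
    for pattern in (_TAG, _URL, _EMAIL, _SPECIAL):
        text = pattern.sub(" ", text)
    return _SPACES.sub(" ", text).strip()


def dedup_key(title: str, url: str) -> str:
    """Hash of title + url (SHA-256 is an assumed choice of hash)."""
    return hashlib.sha256(f"{title}\t{url}".encode("utf-8")).hexdigest()


def drop_duplicates(rows):
    """Keep the first row per (title, url) hash; rows are dicts with both keys."""
    seen, unique = set(), []
    for row in rows:
        key = dedup_key(row["title"], row["url"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def merge_polarity(koself, knu, w_koself=0.7, w_knu=0.3):
    """Weighted average when both lexicons score an entry, else the one available.

    Which lexicon receives 0.7 is an assumption; pass None for a missing score.
    """
    if koself is not None and knu is not None:
        return w_koself * koself + w_knu * knu
    return koself if koself is not None else knu


def to_label(polarity: float) -> int:
    """Assumed rule: the sign of the merged polarity picks the class id."""
    if polarity > 0:
        return label2id["pos"]
    if polarity < 0:
        return label2id["neg"]
    return label2id["neu"]
```

For example, `to_label(merge_polarity(1, -1))` gives the positive class under these assumed weights, because the KoSELF score dominates the average.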

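Model/training settings

| Setting | Value |
| --- | --- |
| Backbone | `snunlp/KR-FinBERT-SC` (BERT-Base 110 M) |
| Head | 3-way Softmax (`num_labels=3`) |
| Optimizer | `torch_optimizer.RAdam`, lr 1e-5, weight_decay 0.01 |
| Batch | train 64 / eval 64 |
| Epoch | 4 epochs (minimum validation loss) |
| Scheduler | linear, no warm-up |
| Score | `score = P(pos) - P(neg) ∈ [-1, 1]`; nonzero values with magnitude below 0.01 are adjusted to ±0.01 |
| Inference pipeline | Uvicorn + FastAPI → `/sentiment` POST `[title]` → `[{"score": …}]` |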
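To make the "linear, no warm-up" scheduler concrete, the sketch below computes the learning rate at a given optimizer step: it starts at the base rate of 1e-5 and decays linearly to zero over all training steps. The training-set size of 10,000 is purely illustrative (the actual size is not stated here), and `steps_per_epoch` and `linear_lr` are hypothetical helper names.

```python
import math


def steps_per_epoch(n_train: int, batch_size: int = 64) -> int:
    """Number of optimizer updates per epoch, counting a final partial batch."""
    return math.ceil(n_train / batch_size)


def linear_lr(step: int, total_steps: int, base_lr: float = 1e-5) -> float:
    """Learning rate after `step` updates: linear decay to 0, no warm-up."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


# Illustrative numbers only: 10,000 training examples is an assumption.
total = steps_per_epoch(10_000) * 4  # 4 epochs at batch size 64
halfway = linear_lr(total // 2, total)  # half of the base learning rate
```

Without warm-up, the very first update already uses the full base rate, which is one reason a small value such as 1e-5 is a common choice in this kind of setup.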

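Sentiment score computation logic

1. Preprocess the input title (`clean`)
2. Tokenize with the BERT tokenizer (max_len 128)
3. Apply Softmax to the logits → probabilities [NEG, NEU, POS]
4. score = prob[POS] - prob[NEG]
5. Adjustment: 0 < score < 0.01 ⇒ 0.01, -0.01 < score < 0 ⇒ -0.01
6. Return the value rounded to two decimal places, `round(x, 2)`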
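The softmax, difference, adjustment, and rounding steps above can be written as a short pure-Python function. This is a sketch under the assumption that the logits arrive ordered [NEG, NEU, POS], matching the label encoding; `softmax` and `sentiment_score` are illustrative names, not necessarily those used in `sentiment.py`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sentiment_score(logits):
    """Map 3-way logits ordered [NEG, NEU, POS] to a score in [-1, 1]."""
    neg, _neu, pos = softmax(logits)
    score = pos - neg
    # Snap tiny nonzero scores to ±0.01 before rounding.
    if 0 < score < 0.01:
        score = 0.01
    elif -0.01 < score < 0:
        score = -0.01
    return round(score, 2)
```

One plausible reason for the adjustment step is that a weak but nonzero leaning would otherwise round to 0.0 and become indistinguishable from an exactly balanced prediction.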

API runtime environment

| Item | Setting |
| --- | --- |
| Container | Docker (Ubuntu 20.04 slim) |
| Server | FastAPI 0.111 + Uvicorn 0.29, `--workers 4` |
| Memory | EC2 `t2.micro` (2 GiB) + Swap 10 GiB |
| Health check | `/health` → `{"status":"ok"}` |
| Example call | POST `/sentiment` `["ETF 수익률 급등","경기침체 우려"]` → `[{"score":0.71},{"score":-0.42}]` |
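A client for the `/sentiment` endpoint can be sketched with the standard library alone. The base URL `http://localhost:8000` and the helper names are assumptions; only the path, the JSON array of titles in the request, and the list of `{"score": …}` objects in the response come from the table above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed host and port


def build_sentiment_request(titles, base_url=BASE_URL):
    """Build a POST request whose JSON body is a list of titles."""
    body = json.dumps(titles, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/sentiment",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_scores(response_body: bytes):
    """Extract the score from each {"score": ...} object in the response."""
    return [item["score"] for item in json.loads(response_body)]


def fetch_scores(titles, base_url=BASE_URL, timeout=10.0):
    """Send the titles to a running server and return one score per title."""
    request = build_sentiment_request(titles, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_scores(response.read())
```

With the server running, `fetch_scores(["ETF 수익률 급등", "경기침체 우려"])` would return one score per title, in input order.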



Newgnal/PM-DA
