e9t/nsmc


Naver sentiment movie corpus v1.0

This is a movie review dataset in the Korean language. Reviews were scraped from Naver Movies.

The dataset was constructed following the method described in the Large Movie Review Dataset by Maas et al., 2011.

Data description

  • Each file consists of three columns: id, document, label
    • id: The review id, provided by Naver
    • document: The actual review
    • label: The sentiment class of the review (0: negative, 1: positive)
    • Columns are delimited with tabs (i.e., .tsv format; the file extension is .txt so that novices can open the files easily)
  • 200K reviews in total
    • ratings.txt: All 200K reviews
    • ratings_test.txt: 50K reviews held out for testing
    • ratings_train.txt: 150K reviews for training
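
The column layout above can be read with Python's standard csv module. The following is a minimal sketch, not part of the original repository: the `Review` type and the `read_reviews` function are hypothetical names, and the sample rows are made up for illustration. Quoting is disabled on the assumption that quote characters inside a review are ordinary text.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class Review(NamedTuple):
    """One row of the corpus (hypothetical helper type)."""
    id: str
    document: str
    label: int  # 0: negative, 1: positive


def read_reviews(stream: TextIO) -> Iterator[Review]:
    """Yield one Review per data row of a tab-delimited stream with a header row."""
    # QUOTE_NONE: treat quote characters inside reviews as ordinary text.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield Review(row["id"], row["document"], int(row["label"]))


# Tiny in-memory sample in the same three-column layout (made-up rows).
sample = io.StringIO(
    'id\tdocument\tlabel\n'
    '1\tgreat "movie"\t1\n'
    '2\tboring\t0\n'
)
reviews = list(read_reviews(sample))
print(reviews[0])
```

To read an actual file, pass an open handle instead of the sample, for example `open("ratings_train.txt", encoding="utf-8", newline="")`. The UTF-8 encoding is an assumption here, since the description above does not state it.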

Characteristics

  • All reviews are shorter than 140 characters
  • Each sentiment class is sampled equally (i.e., random guessing yields 50% accuracy)
    • 100K negative reviews (originally reviews of ratings 1-4)
    • 100K positive reviews (originally reviews of ratings 9-10)
    • Neutral reviews (originally reviews of ratings 5-8) are excluded
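
The rating ranges above amount to a simple mapping from a rating to a binary label. The sketch below illustrates that mapping and a class balance check. The names `rating_to_label` and `class_counts` are hypothetical, the 1 to 10 rating scale is inferred from the ranges listed, and this is not the script used to build the corpus.

```python
from collections import Counter
from typing import Iterable, Optional


def rating_to_label(rating: int) -> Optional[int]:
    """Map a rating (assumed to be 1-10) to 0, 1, or None for excluded reviews."""
    if not 1 <= rating <= 10:
        raise ValueError(f"rating out of range: {rating}")
    if rating <= 4:
        return 0  # negative: ratings 1-4
    if rating >= 9:
        return 1  # positive: ratings 9-10
    return None  # neutral: ratings 5-8 are excluded


def class_counts(labels: Iterable[Optional[int]]) -> Counter:
    """Count labels, ignoring excluded (None) entries."""
    return Counter(label for label in labels if label is not None)


# Made-up ratings, for illustration only.
ratings = [1, 4, 5, 8, 9, 10]
labels = [rating_to_label(r) for r in ratings]
counts = class_counts(labels)
print(labels)
print(counts[0], counts[1])
```

Applied to the label column of ratings.txt, `class_counts` should report the same count for both classes if the equal sampling described above holds.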

Quick peek

$ head ratings_train.txt
id      document        label
9976970 아 더빙.. 진짜 짜증나네요 목소리        0
3819312 흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나        1
10265843        너무재밓었다그래서보는것을추천한다      0
9045019 교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정       0
6483659 사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다  1
5403919 막 걸음마 뗀 3세부터 초등학교 1학년생인 8살용영화.ㅋㅋㅋ...별반개도 아까움.     0
7797314 원작의 긴장감을 제대로 살려내지못했다.  0
9443947 별 반개도 아깝다 욕나온다 이응경 길용우 연기생활이몇년인지..정말 발로해도 그것보단 낫겟다 납치.감금만반복반복..이드라마는 가족도없다 연기못하는사람만모엿네       0
7156791 액션이 없는데도 재미 있는 몇안되는 영화 1

License

CC0
