Just came across this repo, it's 👍
I work with Chinese data in R and Python, and have struggled often with encoding issues in R, so figured I'd share some of the things I've come across and learned in the process. Most of the stuff below is related to file system functions and identifying files. Feel free to add any of this to the repo if you'd like (or not, it's all good).
list.files() versus Sys.glob()
On my system, list.files() fails to preserve Chinese characters in file names; Sys.glob() keeps them intact, so use that instead. Here's an example:
# Create a new file with Chinese chars in the file name.
temp_dir <- tempdir()
file_conn <- file(paste0(temp_dir, "/假文件名.txt"))
writeLines("cats", file_conn)
close(file_conn)
# :-(
# Note: "pattern" is a regular expression, not a glob.
files <- list.files(temp_dir, pattern = "\\.txt$")
files
#> [1] "????.txt"
# :-)
files <- Sys.glob(file.path(temp_dir, "*.txt"))
basename(files)
#> [1] "假文件名.txt"
# Clean-up: remove only the file created above.
unlink(file.path(temp_dir, "假文件名.txt"))
Issue with Chinese Parenthesis Chars in File Names
library(fs)
# This is a file that exists in my current working directory, it contains
# Chinese parenthesis.
file_name <- "假文件名(12家).txt"
# Checking if it exists fails.
file.exists(file_name)
#> [1] FALSE
fs::is_file(file_name)
#> 假文件名(12家).txt
#> FALSE
## The issue is that the functions above treat the Chinese parenthesis chars
## in the object "file_name" as English (ASCII) parenthesis chars.
## Converting between UTF-8 and integer code points helps identify the file.
# Take a string containing ASCII parentheses and replace them with
# full-width Chinese parentheses, working on integer code points.
cn_paren <- function(x) {
  x_int <- utf8ToInt(x)
  x_int[x_int == 40] <- 65288  # "(" -> U+FF08
  x_int[x_int == 41] <- 65289  # ")" -> U+FF09
  intToUtf8(x_int)
}
file_name_cn <- cn_paren(file_name)
# Testing for existence now works.
file.exists(file_name_cn)
#> [1] TRUE
fs::is_file(file_name_cn)
#> 假文件名(12家).txt
#> TRUE
# Can see a slight difference in the parenthesis chars in the two strings
# when printed.
file_name
#> [1] "假文件名(12家).txt"
file_name_cn
#> [1] "假文件名(12家).txt"
# We also see the difference when using utf8ToInt()
utf8ToInt(file_name)
#> [1] 20551 25991 20214 21517 40 49 50 23478 41 46 116 120 116
utf8ToInt(file_name_cn)
#> [1] 20551 25991 20214 21517 65288 49 50 23478 65289 46 116 120 116
Package fs
The fs package is great: a few times it has been able to identify a file on my PC where base functions failed. I often use fs::is_file() in place of base::file.exists(), and fs::file_copy() in place of base::file.copy().
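A minimal sketch of those substitutions, assuming fs is installed. The scratch directory "fs-demo" and the file names are made up for illustration, and the names are written with \u escapes so the script does not depend on how the source file was saved. fs::dir_ls() is included as a possible stand-in for list.files(); whether it behaves better on a given Windows setup is something to check rather than assume.

```r
library(fs)

# Hypothetical scratch directory inside the session temp dir.
work_dir <- fs::path(fs::path_temp(), "fs-demo")
fs::dir_create(work_dir)

# "\u5047\u6587\u4ef6\u540d" is the same four characters used in the examples above.
src <- fs::path(work_dir, "\u5047\u6587\u4ef6\u540d.txt")
dst <- fs::path(work_dir, "\u5047\u6587\u4ef6\u540d-copy.txt")
fs::file_create(src)

# fs counterparts of file.exists(), file.copy() and list.files()
src_found <- fs::is_file(src)
fs::file_copy(src, dst, overwrite = TRUE)
txt_files <- fs::dir_ls(work_dir, glob = "*.txt")

unname(src_found)
length(txt_files)

# Clean-up: remove the scratch directory and its contents.
fs::dir_delete(work_dir)
```

Note that fs::is_file() returns a named logical vector (the name is the path), which is why its printed output above shows the file name on one line and the value on the next.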
This Kevin Ushey Blog Post
This blog post by Kevin Ushey on string encoding in R is fantastic (and the comments are full of info as well).
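In the same spirit, here is a small base R sketch (not taken from the blog post) for spotting lookalike characters in a string by their code points. The helper name non_ascii_chars() is made up for illustration; the two strings mirror the file names from the parenthesis example, written with \u escapes so nothing depends on the encoding of the script file.

```r
# Same name twice: once with ASCII parentheses, once with full-width ones.
ascii_name <- "\u5047\u6587\u4ef6\u540d(12\u5bb6).txt"
wide_name  <- "\u5047\u6587\u4ef6\u540d\uff0812\u5bb6\uff09.txt"

# Hypothetical helper: position and code point of every non-ASCII character.
non_ascii_chars <- function(x) {
  cp <- utf8ToInt(enc2utf8(x))
  keep <- cp > 127L
  data.frame(
    position   = which(keep),
    code_point = sprintf("U+%04X", cp[keep]),
    stringsAsFactors = FALSE
  )
}

non_ascii_chars(wide_name)

# Code points that appear in one name but not in the other.
setdiff(utf8ToInt(wide_name), utf8ToInt(ascii_name))
#> [1] 65288 65289
```

Comparing code points this way avoids relying on how the console renders the two kinds of parentheses, which can look nearly identical.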
System Info
And here's my system/locale info:
Sys.getlocale()
#> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
getOption("encoding")
#> [1] "native.enc"
sessionInfo()
#> R version 3.5.0 (2018-04-23)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows >= 8 x64 (build 9200)
#> Matrix products: default
#> locale:
#> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#> loaded via a namespace (and not attached):
#> [1] compiler_3.5.0 tools_3.5.0 fs_1.2.3 yaml_2.1.19 Rcpp_0.12.17
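The locale above uses code page 1252, a single-byte Western European encoding that has no Chinese characters, which is a likely reason for the "????" output earlier. A rough sketch for checking this on another machine follows; both helper names are made up for illustration, and the iconv() round trip is only an approximation of what individual file system functions do internally.

```r
# Hypothetical helper: TRUE when the session's native encoding is UTF-8.
is_utf8_session <- function() {
  isTRUE(l10n_info()[["UTF-8"]])
}

# Hypothetical helper: TRUE for each string that can be converted to the
# native encoding (iconv() returns NA when a conversion is not possible).
fits_native_encoding <- function(x) {
  native <- iconv(enc2utf8(x), from = "UTF-8", to = "")
  !is.na(native)
}

is_utf8_session()
fits_native_encoding(c("cats.txt", "\u5047\u6587\u4ef6\u540d.txt"))
```

If the second element comes back FALSE, file names containing those characters cannot be represented in the native encoding, and functions that convert names to it may lose them.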