Dear massimoaria,
I am currently trying to import data retrieved from both OpenAlex and Scopus into bibliometrix. Using `convert2df` didn't work, so I also tried renaming some columns of the dataset to match the format requested in your documentation, which didn't work either. In the biblioshiny interface I followed these steps: Data > Import or Load > Import raw file(s) > Scopus > Surname and Initials. When I then upload my file, I either get an error message telling me that it contains 0 rows, or it loads for ages without any result.
I have therefore included my code below and hope that you can help me with this issue. I had a similar issue with OpenAlex, which I could solve by removing rows from the dataset. That is not my goal, though: I would rather keep those rows, since they contain data I actually want to import into bibliometrix for an overview/analysis there. If you need that code as well, please let me know.
Code Scopus:
(here would be a part where I retrieve the data from Scopus via query, this works totally fine, I just didn't want to provide the search string etc. as we will use it for a project)
scopus_xml <- searchByString(
  string = scopus_query,
  outfile = here("local_data/scopus_API_export.xml"))
response <- httr::GET("https://api.elsevier.com/content/search/scopus",
  query = list(apiKey = scopus_api_key, query = "test"))
httr::content(response, "text")
## Checking Quality and Duplicates
```{r}
# packages this script appears to rely on (assumed; frq() is taken to be sjmisc::frq)
library(dplyr)
library(here)
library(sjmisc)

# convert xml to dataframe
scopus <- list(
  xml = scopus_xml,
  api = extractXML(scopus_xml)
)

# quality control
scopus$api %>%
  skimr::skim()

# check for duplicates based on DOI
scopus$api %>%
  filter(!is.na(doi)) %>% # exclude cases without DOI
  group_by(doi) %>%
  summarise(n = n()) %>%
  frq(n, sort.frq = "desc")

duplicates <- list()

# extract duplicated DOIs
duplicates$api$doi$string <- scopus$api %>%
  filter(!is.na(doi)) %>%
  group_by(doi) %>%
  summarise(n = n()) %>%
  filter(n > 1) %>%
  pull(doi)

# extract cases with duplicated DOIs
duplicates$api$doi$data <- scopus$api %>%
  filter(doi %in% duplicates$api$doi$string)
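## Cleaning (remove duplicates)
# remove duplicates based on DOI
# (note: distinct() treats NA as a value, so all records without a DOI collapse into a single row)
scopus$raw <- scopus$api %>%
  distinct(doi, .keep_all = TRUE)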
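# A possible alternative, as a minimal sketch with made-up data (the data frame
# `toy` and the result `toy_dedup` are hypothetical, not part of my dataset):
# keep every record without a DOI and drop only true DOI duplicates.
# This is base R and only illustrates the idea; it is not what I ran above.
toy <- data.frame(
  doi   = c("10.1/a", "10.1/a", NA, NA, "10.1/b"),
  title = c("A", "A (duplicate)", "no DOI 1", "no DOI 2", "B"),
  stringsAsFactors = FALSE
)
# keep a row if its DOI is missing or if it is the first occurrence of that DOI
toy_dedup <- toy[is.na(toy$doi) | !duplicated(toy$doi), ]
nrow(toy_dedup) # 4: one duplicate removed, both records without a DOI kept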
# verify that no duplicated DOIs remain
scopus$raw %>%
  filter(!is.na(doi)) %>% # exclude cases without DOI
  group_by(doi) %>%
  summarise(n = n()) %>%
  frq(n, sort.frq = "desc")

# make sure the output directory exists, then save as RData as a backup
dir.create("e:/02_leuner/PROMISE_WP1/PROMISE_WP1/local_data", recursive = TRUE, showWarnings = FALSE)
save(scopus, file = here("local_data", "scopus_data.RData"))
## Temporary CSV file for bibliometrix
# install.packages("XML") # run once if the package is missing
library(XML)
bibliometrix::biblioshiny()

# attempt 1: save as CSV
write.csv(scopus$raw, here("local_data", "scopus_temp.csv"), row.names = FALSE)
# error message bibliometrix: "replacement has 0 rows, data has 4732"

# attempt 2: save as text with a .bib extension
# (note: unlist() writes every cell value on its own line, so this file is not valid BibTeX)
writeLines(unlist(scopus$raw), here("local_data/scopus_raw.bib"))
bibliometrix::convert2df(here("local_data", "scopus_raw.bib"), dbsource = "scopus", format = "bibtex")
# error message: "Error in data.frame() arguments imply differing number of rows: 0, 1"
# (note: load() only reads .RData files, so it cannot read this text file)
# scopus_bibtex <- load(here("local_data/scopus_raw.bib"))
# analysis: why does the import into bibliometrix not work?
# find rows with NAs in scopus$raw
scopus$raw[apply(scopus$raw, 1, function(x) any(is.na(x))), ]
scopus$raw %>%
  skimr::skim()
scopus$raw$pmid

scopus$test <- scopus$raw
# count records without a DOI
scopus$test %>%
  filter(is.na(doi)) %>%
  group_by(doi) %>%
  summarise(n = n()) %>%
  frq(n, sort.frq = "desc")
scopus$test %>%
  skimr::skim()
which(is.na(scopus$test$doi))
scopus$test[350, ]
# remove it (row 350) because its DOI is NA
scopus$test <- scopus$test[!is.na(scopus$test$doi), ]
scopus$test %>%
  skimr::skim()
# that didn't work in bibliometrix either, so next I changed the column names in test
# articletitle -> documenttitle
scopus$test <- scopus$test %>%
  rename(documenttitle = articletitle)
# didn't work either
# 1. rename columns
scopus$test <- scopus$test %>% rename(pm = pmid)
scopus$test <- scopus$test %>% rename(
  Authors = authors,
  Title = documenttitle,
  Year = year,
  `Source title` = journal,
  Volume = volume,
  Issue = issue,
  Pages = pages,
  DOI = doi,
  Affiliations = affiliations,
  Abstract = abstract,
  `Document Type` = ptype
)
scopus$test <- scopus$test %>% rename(Keywords = keywords)
# 2. export and quick check of the CSV file
write.csv(scopus$test, "scopus_test.csv", row.names = FALSE, fileEncoding = "UTF-8")
readLines("scopus_test.csv", n = 3)
# 3. select relevant columns and clean
scopus_biblio <- scopus$test %>% select(
  Authors, Title, Year, `Source title`, Volume, Issue, Pages,
  DOI, Affiliations, Abstract, `Document Type`, Keywords
)
scopus_biblio[scopus_biblio == "NA"] <- ""
scopus_biblio[is.na(scopus_biblio)] <- ""
# 4. export without and with quotes
# (both calls write to the same file, so only the quoted version remains on disk)
write.csv(scopus_biblio, "scopus_biblio.csv", row.names = FALSE, fileEncoding = "UTF-8", quote = FALSE)
write.csv(scopus_biblio, "scopus_biblio.csv", row.names = FALSE, fileEncoding = "UTF-8", quote = TRUE)
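# Why the unquoted export is risky, as a small sketch with made-up data
# (`toy_csv` and the two temporary files are hypothetical): a comma inside a
# title turns into an extra column when quote = FALSE.
toy_csv <- data.frame(Title = "Networks, graphs and maps", Year = 2020,
                      stringsAsFactors = FALSE)
f_unquoted <- tempfile(fileext = ".csv")
f_quoted   <- tempfile(fileext = ".csv")
write.csv(toy_csv, f_unquoted, row.names = FALSE, quote = FALSE)
write.csv(toy_csv, f_quoted,   row.names = FALSE, quote = TRUE)
# fields per line (header first, then the data row)
n_unquoted <- count.fields(f_unquoted, sep = ",", quote = "\"", comment.char = "")
n_quoted   <- count.fields(f_quoted,   sep = ",", quote = "\"", comment.char = "")
n_unquoted # 2 3: the data row has one field too many
n_quoted   # 2 2: header and data row agree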
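# 5. check number of columns
# count.fields() respects quoted fields; splitting on "," would miscount
# titles or abstracts that contain commas
num_cols <- count.fields("scopus_biblio.csv", sep = ",", quote = "\"", comment.char = "")
table(num_cols, useNA = "ifany")
which(num_cols != 12)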
# 6. import and rename columns into scopus format for bibliometrix
df <- read.csv("scopus_biblio.csv", fileEncoding = "UTF-8", stringsAsFactors = FALSE)
colnames(df) <- c("AU", "TI", "PY", "SO", "VL", "IS", "BP", "DI", "C1", "AB", "DT", "DE")
write.csv(df, "scopus_biblio_scopusformat.csv", row.names = FALSE, fileEncoding = "UTF-8", quote = TRUE)
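# A more defensive variant of step 6, as a sketch: map the columns by name
# instead of by position, so a changed column order cannot silently mislabel a
# field. The lookup `tag_map` and the data frame `toy_df` are hypothetical; the
# tags are the ones used above. With read.csv(), check.names = FALSE would be
# needed so that names containing spaces survive the import.
tag_map <- c(
  Authors = "AU", Title = "TI", Year = "PY", `Source title` = "SO",
  Volume = "VL", Issue = "IS", Pages = "BP", DOI = "DI",
  Affiliations = "C1", Abstract = "AB", `Document Type` = "DT", Keywords = "DE"
)
toy_df <- data.frame(
  Title = "Some title", Authors = "Doe J.", Year = 2020,
  check.names = FALSE, stringsAsFactors = FALSE
)
# rename only the columns that have an entry in the lookup
known <- names(toy_df) %in% names(tag_map)
names(toy_df)[known] <- unname(tag_map[names(toy_df)[known]])
names(toy_df) # "TI" "AU" "PY"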
# 7. quality check: use ";" as separator and fill empty fields
df$C1 <- gsub("\\|", ";", df$C1)
df$C1[df$C1 == "" | is.na(df$C1)] <- "Unknown"
df$AU <- gsub("\\|", ";", df$AU)
df$DE <- gsub("\\|", ";", df$DE)
df[is.na(df)] <- "Unknown"
df[df == ""] <- "Unknown"
# 8. export for bibliometrix
write.csv(df, "export_scopus.csv", row.names = FALSE, fileEncoding = "UTF-8", quote = TRUE)
write.table(df, "export_scopus_clean.csv",
  sep = ",",
  row.names = FALSE,
  col.names = TRUE,
  quote = TRUE,
  fileEncoding = "UTF-8")
# 9. formatting check and test dataset
# count fields per line; commas inside quoted fields are not treated as separators
num_fields <- count.fields("export_scopus_clean.csv", sep = ",", quote = "\"", comment.char = "")
which(num_fields != 12)
readLines("export_scopus_clean.csv", n = 2)
sum(is.na(df))
str(df)
write.csv(df[1:10, ], "test_scopus.csv", row.names = FALSE, fileEncoding = "UTF-8", quote = TRUE)
```