
Fail to import data from Scopus (and OpenAlex) #560

@eziegltrum

Description

Dear massimoaria,
I am currently trying to import data retrieved from both OpenAlex and Scopus into bibliometrix. Using convert2df() didn't work, so I even tried to rename some columns of the dataset according to the format requested in your documentation, which didn't work either. In the biblioshiny interface I did the following steps: Data > Import or Load > Import raw files > Scopus > surname and initials. When I then upload my file, I either get an error message telling me that it contains 0 rows, or it loads for ages without any result.
Therefore I provide the code below and hope that you can help me with this issue. I had a similar issue with OpenAlex, which I could solve by removing rows from the dataset. That is not my goal, though, as I would rather keep those rows: they contain data I actually want to import into bibliometrix for an overview/analysis there. In case you need that code as well, please let me know.
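For reference, this is the call that, as far as I understand the documentation, should work on a file exported directly from the Scopus web interface (the file name below is only a placeholder; my data comes from the API instead, which is presumably where things go wrong):

```{r}
# expected route per the documentation (placeholder file name, not my actual export)
M <- bibliometrix::convert2df(
  file = "scopus_web_export.csv",   # a native Scopus "Export CSV" file
  dbsource = "scopus",
  format = "csv"
)
```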

Scopus code:

(Here would be the part where I retrieve the data from Scopus via a query; this works totally fine, I just didn't want to provide the search string etc., as we will use it for a project.)

```{r}
# note: the code below assumes dplyr, here, httr, skimr and sjmisc (for frq) are already loaded
scopus_xml <- searchByString(
  string = scopus_query,
  outfile = here("local_data/scopus_API_export.xml")
)

# quick test call against the Scopus Search API
response <- httr::GET(
  "https://api.elsevier.com/content/search/scopus",
  query = list(apiKey = scopus_api_key, query = "test")
)
httr::content(response, "text")
```
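Just as a sanity check I also look at the raw API response; this is only a sketch (it assumes the jsonlite package and the field names I see in the Scopus Search API JSON):

```{r}
# sanity check of the API response before any further parsing
httr::status_code(response)   # 200 means the key and query were accepted
res_json <- jsonlite::fromJSON(
  httr::content(response, as = "text", encoding = "UTF-8")
)
# total hit count as reported by the Scopus Search API (assumed field names)
res_json$`search-results`$`opensearch:totalResults`
```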


## Checking Quality and Duplicates


```{r}
# convert XML to data frame
scopus <- list(
  xml = scopus_xml,
  api = extractXML(scopus_xml)
)

# quality control 
scopus$api %>% 
  skimr::skim()

#remove duplicates 

# Check for duplicates based on DOI
scopus$api %>% 
  filter(!is.na(doi)) %>% # exclude cases without DOI
  group_by(doi) %>% 
  summarise(n = n()) %>% 
  frq(n, sort.frq = "desc")

duplicates <- list()

# Extract duplicated IDs
duplicates$api$doi$string <- scopus$api %>%  
  filter(!is.na(doi)) %>%
  group_by(doi) %>%
  summarise(n = n()) %>%
  filter(n > 1) %>% 
  pull(doi)

# Extract cases with duplicated IDs
duplicates$api$doi$data <- scopus$api %>% 
  filter(doi %in% duplicates$api$doi$string)

```

## Cleaning (remove duplicates)

```{r}

# remove duplicates based on DOI
scopus$raw <- scopus$api %>% 
  distinct(doi, .keep_all = TRUE)

scopus$raw %>% 
  filter(!is.na(doi)) %>% # exclude cases without DOI
  group_by(doi) %>% 
  summarise(n = n()) %>% 
  frq(n, sort.frq = "desc")

# save as RData for safety reasons (create the target folder first)
dir.create(here("local_data"), recursive = TRUE, showWarnings = FALSE)
save(scopus, file = here("local_data", "scopus_data.RData"))

```

## Temporary CSV file for bibliometrix

```{r}

bibliometrix::biblioshiny()
install.packages("XML")
library(XML)

bibliometrix::convert2df(here("local_data", "scopus_raw.bib"), dbsource = "scopus", format = "bibtex")
# error message "Error in data.frame() arguments imply differing number of rows: 0, 1"


# save as csv
write.csv(scopus$raw, here("local_data", "scopus_temp.csv"), row.names = FALSE)
# error message in bibliometrix: "replacement has 0 rows, data has 4732"

# save as text (attempt to build a .bib file by dumping the data frame)
writeLines(unlist(scopus$raw), here("local_data/scopus_raw.bib"))
scopus_bibtex <- load(here("local_data/scopus_raw.bib"))   # load() expects an .RData file, so this fails as well
# error message in bibliometrix: "Error: arguments imply differing number of rows: 0, 1"
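# Note on the step above: writeLines(unlist(scopus$raw), ...) dumps every single
# cell of the data frame as a bare text line, so the resulting .bib contains no
# @article{...} entries at all; I assume that is why convert2df(..., format = "bibtex")
# ends up with 0 rows. A hand-rolled sketch of what a minimal BibTeX export could
# look like (the column names are the ones from my own API data frame, and there is
# no escaping of special characters, so this is only a sketch):
bib_entries <- sprintf(
  "@article{scopus%d,\n  author = {%s},\n  title = {%s},\n  journal = {%s},\n  year = {%s},\n  doi = {%s}\n}",
  seq_len(nrow(scopus$raw)),
  scopus$raw$authors, scopus$raw$articletitle, scopus$raw$journal,
  scopus$raw$year, scopus$raw$doi
)
writeLines(bib_entries, here("local_data/scopus_raw_minimal.bib"))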

# analysis of why the import into bibliometrix doesn't work
# find NAs in scopus$raw

scopus$raw[apply(scopus$raw, 1, function(x) any(is.na(x))), ]

scopus$raw %>% 
  skimr::skim()

scopus$raw$pmid

scopus$test <- scopus$raw 

scopus$test %>%
  filter(is.na(doi)) %>% 
    group_by(doi) %>% 
  summarise(n = n()) %>% 
  frq(n, sort.frq = "desc")

scopus$test %>%
    skimr::skim()

which(is.na(scopus$test$doi))

scopus$test[350, ]

#remove it because doi is NA
scopus$test <- scopus$test[-c(350), ]

scopus$test %>%
    skimr::skim()

# didn't work in bibliometrix either, so now we rename columns in test
# articletitle -> documenttitle
scopus$test <- scopus$test %>% 
  rename(documenttitle = articletitle)

#didn't work either 

# 1. rename columns
scopus$test <- scopus$test %>% rename(pm = pmid)
scopus$test <- scopus$test %>% rename(
  Authors = authors,
  Title = documenttitle,
  Year = year,
  `Source title` = journal,
  Volume = volume,
  Issue = issue,
  Pages = pages,
  DOI = doi,
  Affiliations = affiliations,
  Abstract = abstract,
  `Document Type` = ptype
)
scopus$test <- scopus$test %>% rename(Keywords = keywords)

# 2. export and quick check of the CSV file
write.csv(scopus$test, "scopus_test.csv", row.names = FALSE, fileEncoding = "UTF-8")
readLines("scopus_test.csv", n = 3)

# 3. select the relevant columns and clean them
scopus_biblio <- scopus$test %>% select(
  Authors, Title, Year, `Source title`, Volume, Issue, Pages,
  DOI, Affiliations, Abstract, `Document Type`, Keywords
)
scopus_biblio[scopus_biblio == "NA"] <- ""
scopus_biblio[is.na(scopus_biblio)] <- ""

# 4. export without and with quoting (note: both calls write to the same file, so the quoted version overwrites the first)
write.csv(scopus_biblio, "scopus_biblio.csv", row.names = FALSE, fileEncoding = "UTF-8", quote = FALSE)
write.csv(scopus_biblio, "scopus_biblio.csv", row.names = FALSE, fileEncoding = "UTF-8", quote = TRUE)

# 5. check number of columns
lines <- readLines("scopus_biblio.csv")
splits <- strsplit(lines, ",")
num_cols <- sapply(splits, length)
table(num_cols)
which(num_cols != 12)
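# Caveat for the check above: splitting raw lines on "," over-counts whenever a
# quoted field (author list, abstract, ...) itself contains commas, so rows flagged
# here may actually be fine. Base R's count.fields() respects the quoting:
fields_per_row <- count.fields("scopus_biblio.csv", sep = ",", quote = "\"", comment.char = "")
table(fields_per_row)            # should be 12 for every row if the CSV is well formed
which(fields_per_row != 12)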

# 6. import and rename columns into scopus format for bibliometrix
df <- read.csv("scopus_biblio.csv", fileEncoding = "UTF-8", stringsAsFactors = FALSE)
colnames(df) <- c("AU", "TI", "PY", "SO", "VL", "IS", "BP", "DI", "C1", "AB", "DT", "DE")
write.csv(df, "scopus_biblio_scopusformat.csv", row.names = FALSE, fileEncoding = "UTF-8", quote = TRUE)

# 7. quality check
df$C1 <- gsub("\\|", ";", df$C1)
df$C1[df$C1 == "" | is.na(df$C1)] <- "Unknown"
df$AU <- gsub("\\|", ";", df$AU)
df$DE <- gsub("\\|", ";", df$DE)
df[is.na(df)] <- "Unknown"
df[df == ""] <- "Unknown"

# 8. export for bibliometrix
write.csv(df, "export_scopus.csv", row.names = FALSE, fileEncoding = "UTF-8", quote = TRUE)
write.table(df, "export_scopus_clean.csv",
            sep = ",",
            row.names = FALSE,
            col.names = TRUE,
            quote = TRUE,
            fileEncoding = "UTF-8")

# 9. formatting control & test dataset
lines <- readLines("export_scopus_clean.csv", encoding = "UTF-8")
num_commas <- sapply(lines, function(x) stringr::str_count(x, ","))
which(num_commas != 11)   # 12 columns -> 11 separating commas (quoted fields containing commas will still inflate the count)
readLines("export_scopus_clean.csv", n = 2)
sum(is.na(df))
str(df)
write.csv(df[1:10, ], "test_scopus.csv", row.names = FALSE, fileEncoding = "UTF-8", quote = TRUE)
```
