newEuropeReviewR

This R data package contains a text corpus of articles from the complete run of the foreign policy weekly The New Europe that were published between 1916 and 1920. It also contains an additional file with frequencies of terms that are associated with countries mentioned in the pages of The New Europe weekly. Although the texts deal mainly with analysis of the wartime developments, many contributions contain early concepts of political changes in Europe that envisioned political borders more closely aligned along the ethnic and linguistic lines. The weekly offered a radical and unconventional view of the postwar Europe. The corpus reveals how the views of The New Europe's authors were anchored in the prewar political theory and how this thinking evolved during the First World War. Crucially, the texts reveal how this thinking impacted the shaping of the new postwar world order at the Paris Peace Conference in 1919.

The New Europe corpus contains 2128 texts: 1442 articles, 411 notes typically unsigned and contributed by the editors, 266 reviews, and 9 letters to the editors. 210 numbered issues were bound in seventeen volumes. On average, there were ten texts in each issue. In the dataset, each text is identified by a sequential identifier, but the bibliographic metadata for each text are also part of the dataset.

The newEuropeReviewR data package contains two datasets, tne_articles and tne_countries.

Data

The tne_articles dataset contains the full-text of published articles together with their metadata (author, the normalised name of the author, title, date of publication, type of text, and volume and issue numbers) per line.

In order to track the geographic distribution of coverage of The New Europe, the tne_countries dataset was created. This dataset contains five columns: the document identifier (which links the two datasets), a two-letter country code, the number of times a term related to the identifier was mentioned, the date when the article was published, and the type of text in The New Europe weekly. The tne_countries dataset was created with a yaml dictionary run against the whole The New Europe corpus using the quanteda package. The dictionary is uploaded to this repository in the directory dictionary.

Installation

remotes::install_github("standap/newEuropeReviewR")

Run the data

After installing newEuropeReviewR, simply write the name of the dataset. The data frame is ready to be used as is. The data are already in the tidy format.

No additional packages are required to use this data package. The quanteda code below is provided for documentation and transparency, showing how the tne_countries dataset was created.

tne_articles

Usage examples

# Five top authors. Most contributions are not signed, often written by Seton-Watson.

tne_articles |> 
  count(author_norm, sort = TRUE) |> 
  head(5)

# Most frequently mentioned countries and the related terms

tne_countries |> 
  count(code, wt = count, sort = TRUE) |> 
  head(5)

Dictionary processing

Example of the hierarchical structure of the yaml dictionary:

EUROPE:
  EAST:
     AH': [Austria-Hungar*, Austro-Hungar*, Ballhausplatz, Habsburg]  
    'BG': [Bulgaria*, Bulgarian*, Bulgars, Sofia] 
    # A historically more relevant entry for Czechoslovakia
    'CS': [Czechoslovak*, Czecho-Slovakia, Czech*, Slovak*, Slovakia, Prague, Bratislava, Ruthenian*, Pressburg, Bohemia, Moravia]  
    #'CZ': [Czech Republic, Czech*, Prague]  
    'HU': [Hungary, Hungarian*, Budapest, Magyar*]  
    'PL': [Poland, Polish, Pole*, Warsaw, Cracow, Danzig]  
    'RO': [Ro*mania, Ro*manian*, Bucharest, Transylvania*, Bessarabia*, Bukovina, Wallachia*, Dobrudja]  
    'JW': [Jews, Jewish, Hebrew, Zionis*, Jerusalem]

# How was the dictionary run
library(quanteda)
articles_corp <- corpus(tne_articles)
articles_tokens <- tokens(articles_corp, 
                        remove_punct = TRUE, 
                        remove_numbers = TRUE)
articles_cln <- tokens_remove(articles_tokens, pattern = stopwords("en"))

dict_newsmap <- dictionary(file = "dictionary/newEurope1918-1920.yml")

countries <- tokens_lookup(articles_cln, dictionary = dict_newsmap, levels = 3)
regions_dfm <- dfm(countries)

Processing of the corpus

The texts downloaded from the HathiTrust Digital Library were unstructured and full of OCR errors. The raw text files, therefore, underwent several processing steps to produce a clean corpus ready for analysis.

The HATHI/Google introductory sections in each volume were removed from the raw downloaded text files.
Peritext sections were annotated, first with a script, and then manually.
The identified peritext segments such as covers, table of contents, running titles, advertisements, indices, colophon, printer statements, and page numbers were removed.
Important metadata, such as titles of the articles, names of authors, and footnotes were also annotated. After the metadata were transferred into the data table, lines with titles, names of authors, and footnotes were removed.
Individual texts were classified into broad categories, e.g., articles, letters to the editor, notes, and reviews.
Introductory paragraphs by the editors for several articles or letters to the editor were removed from the corpus.
The content of the literary and art sections that were introduced toward the end of the review's existence was removed from the corpus.
All texts were manually inspected for typos and OCR errors, which were removed.

Provenance

The corpus was created from seventeen bound volumes in the Hathi Trust Digital Library.

Table: List of bound volumes with number of issues that make up the newEuropeReview corpus.

htid	volume	issues
njp.32101059985000	I	13
umn.31951000751810	II	13
njp.32101059985026	III	13
njp.32101059985034	IV	13
njp.32101059985042	V	13
njp.32101059985059	VI	12
njp.32101059985067	VII	13
njp.32101059985075	VIII	13
njp.32101059985083	IX	13
njp.32101059985091	X	13
njp.32101059985109	XI	13
njp.32101059985117	XII	13
njp.32101059985125	XIII	13
njp.32101059985133	XIV	13
njp.32101059985141	XV	13
njp.32101059985158	XVI	13
njp.32101059985166	XVII	3

Updates

The package was updated on 2026-03-24 when publication dates in tne_articles for three issues, due to OCR errors, were corrected. This is version 1.0.0.

How to cite this work

To cite this package, use the citation information provided in CITATION.cff or click "Cite this repository" in the sidebar.

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
R		R
data-raw		data-raw
data		data
dictionary		dictionary
inst		inst
man		man
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
CITATION.cff		CITATION.cff
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
LICENSE.md		LICENSE.md
NAMESPACE		NAMESPACE
NEWS.md		NEWS.md
README.md		README.md
newEuropeReviewR.Rproj		newEuropeReviewR.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

newEuropeReviewR

Data

Installation

Run the data

Usage examples

Dictionary processing

Processing of the corpus

Provenance

Updates

How to cite this work

About

Licenses found

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

newEuropeReviewR

Data

Installation

Run the data

Usage examples

Dictionary processing

Processing of the corpus

Provenance

Updates

How to cite this work

About

Topics

Resources

License

Licenses found

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages