Contrasting Linguistic Patterns in Human and LLM-Generated News Text

This repository contains the official implementation for the paper "Contrasting Linguistic Patterns in Human and LLM-Generated News Text" by Alberto Muñoz-Ortiz, Carlos Gómez-Rodríguez, and David Vilares, published in Artificial Intelligence Review, Volume 57, Article 265 (2024).

Overview

This project conducts a quantitative analysis contrasting human-written English news text with comparable large language model (LLM) output from six different LLMs covering three families and four sizes. The analysis spans several measurable linguistic dimensions, including morphological, syntactic, psychometric, and sociolinguistic aspects.

Key findings include:

Human texts exhibit more scattered sentence length distributions and greater vocabulary variety.
Human texts show a distinct use of dependency and constituent types, with shorter constituents and more optimized dependency distances.
Human texts tend to express stronger negative emotions than LLM outputs.
LLM outputs use more numbers, symbols, and auxiliaries than human texts.
Sexist bias prevalent in human text is also expressed by LLMs, and even magnified in most of them.
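
As a rough illustration of the first finding, the sketch below computes the spread of sentence lengths and a plain type-token ratio for a short text. It is not the code used in the paper: the regex-based sentence splitter, the whitespace tokenization, and the function names `sentence_lengths` and `type_token_ratio` are simplifying assumptions made for this example only.

```python
import re
import statistics


def sentence_lengths(text):
    """Return the number of whitespace-separated tokens in each sentence.

    Sentences are split naively after '.', '!' or '?' (an assumption;
    a real pipeline would use a proper sentence segmenter).
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


def type_token_ratio(text):
    """Return distinct word forms divided by total word forms (lowercased)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


sample = (
    "The vote passed. Nobody in the chamber had expected such a narrow margin. "
    "Markets fell."
)
lengths = sentence_lengths(sample)
spread = statistics.pstdev(lengths)  # population standard deviation of lengths
ttr = type_token_ratio(sample)
print(lengths, round(spread, 2), round(ttr, 2))
```

A larger standard deviation indicates more scattered sentence lengths, and a higher ratio indicates more varied vocabulary. The plain ratio is sensitive to text length, so comparisons are only meaningful between texts of similar size.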

Project Structure

The repository is organized as follows:

src/: Core analysis utilities.
- utils.py: CoNLL-U parsing and dependency statistics.
- constituency.py: Constituency tree analysis.
- analysis_f.py: Additional analysis functions.
scripts/: Entry-point scripts.
- download_articles.py: Download NYT articles via API.
- generate.py: Generate text using various LLMs.
- parse_classify.py: Parse and classify articles.
notebooks/: Jupyter notebooks for analysis and visualization.
- analysis.ipynb: Main analysis notebook.
data/: Generated articles from different LLMs.
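
To give a sense of what "CoNLL-U parsing and dependency statistics" can involve, here is a minimal sketch that reads one CoNLL-U sentence and computes dependency distances (the absolute difference between a token's position and its head's position). It does not reproduce `src/utils.py`; the function name `dependency_distances` and the sample sentence are assumptions for illustration.

```python
def dependency_distances(conllu_sentence):
    """Return |token_id - head_id| for every non-root token in one sentence.

    Expects the standard 10 tab-separated CoNLL-U columns, where column 1
    is ID and column 7 is HEAD.
    """
    distances = []
    for line in conllu_sentence.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        token_id, head = cols[0], cols[6]
        if not token_id.isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        if head == "0":
            continue  # the root has no head, so no distance
        distances.append(abs(int(token_id) - int(head)))
    return distances


# Hypothetical sample: "The markets fell sharply today"
rows = [
    ["1", "The", "the", "DET", "_", "_", "2", "det", "_", "_"],
    ["2", "markets", "market", "NOUN", "_", "_", "3", "nsubj", "_", "_"],
    ["3", "fell", "fall", "VERB", "_", "_", "0", "root", "_", "_"],
    ["4", "sharply", "sharply", "ADV", "_", "_", "3", "advmod", "_", "_"],
    ["5", "today", "today", "NOUN", "_", "_", "3", "obl:tmod", "_", "_"],
]
sample = "# text = The markets fell sharply today\n" + "\n".join(
    "\t".join(row) for row in rows
)

distances = dependency_distances(sample)
mean_distance = sum(distances) / len(distances)
print(distances, mean_distance)
```

Averaging these distances over a corpus gives one simple measure of how compact its dependencies are.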

Installation

Clone the repository:

git clone https://github.com/amunozo/linguistic_patterns_LLMs.git
cd linguistic_patterns_LLMs

Install dependencies:

pip install transformers stanza torch pandas tqdm requests

Set up the NYT API key (for downloading articles):
- Get your API key from the NYT Developer Portal
- Copy .env.example to .env and add your key
- Important: Never commit your .env file
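
For readers unfamiliar with `.env` files, the sketch below shows one way such a file could be read using only the standard library. The variable name `NYT_API_KEY` and the helpers `parse_env_text` and `load_env_file` are hypothetical; check `.env.example` for the name the scripts actually expect.

```python
import os


def parse_env_text(text):
    """Parse KEY=VALUE lines, ignoring blank lines and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path=".env"):
    """Return the variables defined in `path`, or an empty dict if it is missing."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as handle:
        return parse_env_text(handle.read())


if __name__ == "__main__":
    # NYT_API_KEY is an assumed variable name used only for this example.
    api_key = load_env_file().get("NYT_API_KEY")
    print("key found" if api_key else "no key configured")
```

Keeping the key in an untracked file rather than in the source keeps it out of version control, which is why `.env` should never be committed.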

Usage

Downloading Articles

python scripts/download_articles.py 2023-10 2024-01 data/nyt/
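
The command above appears to take a start month, an end month, and an output directory. Under that assumption, and assuming the end month is inclusive, the range could be expanded into per-month requests as sketched below. The helper names are hypothetical, and the use of the NYT Archive API endpoint is a guess about what the script does rather than something this README confirms.

```python
def month_range(start, end):
    """Return (year, month) pairs from `start` to `end`, both 'YYYY-MM', inclusive."""
    start_year, start_month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    year, month = start_year, start_month
    months = []
    while (year, month) <= (end_year, end_month):
        months.append((year, month))
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return months


def archive_url(year, month):
    """Build an NYT Archive API URL for one month (the API key is passed separately)."""
    return f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"


months = month_range("2023-10", "2024-01")
urls = [archive_url(year, month) for year, month in months]
for url in urls:
    print(url)
```

With the arguments shown in the command, this yields four monthly requests, from October 2023 through January 2024.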

Generating LLM Text

python scripts/generate.py

Parsing and Classification

python scripts/parse_classify.py

Citation

If you use this code or our findings in your research, please cite:

@article{munoz-ortiz-etal-2024-contrasting,
    title = "Contrasting Linguistic Patterns in Human and LLM-Generated News Text",
    author = "Mu{\~n}oz-Ortiz, Alberto and
      G{\'o}mez-Rodr{\'i}guez, Carlos and
      Vilares, David",
    journal = "Artificial Intelligence Review",
    volume = "57",
    number = "10",
    pages = "265",
    year = "2024",
    publisher = "Springer Science and Business Media LLC",
    doi = "10.1007/s10462-024-10903-2",
    url = "https://link.springer.com/article/10.1007/s10462-024-10903-2",
}

Contact

For any questions or issues, please contact the main author: Alberto Muñoz-Ortiz - alberto.munoz.ortiz@udc.es

Acknowledgments

Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. We acknowledge the European Research Council (ERC), which has funded this research under the Horizon Europe research and innovation programme (SALSA, grant agreement No 101100615); SCANNER-UDC (PID2020-113230RB-C21) funded by MICIU/AEI/10.13039/501100011033; Xunta de Galicia (ED431C 2020/11); GAP (PID2022-139308OA-I00) funded by MICIU/AEI/10.13039/501100011033/ and by ERDF, EU; Grant PRE2021-097001 funded by MICIU/AEI/10.13039/501100011033 and by ESF+ (predoctoral training grant associated to project PID2020–113230RB-C21); and Centro de Investigación de Galicia "CITIC", funded by the Xunta de Galicia through the collaboration agreement between the Consellería de Cultura, Educación, Formación Profesional e Universidades and the Galician universities for the reinforcement of the research centres of the Galician University System (CIGUS). Funding for open access charge: Universidade da Coruña/CISUG.
