BKEE: Pioneering Event Extraction in the Vietnamese Language

Overview

BKEE is a dataset for Event Extraction in the Vietnamese language. This repository hosts the raw data together with processed training, development, and test splits that are ready for use with machine learning models. The dataset contributes to Vietnamese Natural Language Processing by addressing the absence of dedicated resources for event extraction tasks.

Dataset Description

BKEE encompasses:

- More than 33 distinct event types covering a wide range of domains
- 28 different event argument roles
- 1,066 labeled documents with annotations for:
  - Entity mentions
  - Event mentions
  - Event arguments

This comprehensive dataset was developed to establish strong baselines for Vietnamese event extraction tasks and facilitate future research in this domain.

Data Structure

Files and Directories

BKEE/
├── data.gz                  # Raw dataset in compressed format
└── processed/               # Processed data directory
    ├── train.json           # Training dataset
    ├── dev.json             # Development/validation dataset
    └── test.json            # Testing dataset

JSON Structure

Each record in a processed file is a JSON object with the following structure:

{
    "doc_id": "test-00000",
    "sent_id": "test-00000",
    "tokens": ["Chiều", "1/12", ",", "anh", "Đặng Duy Thông", "(", "36", "tuổi", ",", "Công an viên", "xã", "Bà Điểm", ",", "huyện", "Hóc Môn", ")", "chạy", "xe máy", "trên", "đường", "Nguyễn Thị Huê", ",", "huyện", "Hóc Môn", "."],
    "sentence": "Chiều 1/12 , anh Đặng Duy Thông ( 36 tuổi , Công an viên xã Bà Điểm , huyện Hóc Môn ) chạy xe máy trên đường Nguyễn Thị Huê , huyện Hóc Môn .",
    "pieces": ["▁Chiều", "▁1", "/12", ",", "▁anh", "▁Đ", "ặng", "▁Duy", "▁Thông", "▁(", "▁36", "▁tuổi", ",", "▁Công", "▁an", "▁viên", "▁xã", "▁Bà", "▁Điểm", ",", "▁huyện", "▁H", "óc", "▁Môn", "▁)", "▁chạy", "▁xe", "▁máy", "▁trên", "▁đường", "▁Nguyễn", "▁Thị", "▁Hu", "ê", ",", "▁huyện", "▁H", "óc", "▁Môn", "."],
    "token_lens": [1, 2, 1, 1, 4, 1, 1, 1, 1, 3, 1, 2, 1, 1, 3, 1, 1, 2, 1, 1, 4, 1, 1, 3, 1],
    "entity_mentions": [],
    "event_mentions": [],
    "relation_mentions": []
}
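
The example record shows a few internal consistency properties: `token_lens` has one entry per token, its entries sum to the number of `pieces`, and `sentence` equals the tokens joined by single spaces. The sketch below checks these properties on a shortened excerpt of that record. Whether they hold for every record in BKEE is an assumption, and `check_record` is a hypothetical helper, not part of the repository.

```python
def check_record(record):
    """Return a list of consistency problems found in one record (empty if none)."""
    problems = []
    tokens = record["tokens"]
    if len(record["token_lens"]) != len(tokens):
        problems.append("token_lens and tokens differ in length")
    if sum(record["token_lens"]) != len(record["pieces"]):
        problems.append("token_lens does not sum to the number of pieces")
    if " ".join(tokens) != record["sentence"]:
        problems.append("sentence is not the space-joined tokens")
    return problems


# Shortened excerpt of the example record above (first five tokens only).
sample = {
    "tokens": ["Chiều", "1/12", ",", "anh", "Đặng Duy Thông"],
    "sentence": "Chiều 1/12 , anh Đặng Duy Thông",
    "pieces": ["▁Chiều", "▁1", "/12", ",", "▁anh", "▁Đ", "ặng", "▁Duy", "▁Thông"],
    "token_lens": [1, 2, 1, 1, 4],
}

print(check_record(sample))
```

Running a check like this over a whole split before training is a cheap way to catch records that were damaged during preprocessing.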

Field Descriptions

doc_id: Unique document identifier
sent_id: Unique sentence identifier
tokens: Array of tokenized words from the sentence
sentence: Complete text of the sentence
pieces: Subword tokenization (commonly used for transformer models)
token_lens: Number of subword pieces each token is split into (one entry per token; the entries sum to the length of pieces)
entity_mentions: Annotated entities in the text
event_mentions: Annotated events in the text
relation_mentions: Annotated relations between entities
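
The `token_lens` field is what links word-level `tokens` to subword `pieces`. As a minimal sketch, the cumulative sum of `token_lens` gives the piece range that belongs to each token. The helper name `token_piece_spans` is hypothetical, and the data is a shortened excerpt of the example record above.

```python
def token_piece_spans(token_lens):
    """Return one (start, end) piece range per token, with end exclusive."""
    spans = []
    start = 0
    for length in token_lens:
        spans.append((start, start + length))
        start += length
    return spans


# First five tokens of the example record and their subword pieces.
tokens = ["Chiều", "1/12", ",", "anh", "Đặng Duy Thông"]
pieces = ["▁Chiều", "▁1", "/12", ",", "▁anh", "▁Đ", "ặng", "▁Duy", "▁Thông"]
token_lens = [1, 2, 1, 1, 4]

spans = token_piece_spans(token_lens)
grouped = [pieces[start:end] for start, end in spans]

for token, group in zip(tokens, grouped):
    print(f"{token!r} -> {group}")
```

These ranges are useful when word-level labels have to be projected onto subword positions, for example by pooling the piece representations inside each range.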

Getting Started

Clone the repository:

git clone https://github.com/nhungnt7/BKEE.git
cd BKEE

Extract the raw data:

gunzip data.gz
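
By default, `gunzip` replaces `data.gz` with the decompressed file. If you would rather keep the archive, the following sketch does the same extraction with Python's standard library. The function name `decompress_gzip` and the output file name `data` are assumptions made for illustration, and nothing here depends on the internal format of the raw data.

```python
import gzip
import shutil


def decompress_gzip(src_path, dst_path):
    """Stream-decompress src_path into dst_path, leaving src_path in place."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


# Example call (the output name "data" is an assumption):
# decompress_gzip("data.gz", "data")
```

Copying through `shutil.copyfileobj` streams the content in chunks, so the archive does not have to fit in memory.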

Usage

Basic Data Loading

Example code to load and explore the dataset:

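import json

def load_jsonl(file_path):
    """Load a JSON Lines file: one JSON object per non-empty line."""
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # Skip blank lines
                data.append(json.loads(line))
    return data

def load_dataset(file_path):
    """Load a processed split stored as either standard JSON or JSONL."""
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)  # Standard JSON file
    except json.JSONDecodeError:
        return load_jsonl(file_path)  # Fall back to JSON Lines
    # A JSONL file with a single line parses as one object; wrap it in a list
    return data if isinstance(data, list) else [data]

# Load the training data
train_data = load_dataset('processed/train.json')

# Print the first record
print(f"Document ID: {train_data[0]['doc_id']}")
print(f"Sentence: {train_data[0]['sentence']}")
print(f"Number of tokens: {len(train_data[0]['tokens'])}")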

Note: The processed files may be in either standard JSON or JSONL (JSON Lines) format. The explore_bkee.py script uses the JSONL format loader.
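
Once a split is loaded as a list of records, a quick summary helps confirm that it looks as expected. The sketch below relies only on the `doc_id`, `tokens`, and `event_mentions` fields described above. The function `summarize_split` is a hypothetical helper, and the records are toy data invented for illustration, not taken from BKEE.

```python
def summarize_split(records):
    """Compute simple size statistics for a list of records."""
    return {
        "records": len(records),
        "documents": len({r["doc_id"] for r in records}),
        "tokens": sum(len(r["tokens"]) for r in records),
        "records_with_events": sum(1 for r in records if r["event_mentions"]),
    }


# Toy records for illustration only; the mention content is a placeholder.
toy_records = [
    {"doc_id": "toy-0", "tokens": ["a", "b"], "event_mentions": []},
    {"doc_id": "toy-0", "tokens": ["c"], "event_mentions": [{"placeholder": True}]},
    {"doc_id": "toy-1", "tokens": ["d", "e", "f"], "event_mentions": []},
]

summary = summarize_split(toy_records)
print(summary)
```

With the real data, pass the list returned by your loader in place of `toy_records`.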

Using the Exploration Script

The repository includes explore_bkee.py, a utility script for analyzing the BKEE dataset.

Run the script:

python explore_bkee.py

The script provides various functionalities:
- Dataset statistics (document counts, token counts, etc.)
- Analysis of event types, entity types, and argument roles
- Search capabilities for finding specific event types or entity types
- Display of example documents with annotations

Script Features

The explore_bkee.py script offers multiple functions for dataset analysis:

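# Import the helpers (assumes this code runs from the repository root and
# that explore_bkee.py can be imported as a module)
from explore_bkee import (
    load_jsonl, get_dataset_statistics, find_examples_with_events,
    display_document, analyze_event_types, search_by_event_type,
)

# Load data from JSONL files
train_data = load_jsonl('processed/train.json')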

# Get dataset statistics
stats = get_dataset_statistics(train_data)
print(f"Documents: {stats['documents']}")
print(f"Event mentions: {stats['event_mentions']}")

# Find examples with event mentions
event_examples = find_examples_with_events(train_data, limit=3)
for example in event_examples:
    display_document(example)

# Analyze event types
event_types = analyze_event_types(train_data)
for event_type, count in event_types.most_common(5):
    print(f"{event_type}: {count} instances")

# Search for specific event types
results = search_by_event_type(train_data, "Conflict:Attack")
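
For readers who want to see roughly what such an analysis involves without opening the script, here is a minimal sketch of counting event types. It is not the implementation in explore_bkee.py. It assumes each entry of `event_mentions` is a dictionary with an `event_type` key, which this README does not confirm, so the key is exposed as a parameter. The function name `count_event_types`, the label `Toy:Other`, and the records are all made up for illustration.

```python
from collections import Counter


def count_event_types(records, type_key="event_type"):
    """Count event mentions per type; type_key is an assumed field name."""
    counts = Counter()
    for record in records:
        for mention in record.get("event_mentions", []):
            counts[mention[type_key]] += 1
    return counts


# Toy records for illustration only.
toy_records = [
    {"event_mentions": [{"event_type": "Conflict:Attack"}]},
    {"event_mentions": []},
    {"event_mentions": [{"event_type": "Conflict:Attack"},
                        {"event_type": "Toy:Other"}]},
]

counts = count_event_types(toy_records)
for event_type, count in counts.most_common(5):
    print(f"{event_type}: {count} instances")
```

If the annotations store the type under a different key, pass that key as `type_key` after inspecting a record that has a non-empty `event_mentions` list.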

License

This dataset is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0). You are free to:

Share — copy and redistribute the material in any medium or format
Adapt — remix, transform, and build upon the material

Under the following terms:

Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made.
NonCommercial — You may not use the material for commercial purposes.

Citation

If you use this dataset in your research, please cite our paper:

@inproceedings{nguyen-etal-2024-bkee,
    title = "{BKEE}: Pioneering Event Extraction in the {V}ietnamese Language",
    author = "Nguyen, Thi-Nhung  and
      Tran, Bang Tien  and
      Luu, Trong-Nghia  and
      Nguyen, Thien Huu  and
      Nguyen, Kiem-Hieu",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.217",
    pages = "2421--2427",
    abstract = "Event Extraction (EE) is a fundamental task in information extraction, aimed at identifying events and their associated arguments within textual data. It holds significant importance in various applications and serves as a catalyst for the development of related tasks. Despite the availability of numerous datasets and methods for event extraction in various languages, there has been a notable absence of a dedicated dataset for the Vietnamese language. To address this limitation, we propose BKEE, a novel event extraction dataset for Vietnamese. BKEE encompasses over 33 distinct event types and 28 different event argument roles, providing a labeled dataset for entity mentions, event mentions, and event arguments on 1066 documents. Additionally, we establish robust baselines for potential downstream tasks on this dataset, facilitating the analysis of challenges and future development prospects in the field of Vietnamese event extraction.",
}
