LegalDatasets

This repository serves as a collection of scrapers for procuring and structuring various legal datasets.

We want to link to existing legal datasets and to prepare new ones. These datasets can then be used for many downstream tasks, such as pretraining language models or judgment prediction.

Pretraining Datasets

Each of the pretraining datasets will be saved in jsonl format with the following fields:

id: unique identifier for the document (uuid5 if not present yet)
type: type of the document (e.g. legislation, caselaw, commentary)
language: language of the document
jurisdiction: jurisdiction of the document (e.g. germany)
title: title of the document
date: date of the document
url: url of the document
metadata: additional metadata of the document (as a json object)
text: the text of the document

These pretraining datasets will be used to train the language models.
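As a rough illustration of the record layout listed above, the following Python sketch builds one record and serializes it as a single jsonl line. The field names come from the list; everything else is an assumption made for the example: the helper names `make_record` and `to_jsonl_line`, the sample values, the ISO date string, the two-letter language code, and in particular the choice to derive the uuid5 from the document URL in the URL namespace (the list only says "uuid5 if not present yet").

```python
import json
import uuid

# Field names taken from the list above, in the same order.
FIELDS = ("id", "type", "language", "jurisdiction", "title",
          "date", "url", "metadata", "text")


def make_record(doc: dict) -> dict:
    """Return a dict with exactly the fields above (hypothetical helper)."""
    record = {field: doc.get(field) for field in FIELDS}
    if not record["id"]:
        # Assumption: the uuid5 is derived from the URL in the URL namespace.
        # The actual scrapers may use a different namespace or name.
        record["id"] = str(uuid.uuid5(uuid.NAMESPACE_URL, record["url"] or ""))
    if record["metadata"] is None:
        # metadata is a JSON object, so default to an empty one.
        record["metadata"] = {}
    return record


def to_jsonl_line(record: dict) -> str:
    """Serialize one record as one line of jsonl (hypothetical helper)."""
    # ensure_ascii=False keeps non-ASCII characters in legal text readable.
    return json.dumps(record, ensure_ascii=False)


# Made-up sample document, for illustration only.
example = make_record({
    "type": "legislation",
    "language": "de",
    "jurisdiction": "germany",
    "title": "Beispielgesetz",
    "date": "2020-01-01",
    "url": "https://example.org/law/1",
    "text": "§ 1 Beispieltext.",
})
line = to_jsonl_line(example)
```

Writing a dataset then amounts to writing one such line per document to a file, each terminated by a newline. Deriving the id from a stable property such as the URL keeps ids reproducible across scraper runs, which is the main reason to prefer uuid5 over a random uuid here.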

Finetuning Datasets

We select a few (10 to 20) datasets to form a large-scale multilingual, multi-jurisdictional benchmark (LEXTREME) for finetuning.

