Unicorn


This repository contains the source code for the SIGMOD 2023 paper "Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration". The paper introduces Unicorn, a unified model for generally supporting common data matching tasks. This unified model enables knowledge sharing by learning from multiple tasks and multiple datasets, and supports zero-shot prediction for new tasks with zero labeled matching/non-matching pairs. Unicorn employs one generic Encoder that converts any pair of data elements (a, b) into a learned representation, and uses a Matcher, a binary classifier, to decide whether (a, b) is a match. Unicorn adopts a mixture-of-experts (MoE) module that refines the learned representation into a better one, further boosting prediction performance.
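To make the pipeline concrete, here is a minimal PyTorch sketch of the Encoder → MoE → Matcher flow. It is an illustration only, not the repository's actual classes: the expert/gate design, hidden size, and the Hugging-Face-style encoder interface are all assumptions.

```python
import torch
import torch.nn as nn

class MoE(nn.Module):
    """Mixture-of-experts layer (illustrative): a softmax gate mixes the
    outputs of several expert transformations of the input representation."""
    def __init__(self, dim=768, num_experts=6):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, h):                                  # h: [batch, dim]
        weights = torch.softmax(self.gate(h), dim=-1)      # [batch, num_experts]
        expert_out = torch.stack([e(h) for e in self.experts], dim=1)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)  # enhanced representation

class UnicornSketch(nn.Module):
    def __init__(self, encoder, dim=768):
        super().__init__()
        self.encoder = encoder            # e.g. a DeBERTa-style pair encoder
        self.moe = MoE(dim)
        self.matcher = nn.Linear(dim, 2)  # binary classifier over {non-match, match}

    def forward(self, pair_inputs):
        # Assumes the encoder returns a Hugging-Face-style output; the [CLS]
        # vector stands in for the learned representation of serialized (a, b).
        h = self.encoder(**pair_inputs).last_hidden_state[:, 0]
        return self.matcher(self.moe(h))  # logits; argmax gives 0/1
```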

Code Structure

|-- data # datasets for 20 matching tasks
|-- figs # figures
|-- main.py # pre-train Unicorn under unified prediction setting with the given 20 datasets (section 5.2 in paper)
|-- main-zero.py # pre-train Unicorn under zero-shot setting (section 5.3 in paper)
|-- main-zero-ins.py # pre-train Unicorn under zero-shot setting with instructions (section 5.3 in paper)
|-- finetune.py # fine-tune Unicorn with new dataset
|-- test.py # test new dataset with the pre-trained Unicorn
|-- unicorn # code for Unicorn
    |-- dataprocess # data processing folder
        |-- dataformat.py # dataset configuration
        |-- predata.py # data processing function
    |-- model # implementation of model
        |-- encoder.py # encoder module: convert serialized (a,b) into representation
        |-- moe.py # mixture-of-experts module: convert representation into a better representation
        |-- matcher.py # matcher module: convert the representation into 0 (non-matching) / 1 (matching)
    |-- trainer # model training and evaluation functions
        |-- pretrain.py # training model function
        |-- evaluate.py # evaluation function
    |-- utils # configuration files and tools
        |-- param.py # necessary parameters
        |-- utils.py # auxiliary functions

Datasets

We publish 20 datasets covering 7 matching tasks in Unicorn. Each dataset contains train.json / valid.json / test.json; the details can be found in our paper. A minimal loading sketch follows the task list below.

  • Entity Matching
    • em-wa: Walmart-Amazon
    • em-ds: DBLP-Scholar
    • em-fz: Fodors-Zagats
    • em-ia: iTunes-Amazon
    • em-beer: Beer
  • Column Type Annotation
    • efthymiou: Efthymiou
    • t2d_col_type_anno: T2D
    • Limaye_col_type_anno: Limaye
  • Entity Linking
    • t2d: T2D
    • Limaye: Limaye
  • String Matching
    • smurf-addr: Address
    • smurf-names: Names
    • smurf-res: Researchers
    • smurf-prod: Product
    • smurf-cit: Citation
  • Schema Matching
    • fabricated_dataset: FabricatedDatasets
    • DeepMDatasets: DeepMDatasets
  • Ontology Matching
    • Illinois-onm: Cornell-Washington
  • Entity Alignment
    • dbp_yg: SRPRS: DBP-YG
    • dbp_wd: SRPRS: DBP-WD
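A hedged loading sketch, assuming each dataset lives under data/<name>/ with the three split files named as above (the exact per-example schema is configured in unicorn/dataprocess/dataformat.py and is not reproduced here):

```python
import json
from pathlib import Path

def load_split(dataset: str, split: str, root: str = "data"):
    """Load one split of a dataset; assumes each split file is a single
    JSON document (commonly a list of examples)."""
    path = Path(root) / dataset / f"{split}.json"
    with open(path, encoding="utf-8") as f:
        return json.load(f)

train_examples = load_split("em-wa", "train")  # Walmart-Amazon entity matching
print(type(train_examples), len(train_examples))
```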

Quick Start

Step 1: Requirements

  • Before running the code, please make sure your Python version is 3.6.5 and your CUDA version is 11.1. Then install the necessary packages:
  • pip install -r requirements.txt
  • pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
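Optionally (an extra sanity check, not part of the original setup steps), you can verify that the installed PyTorch build matches the expected version and can see the GPU:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"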

Step 2: Run

Pre-train Unicorn with the given datasets

  • Run the script for Unicorn:
python main.py --pretrain --model deberta_base
  • Run the script for Unicorn++:
python main.py --pretrain --model deberta_base --shuffle 1 --load_balance 1 --modelname UnicornPlus
  • Run the script for Unicorn Zero-shot:
python main-zero.py --pretrain --model deberta_base
  • Run the script for Unicorn Zero-shot with instructions:
python main-zero-ins.py --pretrain --model deberta_base

After pre-training, a checkpoint folder is generated and the three modules of the model are saved as encoder.pt, moe.pt, and cls.pt. If you do not want to pre-train yourself, you can download our pre-trained models directly from HuggingFace and save them in the checkpoint folder.
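A hedged sketch of restoring the three modules, assuming the .pt files were written with torch.save by the training script (whether they hold whole modules or state dicts, and whether they sit in a per-model subfolder, depends on how pre-training was run):

```python
import torch

ckpt_dir = "checkpoint"  # hypothetical location; adjust to where the .pt files were saved

encoder = torch.load(f"{ckpt_dir}/encoder.pt", map_location="cpu")  # generic pair encoder
moe = torch.load(f"{ckpt_dir}/moe.pt", map_location="cpu")          # mixture-of-experts layer
matcher = torch.load(f"{ckpt_dir}/cls.pt", map_location="cpu")      # binary Matcher head
```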

Fine-tune the model with your dataset

python finetune.py --load --ckpt UnicornPlus --model deberta_base --train_dataset_path "train_file_path1.json train_file_path2.json ..." --valid_dataset_path "valid_file_path1.json valid_file_path2.json ..." --test_dataset_path "test_file_path1.json test_file_path2.json ..." --train_metrics "f1 f1 ..." --test_metrics "f1 f1 ..." --modelname UnicornPlusNew
  • This script loads the pre-trained model UnicornPlus, fine-tunes it on the training data given by --train_dataset_path, and saves the result as a new model, UnicornPlusNew.
  • Note that --train_dataset_path is required, while --valid_dataset_path and --test_dataset_path are optional.
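For example, with a single new dataset and F1 as the metric (the file paths below are placeholders for your own data):
python finetune.py --load --ckpt UnicornPlus --model deberta_base --train_dataset_path "data/my_task/train.json" --valid_dataset_path "data/my_task/valid.json" --test_dataset_path "data/my_task/test.json" --train_metrics "f1" --test_metrics "f1" --modelname UnicornPlusNew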

Load the model and test directly

python test.py --load --ckpt UnicornPlus --model deberta_base --dataset_path "test_file_path1.json test_file_path2.json ..." --test_metrics "f1 f1 ..."
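For example, to evaluate the fine-tuned model from the previous step on a single test file (the path again is a placeholder):
python test.py --load --ckpt UnicornPlusNew --model deberta_base --dataset_path "data/my_task/test.json" --test_metrics "f1"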
