Merged
5 changes: 5 additions & 0 deletions .github/CODEOWNERS
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
# Code owners for repository
# Format: <file pattern> <owner>
/pipelines/ @hazelian0619
/data/processed/ @hazelian0619
/docs/ @hazelian0619
19 changes: 19 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,19 @@
---
name: Bug report
about: Create a report to help us improve
---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1.
2.
3.

**Expected behavior**
A clear and concise description of what you expected to happen.

**Additional context**
Add any other context about the problem here.
14 changes: 14 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,14 @@
## Summary
Brief description of the changes and motivation.

## Changes
- What changed
- Why it is needed

## Validation
- Run `tools/kg_validate_table.py` for affected tables and attach validation report(s)

## Checklist
- [ ] My code follows the project's style
- [ ] Validation report attached for data changes
- [ ] CHANGELOG updated (if applicable)
42 changes: 42 additions & 0 deletions .github/workflows/data-qa.yml
@@ -0,0 +1,42 @@
name: Data QA

on:
  pull_request:
    paths:
      - 'data/**'
      - 'pipelines/**'
      - 'tools/**'
      - 'docs/**'
  push:
    branches: [ main ]
    paths:
      - 'data/**'
      - 'pipelines/**'
      - 'tools/**'
      - 'docs/**'

jobs:
  validate-protein:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Run protein validation
        id: validate
        run: |
          mkdir -p build/validate
          python3 tools/kg_validate_table.py \
            --contract pipelines/protein/contracts/protein_master_v6.json \
            --table data/processed/protein_master_v6_clean.tsv \
            --out build/validate/protein_master_v6_report.json

      - name: Upload validation report
        uses: actions/upload-artifact@v4
        with:
          name: protein_master_v6_report
          path: build/validate/protein_master_v6_report.json
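
To reproduce this CI step locally before pushing, the same command can be wrapped in a small script. The sketch below is illustrative only. It reuses the script path and the `--contract`, `--table`, and `--out` flags shown in the workflow. The helper names `build_validate_command` and `run_validation` are hypothetical, and the code assumes that `tools/kg_validate_table.py` signals failure with a non-zero exit code, which the workflow does not confirm.

```python
import subprocess
import sys
from pathlib import Path

# Paths copied from the workflow's validation step.
CONTRACT = Path("pipelines/protein/contracts/protein_master_v6.json")
TABLE = Path("data/processed/protein_master_v6_clean.tsv")
REPORT = Path("build/validate/protein_master_v6_report.json")


def build_validate_command(contract: Path, table: Path, out: Path) -> list[str]:
    """Return the argv list for the validator, mirroring the CI invocation."""
    return [
        sys.executable,
        "tools/kg_validate_table.py",
        "--contract", str(contract),
        "--table", str(table),
        "--out", str(out),
    ]


def run_validation(contract: Path = CONTRACT, table: Path = TABLE, out: Path = REPORT) -> int:
    """Create the report directory (like `mkdir -p`) and run the validator."""
    out.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: a non-zero return code means the table failed validation.
    completed = subprocess.run(build_validate_command(contract, table, out))
    return completed.returncode
```

Calling `run_validation()` from the repository root would write the report to the same path that the workflow uploads as an artifact.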
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,10 @@
# Changelog

All notable changes to this project will be documented in this file.

## [Unreleased]
- Standardize documentation and add CI data validation workflow

## [v6] - 2025-10-26
- Primary protein entity table `protein_master_v6_clean.tsv` (19,135 rows × 33 cols)
- Added gene ID fields and AlphaFold v6 updates
7 changes: 7 additions & 0 deletions CODE_OF_CONDUCT.md
@@ -0,0 +1,7 @@
# Code of Conduct

This project follows the Contributor Covenant v2.0. All contributors and maintainers are expected to uphold these standards.

Be respectful and collaborative. Unacceptable behavior will not be tolerated and may result in removal from project discussions or contributions.

Report conduct issues by opening an issue or contacting repository maintainers privately.
25 changes: 25 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,25 @@
# Contributing

Thank you for contributing to Protian Entity. This document explains development, validation, and release practices used by the project.

Development workflow
- Branching: create a branch with a `chore/` prefix for maintenance work or a `feat/` prefix for feature work (e.g., `chore/standardize-docs-ci`).
- Tests & validation: run validation before opening a PR:

```bash
mkdir -p build/validate  # create the report directory, as the CI workflow does
python3 tools/kg_validate_table.py --contract pipelines/protein/contracts/protein_master_v6.json \
  --table data/processed/protein_master_v6_clean.tsv --out build/validate/protein_master_v6_report.json
```

- Commit messages: use clear, imperative messages. Follow conventional commits if possible.

Pull requests
- Open PRs against `main`. Describe the change, test steps, and link to any data releases.
- Include validation reports for any changes to entity tables.

Releases
- Release data artifacts (large L1 tables) via GitHub Releases.
- Attach `manifest.json` with checksums, row counts, git commit SHA, and QA reports.
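
The manifest described above could be assembled along the following lines. This is a minimal sketch, not the project's release tooling: the key names (`git_commit`, `build_timestamp`, `files`, `sha256`, `rows`) and the function names are assumptions, SHA-256 is chosen here only as a common checksum, and row counting assumes one record per line with a single header row.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large tables need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_data_rows(path: Path) -> int:
    """Count non-empty lines after the header (assumes no embedded newlines)."""
    with path.open("r", encoding="utf-8") as handle:
        next(handle, None)  # skip the header row
        return sum(1 for line in handle if line.strip())


def build_manifest(tables: list[Path], git_commit: str) -> dict:
    """Collect checksum and row count per table plus build provenance."""
    return {
        "git_commit": git_commit,
        "build_timestamp": datetime.now(timezone.utc).isoformat(),
        "files": [
            {"name": p.name, "sha256": sha256_of(p), "rows": count_data_rows(p)}
            for p in tables
        ],
    }


def write_manifest(manifest: dict, out: Path) -> None:
    """Serialize the manifest as indented JSON."""
    out.write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8")
```

The commit SHA is passed in as a plain string so the sketch does not depend on how the release job obtains it.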

Contacts
- For maintenance and code ownership see `CODEOWNERS` or raise an issue.
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 hazelian0619

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
18 changes: 18 additions & 0 deletions README.cn.md
@@ -0,0 +1,18 @@
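# Human Knowledge Graph Dataset (Protein + RNA)

This repository is organized as an "industrial-grade data product":

- **Code, specifications, and quality reports** live in the repository (auditable, reproducible)
- **Large data artifacts** are published through **GitHub Releases** (downloadable, verifiable, revertible)

## Quick start (this section is all a colleague needs)

- **Protein (L1) dataset**: `data/processed/protein_master_v6_clean.tsv` (downloadable directly from the repository)
- **RNA (L1, v1) dataset**: Release `rna-l1-v1` (contains `.tsv.gz` + `manifest.json` + QA reports)
- Release: https://github.com/hazelian0619/protian-entity/releases/tag/rna-l1-v1
- RNA usage guide: `pipelines/rna/README.md`
- RNA specification: `docs/rna/README.md`

---

(The original README content is preserved in Chinese.)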
117 changes: 40 additions & 77 deletions README.md
@@ -1,84 +1,47 @@
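# Human Knowledge Graph Dataset (Protein + RNA)

This repository is organized as an "industrial-grade data product":

- **Code, specifications, and quality reports** live in the repository (auditable, reproducible)
- **Large data artifacts** are published through **GitHub Releases** (downloadable, verifiable, revertible)

## Quick start (this section is all a colleague needs)

- **Protein (L1) dataset**: `data/processed/protein_master_v6_clean.tsv` (downloadable directly from the repository)
- **RNA (L1, v1) dataset**: Release `rna-l1-v1` (contains `.tsv.gz` + `manifest.json` + QA reports)
- Release: https://github.com/hazelian0619/protian-entity/releases/tag/rna-l1-v1
- RNA usage guide: `pipelines/rna/README.md`
- RNA specification: `docs/rna/README.md`

---

## 🧬 Protein entities (L1)

A high-quality, protein-centric dataset that integrates multiple sources, including UniProt, AlphaFold, HGNC, and STRING.

### 📊 Data overview

| Item | Count / coverage | Notes |
|------|------------|------|
| **Total proteins** | 19,135 | Human proteins after deduplication |
| **Fields** | 33 | Full set of information fields |
| **Gene ID mapping** | 99.6% | NCBI + Ensembl |
| **AlphaFold structures** | 99.7% | Includes quality scores |
| **Functional annotation** | 86% | Includes evidence codes and literature |
| **GO annotation** | 82-94% | Three aspects |
| **PDB experimental structures** | 44.3% | Experimentally determined structures |

**Primary data file**: `data/processed/protein_master_v6_clean.tsv` (60 MB, 19,135 rows × 33 columns)

---

### ✅ Core fields (33 columns)

#### Basic information
`uniprot_id` | `protein_name` | `gene_names` | `sequence` | `mass`

#### Functional annotation
`function` | `subcellular_location` | `diseases` | `ptms`

#### GO annotation
`go_biological_process` | `go_molecular_function` | `go_cellular_component`

#### Gene IDs
`ncbi_gene_id` | `ensembl_gene_id` | `hgnc_id` | `symbol` | `gene_synonyms`

#### Structural information
`alphafold_id` | `alphafold_mean_plddt` | `pdb_ids` | `domains`

#### Interaction data
`string_ids` | `keywords`

---

### 🚀 Quick usage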

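```python
import pandas as pd

# Load the primary protein table (tab-separated).
df = pd.read_csv('data/processed/protein_master_v6_clean.tsv', sep='\t')

# Substring match on gene_names, so genes such as TP53BP1 also match;
# na=False treats missing gene names as non-matches.
tp53 = df[df['gene_names'].str.contains('TP53', na=False)]
print(tp53[['uniprot_id', 'ncbi_gene_id', 'alphafold_mean_plddt']])
```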
# Protian Entity — Human Protein & RNA Knowledge Graph (Industrial Data Product)

This repository contains curated human Protein and RNA entity datasets and the code, contracts, and QA artifacts required to build, validate, and release them as industrial-grade data products.

Key principles
- Code, contracts, and QA reports are tracked in Git for auditability and reproducibility.
- Large data artifacts (L1 tables) are published via GitHub Releases with a manifest and checksums.

Quick links
- Protein (L1) dataset: `data/processed/protein_master_v6_clean.tsv`
- RNA (L1) dataset: release `rna-l1-v1` (see `pipelines/rna/README.md`)
- Validation tool: `tools/kg_validate_table.py`

Repository layout
- `data/processed/` — final, curated TSV tables (small-to-medium L1 tables are stored here when size permits)
- `pipelines/` — extraction and ETL pipelines (e.g., `pipelines/rna/`)
- `docs/` — design documents, data dictionary, quality gate definitions
- `scripts/`, `tools/` — helper and validation scripts

Data release model
1. Build entity tables using `pipelines/` scripts in a reproducible environment.
2. Produce `manifest.json` (checksums, row counts, git commit, build timestamp).
3. Publish artifacts as a GitHub Release and attach QA reports.

Getting started
1. Clone repository
2. Review `docs/DATA_DICTIONARY.md` and `docs/QUALITY_GATES.md` for schema and validation rules
3. Run validation example:

```bash
mkdir -p build/validate  # create the report directory, as the CI workflow does
python3 tools/kg_validate_table.py --contract pipelines/protein/contracts/protein_master_v6.json \
  --table data/processed/protein_master_v6_clean.tsv --out build/validate/protein_master_v6_report.json
```
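
Before running the full validator, a quick shape check against the documented size of the v6 table (19,135 rows × 33 columns) can catch a truncated or partially downloaded file. The following is a rough sketch using only the standard library. The function name `table_shape` and the constants are hypothetical, and the check says nothing about whether the values themselves are valid.

```python
import csv
from pathlib import Path

EXPECTED_ROWS = 19_135  # documented row count for the v6 table
EXPECTED_COLS = 33      # documented column count for the v6 table


def table_shape(path: Path) -> tuple[int, int]:
    """Return (data_rows, columns) for a tab-separated file with a header row."""
    with path.open("r", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE treats quote characters as ordinary text in TSV fields.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, [])
        rows = sum(1 for row in reader if row)  # ignore blank lines
    return rows, len(header)


def shape_matches(path: Path) -> bool:
    """Compare the file's shape with the documented v6 dimensions."""
    return table_shape(path) == (EXPECTED_ROWS, EXPECTED_COLS)
```

For example, `shape_matches(Path("data/processed/protein_master_v6_clean.tsv"))` should be true for an intact copy of the v6 table if the documented counts are accurate.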

---
Contributing
- See `CONTRIBUTING.md` for development, testing, and release workflow.

### 📁 Auxiliary data
License
- MIT License — see `LICENSE` for details.

Contact
- Repository owner: hazelian0619
- Project maintenance: see `CODEOWNERS` or `CONTRIBUTING.md` for maintainers and contact instructions

```
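data/processed/
├── alphafold_quality.tsv # AlphaFold per-residue quality
├── protein_edges.tsv # STRING interaction network (about 880,000 entries)
├── ptm_sites.tsv # Post-translational modifications (about 230,000 entries)
└── pathway_members.tsv # Pathway members (about 120,000 entries)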
```

---

11 changes: 11 additions & 0 deletions data/processed/README.cn.md
@@ -0,0 +1,11 @@
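# Project 1025 - Processed Data Notes

This directory contains the core datasets of the human protein knowledge graph, integrating information from several authoritative bioinformatics databases.

**Data updated**: 2025-10-26
**Data version**: v6
**Species**: Homo sapiens (human, Taxonomy ID: 9606)

---

(The original content is preserved in Chinese.)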
13 changes: 6 additions & 7 deletions data/processed/README.md
@@ -1,12 +1,11 @@
# Project 1025 - Processed Data Notes
# Processed data — Protian Entity (Human Protein Knowledge Graph)

## Overview
This folder contains the final, curated data tables used as L1 entity products. Files are UTF-8 encoded TSVs and carry provenance information (source, fetch date, source_version).

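This directory contains the core datasets of the human protein knowledge graph, integrating information from several authoritative bioinformatics databases.

**Data updated**: 2025-10-26
**Data version**: v6
**Species**: Homo sapiens (human, Taxonomy ID: 9606)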
Summary
- Data snapshot date: 2025-10-26
- Version: v6
- Species: Homo sapiens (Taxonomy ID: 9606)

---

14 changes: 14 additions & 0 deletions docs/DATA_DICTIONARY.cn.md
@@ -0,0 +1,14 @@
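# Data Dictionary

## Overview

This document describes every field of the primary table protein_master_v6_clean.tsv in detail, including data type, source, description, and null-value status.

**Primary table**: protein_master_v6_clean.tsv
**Rows**: 19,135
**Columns**: 33
**Last updated**: 2025-10-26

---

(The original text is preserved in Chinese.)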
24 changes: 14 additions & 10 deletions docs/DATA_DICTIONARY.md
@@ -1,17 +1,21 @@
# Data Dictionary
# Data Dictionary — `protein_master_v6_clean.tsv`

## Overview
This document describes the schema and field-level details for the primary protein entity table `protein_master_v6_clean.tsv` (v6 snapshot).

This document describes every field of the primary table protein_master_v6_clean.tsv in detail, including data type, source, description, and null-value status.
Summary
- Rows: 19,135
- Columns: 33
- Snapshot date: 2025-10-26

**Primary table**: protein_master_v6_clean.tsv
**Rows**: 19,135
**Columns**: 33
**Last updated**: 2025-10-26
Field categories
- Core identifiers: `uniprot_id`, `entry_name`, `protein_name`, `symbol`, `hgnc_id`
- Sequence: `sequence`, `sequence_len`, `mass`
- Cross references and gene IDs: `ncbi_gene_id`, `ensembl_gene_id`, `ensembl_transcript_id`, `gene_synonyms`
- Functional annotations: `function`, `go_biological_process`, `go_molecular_function`, `go_cellular_component`
- Structural information: `pdb_ids`, `alphafold_pdb_url`, `alphafold_mean_plddt`
- Localization, PTMs, and other annotations: `subcellular_location`, `ptms`, `diseases`, `domains`, `isoforms`
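
One way to keep this summary and the table in sync is to check that every field named above appears in the TSV header. The sketch below is illustrative and is not part of `tools/kg_validate_table.py`. The names `DOCUMENTED_FIELDS`, `read_header`, and `missing_fields` are hypothetical, and the list covers only the 24 fields named in this summary, not all 33 columns.

```python
import csv
from pathlib import Path

# Field names copied from the categories listed in this summary.
DOCUMENTED_FIELDS = [
    "uniprot_id", "entry_name", "protein_name", "symbol", "hgnc_id",
    "sequence", "sequence_len", "mass",
    "ncbi_gene_id", "ensembl_gene_id", "ensembl_transcript_id", "gene_synonyms",
    "function", "go_biological_process", "go_molecular_function",
    "go_cellular_component",
    "pdb_ids", "alphafold_pdb_url", "alphafold_mean_plddt",
    "subcellular_location", "ptms", "diseases", "domains", "isoforms",
]


def read_header(path: Path) -> list[str]:
    """Read only the first row of a tab-separated file."""
    with path.open("r", encoding="utf-8", newline="") as handle:
        return next(csv.reader(handle, delimiter="\t"), [])


def missing_fields(header: list[str], expected: list[str] = DOCUMENTED_FIELDS) -> list[str]:
    """Return the expected field names that are absent from the header, in order."""
    present = set(header)
    return [name for name in expected if name not in present]
```

An empty result from `missing_fields(read_header(Path("data/processed/protein_master_v6_clean.tsv")))` would indicate that the summary matches the table. Any names it returns point to a field that was renamed or dropped.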

---

## 字段详细说明
For full, field-level descriptions and examples see the original Chinese doc preserved in `docs/DATA_DICTIONARY.cn.md`.

### 1. Basic identifier fields
