Merged
5 changes: 5 additions & 0 deletions .github/CODEOWNERS
@@ -0,0 +1,5 @@
# Code owners for repository
# Format: <file pattern> <owner>
/pipelines/ @hazelian0619
/data/processed/ @hazelian0619
/docs/ @hazelian0619
19 changes: 19 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,19 @@
---
name: Bug report
about: Create a report to help us improve
---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1.
2.
3.

**Expected behavior**
A clear and concise description of what you expected to happen.

**Additional context**
Add any other context about the problem here.
14 changes: 14 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,14 @@
## Summary
Brief description of the changes and motivation.

## Changes
- What changed
- Why it is needed

## Validation
- Run `tools/kg_validate_table.py` for affected tables and attach validation report(s)

## Checklist
- [ ] My code follows the project's style
- [ ] Validation report attached for data changes
- [ ] CHANGELOG updated (if applicable)
42 changes: 42 additions & 0 deletions .github/workflows/data-qa.yml
@@ -0,0 +1,42 @@
name: Data QA

on:
  pull_request:
    paths:
      - 'data/**'
      - 'pipelines/**'
      - 'tools/**'
      - 'docs/**'
  push:
    branches: [ main ]
    paths:
      - 'data/**'
      - 'pipelines/**'
      - 'tools/**'
      - 'docs/**'

jobs:
  validate-protein:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Run protein validation
        id: validate
        run: |
          mkdir -p build/validate
          python3 tools/kg_validate_table.py \
            --contract pipelines/protein/contracts/protein_master_v6.json \
            --table data/processed/protein_master_v6_clean.tsv \
            --out build/validate/protein_master_v6_report.json

      - name: Upload validation report
        uses: actions/upload-artifact@v4
        with:
          name: protein_master_v6_report
          path: build/validate/protein_master_v6_report.json
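
The validator this workflow calls, `tools/kg_validate_table.py`, is not included in the diff, so its behavior is not visible here. As a rough illustration of what a contract check of this kind might do, the Python sketch below assumes a hypothetical contract layout with `columns`, `required_non_null`, and `primary_key` keys; the real `protein_master_v6.json` may be structured differently.

```python
import csv
import json
from pathlib import Path


def validate_table(contract: dict, table_path: str) -> dict:
    """Check a TSV file against a contract dict and return a report dict."""
    errors = []
    seen_keys = set()
    row_count = 0
    key = contract.get("primary_key")  # hypothetical contract field
    with open(table_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = reader.fieldnames or []
        # Every column the contract names must appear in the header.
        missing = [col for col in contract.get("columns", []) if col not in header]
        if missing:
            errors.append(f"missing columns: {', '.join(missing)}")
        for row_count, row in enumerate(reader, start=1):
            # Required fields (hypothetical contract field) must be non-empty.
            for col in contract.get("required_non_null", []):
                if not (row.get(col) or "").strip():
                    errors.append(f"row {row_count}: empty value in {col}")
            # The primary key, if declared, must be unique.
            if key:
                value = row.get(key)
                if value in seen_keys:
                    errors.append(f"row {row_count}: duplicate {key} {value!r}")
                seen_keys.add(value)
    return {"table": str(table_path), "rows": row_count,
            "passed": not errors, "errors": errors}


def write_report(report: dict, out_path: str) -> None:
    """Write the report as JSON, creating parent directories as needed."""
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(report, indent=2), encoding="utf-8")
```

Returning a report dict instead of raising on the first problem lets a CI step upload a complete report even when validation fails.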
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,10 @@
# Changelog

All notable changes to this project will be documented in this file.

## [Unreleased]
- Standardize documentation and add CI data validation workflow

## [v6] - 2025-10-26
- Primary protein entity table `protein_master_v6_clean.tsv` (19,135 rows × 33 cols)
- Added gene ID fields and AlphaFold v6 updates
7 changes: 7 additions & 0 deletions CODE_OF_CONDUCT.md
@@ -0,0 +1,7 @@
# Code of Conduct

This project follows the Contributor Covenant v2.0. All contributors and maintainers are expected to uphold these standards.

Be respectful and collaborative. Unacceptable behavior will not be tolerated and may result in removal from project discussions or contributions.

Report conduct issues by opening an issue or contacting repository maintainers privately.
25 changes: 25 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,25 @@
# Contributing

Thank you for contributing to Protian Entity. This document explains development, validation, and release practices used by the project.

Development workflow
- Branching: name branches with the `chore/` prefix for non-breaking changes and the `feat/` prefix for feature work (e.g., `chore/standardize-docs-ci`).
- Tests & validation: run validation before opening a PR:

```bash
mkdir -p build/validate  # ensure the report directory exists, as the CI workflow does
python3 tools/kg_validate_table.py --contract pipelines/protein/contracts/protein_master_v6.json \
  --table data/processed/protein_master_v6_clean.tsv --out build/validate/protein_master_v6_report.json
```

- Commit messages: use clear, imperative messages. Follow conventional commits if possible.

Pull requests
- Open PRs against `main`. Describe the change, test steps, and link to any data releases.
- Include validation reports for any changes to entity tables.

Releases
- Release data artifacts (large L1 tables) via GitHub Releases.
- Attach `manifest.json` with checksums, row counts, git commit SHA, and QA reports.
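
The manifest format itself is not defined in this PR. As one possible shape, the Python sketch below collects the items listed above (checksums, row counts, git commit SHA) for uncompressed TSV files. The key names (`git_commit`, `build_timestamp`, `files`, `sha256`, `rows`, `bytes`) are assumptions made for illustration, not the project's actual schema.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_rows(path: Path) -> int:
    """Count data rows in an uncompressed TSV, excluding the header line."""
    with path.open("rb") as handle:
        lines = sum(1 for _ in handle)
    return max(lines - 1, 0)


def build_manifest(files, commit_sha: str) -> dict:
    """Assemble a manifest dict (hypothetical layout) for the given files."""
    return {
        "git_commit": commit_sha,
        "build_timestamp": datetime.now(timezone.utc).isoformat(),
        "files": [
            {
                "name": path.name,
                "sha256": sha256_of(path),
                "rows": count_rows(path),
                "bytes": path.stat().st_size,
            }
            for path in map(Path, files)
        ],
    }
```

The commit SHA can be taken from `git rev-parse HEAD` and the result serialized with `json.dumps(manifest, indent=2)`. Compressed `.tsv.gz` artifacts would need `gzip.open` for the row count.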

Contacts
- For maintenance and code ownership see `CODEOWNERS` or raise an issue.
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 hazelian0619

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
35 changes: 35 additions & 0 deletions README.cn.md
@@ -0,0 +1,35 @@
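# Human Protein and RNA Knowledge Graph (Industrial-Grade Data Product)

This repository follows industrial-grade data product practices: auditable ETL code, data contracts, and QA reports. Large L1 datasets are published via GitHub Releases together with `manifest.json` and validation reports.

Quick index
- Core protein table (L1): `data/processed/protein_master_v6_clean.tsv` (v6 snapshot)
- RNA (L1) data is published as a release: `rna-l1-v1` (contains `.tsv.gz`, `manifest.json`, and QA reports)
- Validation tool: `tools/kg_validate_table.py`

Quick start
```bash
# Validate the protein master table and generate a report
python3 tools/kg_validate_table.py \
  --contract pipelines/protein/contracts/protein_master_v6.json \
  --table data/processed/protein_master_v6_clean.tsv \
  --out build/validate/protein_master_v6_report.json
```

CI and non-interactive runs
- CI jobs should run in non-interactive mode (`--yes` / `--ci` or `CI=true`).
- Administrators can refer to `docs/GITHUB_ACTIONS_SETTINGS.md` to configure the repository's Actions permissions (allowing workflows from forks carries security risks; proceed with caution).

Data release process
1. Use `pipelines/` to generate the L1 data and `manifest.json` (including row counts, checksums, commit SHA, and build timestamp).
2. Run all validation contracts and save the reports.
3. Create a GitHub Release and upload the data package together with `manifest.json` and the QA reports.

Contributing and governance
- See `CONTRIBUTING.md` for branching, PR, validation, and release requirements. Attach a validation report when a change touches data tables or contracts.

License
- MIT (see `LICENSE`)

Contact
- Repository owner: `@hazelian0619`, or open an issue / PR in the repository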
150 changes: 46 additions & 104 deletions README.md
@@ -1,114 +1,56 @@
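# Human Knowledge Graph Dataset (Protein + RNA)

This repository is organized as an "industrial-grade data product":

- **Code / specifications / quality reports** live in the repository (auditable, reproducible)
- **Large data artifacts** are published via **GitHub Releases** (downloadable, verifiable, revertible)

## Quick entry points (this section is all a colleague needs to read)

- **Protein (L1) dataset**: `data/processed/protein_master_v6_clean.tsv` (downloadable directly from the repository)
- **RNA (L1, v1) dataset**: Release `rna-l1-v1` (contains `.tsv.gz` + `manifest.json` + QA reports)
  - Release: https://github.com/hazelian0619/protian-entity/releases/tag/rna-l1-v1
  - RNA usage guide: `pipelines/rna/README.md`
  - RNA specification: `docs/rna/README.md`

---

## 🧬 Protein entity (L1)

Builds a high-quality, protein-centric dataset that integrates multiple sources, including UniProt, AlphaFold, HGNC, and STRING.

### 📊 Data overview

| Item | Count / coverage | Notes |
|------|------------|------|
| **Total proteins** | 19,135 | Deduplicated human proteins |
| **Fields** | 33 | Full set of information fields |
| **Gene ID mapping** | 99.6% | NCBI + Ensembl |
| **AlphaFold structures** | 99.7% | Includes quality scores |
| **Functional annotation** | 86% | Includes evidence codes + literature |
| **GO annotation** | 82-94% | Three aspects |
| **PDB experimental structures** | 44.3% | Experimentally determined structures |

**Main data file**: `data/processed/protein_master_v6_clean.tsv` (60MB, 19,135 rows × 33 columns)

---

### ✅ Core fields (33 columns)

#### Basic information
`uniprot_id` | `protein_name` | `gene_names` | `sequence` | `mass`

#### Functional annotation
`function` | `subcellular_location` | `diseases` | `ptms`

#### GO annotation
`go_biological_process` | `go_molecular_function` | `go_cellular_component`

#### Gene IDs
`ncbi_gene_id` | `ensembl_gene_id` | `hgnc_id` | `symbol` | `gene_synonyms`

#### Structure information
`alphafold_id` | `alphafold_mean_plddt` | `pdb_ids` | `domains`

#### Interaction data
`string_ids` | `keywords`

---

### 🚀 Quick usage

```python
import pandas as pd

# Load the main protein table (tab-separated)
df = pd.read_csv('data/processed/protein_master_v6_clean.tsv', sep='\t')

# Substring match on gene_names; this also matches names such as TP53BP1
tp53 = df[df['gene_names'].str.contains('TP53', na=False)]
print(tp53[['uniprot_id', 'ncbi_gene_id', 'alphafold_mean_plddt']])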
# Protian Entity — Human Protein & RNA Knowledge Graph

Protian Entity is an industrial-grade data product: curated L1 entity tables (Protein, RNA) with reproducible ETL, validation contracts, and QA artifacts.

Quick summary
- Primary entity: human Protein L1 table — `data/processed/protein_master_v6_clean.tsv` (v6 snapshot)
- RNA L1 artifacts are published as release assets (`rna-l1-v1`) with `manifest.json` and QA reports
- Validation tool: `tools/kg_validate_table.py`

Badges
- CI: `.github/workflows/data-qa.yml` runs data validation on pull requests and on pushes to `main`, and uploads the validation report as a workflow artifact

Contents
- `data/processed/` — curated TSV L1 tables
- `pipelines/` — ETL code and contracts
- `docs/` — schema, data dictionary, quality gates, and admin guides
- `tools/` — validation and QA helpers

Quickstart (validate current protein table)
```bash
# create the report directory, then run the table validator and write a report
mkdir -p build/validate
python3 tools/kg_validate_table.py \
  --contract pipelines/protein/contracts/protein_master_v6.json \
  --table data/processed/protein_master_v6_clean.tsv \
  --out build/validate/protein_master_v6_report.json

# open the JSON report
less build/validate/protein_master_v6_report.json
```

---
CI / Non-interactive usage
- CI runs must be non-interactive. Tools and scripts must support CI mode through either a `--yes` / `--ci` flag or the `CI=true` environment variable.
- To run locally in non-interactive mode: `CI=true ./scripts/run_full_build.sh --yes` (if present).
- See `docs/GITHUB_ACTIONS_SETTINGS.md` for repository-level Actions configuration (Admins only).
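
How individual tools implement CI mode is not shown in this PR. One way a Python tool could honor the rule above is sketched below. The function names `is_non_interactive` and `confirm` are hypothetical, and treating `1` and `yes` as equivalent to `CI=true` is an extra assumption beyond what the README states.

```python
import argparse
import os


def is_non_interactive(argv=None, env=None) -> bool:
    """Return True if --yes, --ci, or a truthy CI environment variable is set."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--yes", action="store_true", help="assume yes for all prompts")
    parser.add_argument("--ci", action="store_true", help="run in CI mode")
    # Ignore arguments that belong to the rest of the tool.
    args, _unknown = parser.parse_known_args(argv)
    env = os.environ if env is None else env
    # Assumption: accept a few common truthy spellings, not only "true".
    ci_env = env.get("CI", "").strip().lower() in {"1", "true", "yes"}
    return args.yes or args.ci or ci_env


def confirm(prompt: str, non_interactive: bool) -> bool:
    """Ask for confirmation, or answer yes automatically in CI mode."""
    if non_interactive:
        return True
    return input(f"{prompt} [y/N] ").strip().lower() == "y"
```

With this shape, every prompt in a script goes through `confirm`, so a CI job never blocks waiting on standard input.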

### 📁 Auxiliary data
Data release process
1. Run pipelines to produce L1 artifacts and `manifest.json` (row counts, checksums, git commit, build timestamp).
2. Run all validation contracts and attach reports.
3. Create a GitHub Release and upload artifacts + `manifest.json`.
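
On the consuming side, the checksums in `manifest.json` allow a downloaded release to be verified before use. The Python sketch below assumes a hypothetical manifest layout, namely a `files` list whose entries carry `name` and `sha256`, and assumes the artifacts sit next to the manifest. The real manifest schema is not shown in this PR and may differ.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_release(manifest_path) -> list:
    """Return a list of problems; an empty list means every file matched."""
    manifest_path = Path(manifest_path)
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    problems = []
    for entry in manifest.get("files", []):  # hypothetical manifest key
        artifact = manifest_path.parent / entry["name"]
        if not artifact.is_file():
            problems.append(f"{entry['name']}: missing")
            continue
        if file_sha256(artifact) != entry["sha256"]:
            problems.append(f"{entry['name']}: checksum mismatch")
    return problems
```

Reading in chunks keeps memory use flat, which matters for release assets larger than 100MB.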

```
data/processed/
├── alphafold_quality.tsv # AlphaFold per-residue quality
├── protein_edges.tsv # STRING interaction network (~880,000 edges)
├── ptm_sites.tsv # Post-translational modifications (~230,000 records)
└── pathway_members.tsv # Pathway members (~120,000 records)
```
Governance & contribution
- See `CONTRIBUTING.md` for branching, validation, and release checklists.
- Use PR templates and attach validation report artifacts for any change that modifies tables or contracts.

---
What stays out of git
- Large raw inputs and big output artifacts should be published as release assets or stored in object storage — do NOT commit >100MB files.
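
A quick local check before committing can catch oversized files early. The Python sketch below walks a working tree and lists files above a size limit. The function name `oversized_files` is hypothetical, and the default limit is set to 100 MiB, the size at which GitHub rejects individual files.

```python
from pathlib import Path

LIMIT_BYTES = 100 * 1024 * 1024  # 100 MiB


def oversized_files(root=".", limit=LIMIT_BYTES):
    """Return (path, size) pairs for files above the limit, largest first."""
    root = Path(root)
    hits = []
    for path in root.rglob("*"):
        # Skip git's own object store.
        if ".git" in path.relative_to(root).parts:
            continue
        if path.is_file():
            size = path.stat().st_size
            if size > limit:
                hits.append((path, size))
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Calling `oversized_files("data")` before `git add` gives a list of candidates to publish as release assets instead.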

### 🎯 Design principles
License
- MIT (see `LICENSE`)

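- ✅ **Primary information first**: extract the original data without inference
- ✅ **Protein-centric**: one row per UniProt ID
- ✅ **Original text preserved in full**: function descriptions include evidence codes and PubMed citations
- ✅ **Multi-source integration**: 7 major biological databases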
Contact
- Repo owner: `@hazelian0619` — open an issue or PR for changes

---

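### 📋 Project status

**Stage**: ✅ Complete \
**Version**: v6_clean \
**Updated**: 2025-10-27

**Data sources**: UniProt | AlphaFold | HGNC | STRING | GO | PDB \
**Data current as of**: 2025-10-26

---

## 🧬 RNA entity (L1, v1)

RNA (miRNA + mRNA/transcript) is an L1 entity table. Its output is large (individual files exceed 100MB), so:

- Data artifacts are not committed directly to the repository (to avoid hitting GitHub's per-file size limit and bloating the repository)
- They are published uniformly via **GitHub Releases**, together with a verifiable `manifest.json` and QA reports

- Release: https://github.com/hazelian0619/protian-entity/releases/tag/rna-l1-v1
- Usage guide: `pipelines/rna/README.md`
- Specification: `docs/rna/README.md`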
This README provides a concise starting point. For field-level schema and examples, see `docs/DATA_DICTIONARY.md` and `data/processed/README.md`.
11 changes: 11 additions & 0 deletions data/processed/README.cn.md
@@ -0,0 +1,11 @@
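# Project 1025 - Processed Data Notes

This directory contains the core datasets of the human protein knowledge graph, integrating information from multiple authoritative bioinformatics databases.

**Data updated**: 2025-10-26
**Data version**: v6
**Species**: Homo sapiens (human, Taxonomy ID: 9606)

---

(Original content retained in Chinese)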
13 changes: 6 additions & 7 deletions data/processed/README.md
@@ -1,12 +1,11 @@
# Project 1025 - Processed Data Notes
# Processed data — Protian Entity (Human Protein Knowledge Graph)

## Overview
This folder contains the final, curated data tables used as L1 entity products. Files are UTF-8 encoded TSVs and carry provenance information (source, fetch date, source_version).

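This directory contains the core datasets of the human protein knowledge graph, integrating information from multiple authoritative bioinformatics databases.

**Data updated**: 2025-10-26
**Data version**: v6
**Species**: Homo sapiens (human, Taxonomy ID: 9606)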
Summary
- Data snapshot date: 2025-10-26
- Version: v6
- Species: Homo sapiens (Taxonomy ID: 9606)

---

14 changes: 14 additions & 0 deletions docs/DATA_DICTIONARY.cn.md
@@ -0,0 +1,14 @@
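# Data Dictionary

## Overview

This document describes every field of the main table protein_master_v6_clean.tsv in detail, including data type, source, description, and null-value status.

**Main table**: protein_master_v6_clean.tsv
**Rows**: 19,135
**Columns**: 33
**Last updated**: 2025-10-26

---

(Original text retained in Chinese)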