hazelian0619 · Feb 5, 2026
diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
@@ -0,0 +1,5 @@
+# Code owners for repository
+# Format: <file pattern> <owner>
+/pipelines/ @hazelian0619
+/data/processed/ @hazelian0619
+/docs/ @hazelian0619
diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,19 @@
+---
+name: Bug report
+about: Create a report to help us improve
+---
+
+**Describe the bug**
+A clear and concise description of what the bug is.
+
+**To Reproduce**
+Steps to reproduce the behavior:
+1. 
+2. 
+3. 
+
+**Expected behavior**
+A clear and concise description of what you expected to happen.
+
+**Additional context**
+Add any other context about the problem here.
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,14 @@
+## Summary
+Brief description of the changes and motivation.
+
+## Changes
+- What changed
+- Why it is needed
+
+## Validation
+- Run `tools/kg_validate_table.py` for affected tables and attach validation report(s)
+
+## Checklist
+- [ ] My code follows the project's style
+- [ ] Validation report attached for data changes
+- [ ] CHANGELOG updated (if applicable)
diff --git a/.github/workflows/data-qa.yml b/.github/workflows/data-qa.yml
@@ -0,0 +1,42 @@
+name: Data QA
+
+on:
+  pull_request:
+    paths:
+      - 'data/**'
+      - 'pipelines/**'
+      - 'tools/**'
+      - 'docs/**'
+  push:
+    branches: [ main ]
+    paths:
+      - 'data/**'
+      - 'pipelines/**'
+      - 'tools/**'
+      - 'docs/**'
+
+jobs:
+  validate-protein:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.11'
+
+      - name: Run protein validation
+        id: validate
+        run: |
+          mkdir -p build/validate
+          python3 tools/kg_validate_table.py \
+            --contract pipelines/protein/contracts/protein_master_v6.json \
+            --table data/processed/protein_master_v6_clean.tsv \
+            --out build/validate/protein_master_v6_report.json
+
+      - name: Upload validation report
+        uses: actions/upload-artifact@v4
+        with:
+          name: protein_master_v6_report
+          path: build/validate/protein_master_v6_report.json
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,10 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+## [Unreleased]
+- Standardize documentation and add CI data validation workflow
+
+## [v6] - 2025-10-26
+- Primary protein entity table `protein_master_v6_clean.tsv` (19,135 rows × 33 cols)
+- Added gene ID fields and AlphaFold v6 updates
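Reviewer note: the v6 entry states the table dimensions (19,135 rows × 33 cols). A minimal sketch of how that claim could be spot-checked locally is below. It assumes the TSV has a single header row and one record per line; `table_shape` is a hypothetical helper for illustration and is not part of `tools/`.

```python
from pathlib import Path


def table_shape(path):
    """Return (data_rows, columns) for a tab-separated file with one header row.

    Assumption: records never contain embedded newlines, so one line is one row.
    """
    with Path(path).open(encoding="utf-8", newline="") as handle:
        # The first line is the header; its tab-separated fields give the column count.
        header = handle.readline().rstrip("\r\n").split("\t")
        # Every remaining non-blank line is counted as one data row.
        rows = sum(1 for line in handle if line.strip())
    return rows, len(header)


if __name__ == "__main__":
    # Path taken from the CHANGELOG entry; compare the result with the stated 19,135 x 33.
    print(table_shape("data/processed/protein_master_v6_clean.tsv"))
```

This only checks shape; field-level rules remain the job of `tools/kg_validate_table.py` and the contract file.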
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
@@ -0,0 +1,7 @@
+# Code of Conduct
+
+This project follows the Contributor Covenant v2.0. All contributors and maintainers are expected to uphold these standards.
+
+Be respectful and collaborative. Unacceptable behavior will not be tolerated and may result in removal from project discussions or contributions.
+
+Report conduct issues by opening an issue or contacting repository maintainers privately.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,25 @@
+# Contributing
+
+Thank you for contributing to Protian Entity. This document explains development, validation, and release practices used by the project.
+
+Development workflow
+- Branching: create a branch with a `chore/` prefix for maintenance work (docs, CI, tooling) or a `feat/` prefix for feature work (e.g., `chore/standardize-docs-ci`).
+- Tests & validation: run validation before opening a PR:
+
+```bash
+python3 tools/kg_validate_table.py --contract pipelines/protein/contracts/protein_master_v6.json \
+  --table data/processed/protein_master_v6_clean.tsv --out build/validate/protein_master_v6_report.json
+```
+
+- Commit messages: use clear, imperative messages. Follow the Conventional Commits format where possible.
+
+Pull requests
+- Open PRs against `main`. Describe the change, test steps, and link to any data releases.
+- Include validation reports for any changes to entity tables.
+
+Releases
+- Release data artifacts (large L1 tables) via GitHub Releases.
+- Attach `manifest.json` with checksums, row counts, git commit SHA, and QA reports.
+
+Contacts
+- For maintenance and code ownership, see `.github/CODEOWNERS`, or open an issue.
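Reviewer note: the Releases section above asks for a `manifest.json` with checksums, row counts, and the git commit SHA, but the diff does not show how one is produced. The sketch below is one possible shape, not the project's actual implementation. The key names (`git_commit`, `build_timestamp`, `files`, `sha256`, `rows`) and the helper names are assumptions chosen for illustration.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large release artifacts are not loaded into memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_data_rows(path):
    """Count lines after the header (assumes one record per line)."""
    with Path(path).open("rb") as handle:
        return max(sum(1 for _ in handle) - 1, 0)


def current_commit():
    """Ask git for the current commit SHA; fall back to 'unknown' outside a checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def build_manifest(tables, commit=None):
    """Assemble a manifest dict; the layout here is hypothetical."""
    return {
        "git_commit": commit or current_commit(),
        "build_timestamp": datetime.now(timezone.utc).isoformat(),
        "files": [
            {"path": str(p), "sha256": sha256_of(p), "rows": count_data_rows(p)}
            for p in tables
        ],
    }


if __name__ == "__main__":
    manifest = build_manifest(["data/processed/protein_master_v6_clean.tsv"])
    Path("manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```

Recording the SHA-256 next to the row count lets a downloader verify both integrity and completeness of a release asset without rerunning the pipeline.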
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 hazelian0619
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
diff --git a/README.cn.md b/README.cn.md
@@ -0,0 +1,35 @@
+# 人类蛋白质与 RNA 知识图谱（工业级数据产品）
+
+本仓库以工业级数据产品的实践组织：可审计的 ETL 代码、数据契约（contracts）、以及 QA 报告；大型 L1 数据以 GitHub Releases 发布并附带 `manifest.json` 与校验报告。
+
+快速索引
+- 核心蛋白质表（L1）：`data/processed/protein_master_v6_clean.tsv`（v6 快照）
+- RNA（L1）数据以 Release 发布：`rna-l1-v1`（包含 `.tsv.gz`、`manifest.json`、QA 报告）
+- 验证工具：`tools/kg_validate_table.py`
+
+快速开始
+```bash
+# 验证 protein 主表并生成报告
+python3 tools/kg_validate_table.py \
+  --contract pipelines/protein/contracts/protein_master_v6.json \
+  --table data/processed/protein_master_v6_clean.tsv \
+  --out build/validate/protein_master_v6_report.json
+```
+
+CI 与非交互运行
+- CI 作业应为非交互模式（`--yes` / `--ci` 或 `CI=true`）。
+- 管理员可参考 `docs/GITHUB_ACTIONS_SETTINGS.md` 配置仓库 Actions 权限（放行 fork 工作流会带来安全风险，请谨慎）。
+
+数据发布流程
+1. 使用 `pipelines/` 生成 L1 数据与 `manifest.json`（含 row counts、checksums、commit SHA、build timestamp）。
+2. 运行全部验证合同并保存报告。
+3. 建立 GitHub Release，上传数据包与 `manifest.json`、QA 报告。
+
+贡献与治理
+- 请参阅 `CONTRIBUTING.md` 了解分支、PR、验证与发布要求。提交修改涉及数据表或合同时请附带验证报告。
+
+许可
+- MIT — 详见 `LICENSE`
+
+联系方式
+- 仓库所有者：`@hazelian0619`，或在仓库中打开 issue / PR
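Reviewer note: both READMEs require tools to run non-interactively in CI via a `--yes` / `--ci` flag or `CI=true`. A minimal sketch of how a script might honor that rule follows. The function names are hypothetical, and whether existing tools such as `tools/kg_validate_table.py` already accept these flags is not shown in this diff.

```python
import argparse
import os


def is_non_interactive(argv=None, environ=None):
    """Return True if --yes or --ci was passed, or if CI=true is set in the environment."""
    env = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    parser.add_argument("--yes", action="store_true")
    parser.add_argument("--ci", action="store_true")
    # parse_known_args leaves the tool's other options (e.g. --contract) untouched.
    args, _unknown = parser.parse_known_args(argv)
    return args.yes or args.ci or env.get("CI", "").lower() == "true"


def confirm(prompt, non_interactive):
    """Skip the prompt entirely in CI mode; otherwise ask the user."""
    if non_interactive:
        return True
    return input(f"{prompt} [y/N] ").strip().lower() == "y"


if __name__ == "__main__":
    if confirm("Overwrite existing report?", is_non_interactive()):
        print("proceeding")
```

GitHub Actions sets `CI=true` on its runners by default, so a check like this would need no extra workflow configuration.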
diff --git a/README.md b/README.md
@@ -1,114 +1,56 @@
-# 人类知识图谱数据集（Protein + RNA）
-
-这个仓库按“工业级数据产品”的方式组织：
-
-- **代码 / 规范 / 质量报告**进入仓库（可审计、可复现）
-- **体积大的数据产物**通过 **GitHub Releases** 发布（可下载、可校验、可回滚）
-
-## 快速入口（给同事看这一段就够）
-
-- **Protein（L1）数据集**：`data/processed/protein_master_v6_clean.tsv`（仓库内可直接下载）
-- **RNA（L1, v1）数据集**：Release `rna-l1-v1`（包含 `.tsv.gz` + `manifest.json` + QA 报告）
-  - Release: https://github.com/hazelian0619/protian-entity/releases/tag/rna-l1-v1
-  - RNA 使用说明：`pipelines/rna/README.md`
-  - RNA 规范：`docs/rna/README.md`
-
----
-
-## 🧬 Protein 实体（L1）
-
-构建以蛋白质为中心的高质量数据集，整合 UniProt、AlphaFold、HGNC、STRING 等多源数据。
-
-### 📊 数据概览
-
-| 项目 | 数量/覆盖率 | 说明 |
-|------|------------|------|
-| **蛋白质总数** | 19,135 | 去重后的人类蛋白质 |
-| **字段数** | 33 | 完整信息字段 |
-| **基因ID映射** | 99.6% | NCBI + Ensembl |
-| **AlphaFold结构** | 99.7% | 含质量评分 |
-| **功能注释** | 86% | 含证据代码+文献 |
-| **GO注释** | 82-94% | 三个维度 |
-| **PDB实验结构** | 44.3% | 实验解析结构 |
-
-**主数据文件**：`data/processed/protein_master_v6_clean.tsv` (60MB, 19,135 行 × 33 列)
-
----
-
-### ✅ 核心字段（33列）
-
-#### 基础信息
-`uniprot_id` | `protein_name` | `gene_names` | `sequence` | `mass`
-
-#### 功能注释
-`function` | `subcellular_location` | `diseases` | `ptms`
-
-#### GO注释
-`go_biological_process` | `go_molecular_function` | `go_cellular_component`
-
-#### 基因ID
-`ncbi_gene_id` | `ensembl_gene_id` | `hgnc_id` | `symbol` | `gene_synonyms`
-
-#### 结构信息
-`alphafold_id` | `alphafold_mean_plddt` | `pdb_ids` | `domains`
-
-#### 交互数据
-`string_ids` | `keywords`
-
----
-
-### 🚀 快速使用
-
-```python
-import pandas as pd
-
-df = pd.read_csv('data/processed/protein_master_v6_clean.tsv', sep='\t')
-
-tp53 = df[df['gene_names'].str.contains('TP53', na=False)]
-print(tp53[['uniprot_id', 'ncbi_gene_id', 'alphafold_mean_plddt']])
+# Protian Entity — Human Protein & RNA Knowledge Graph
+
+Protian Entity is an industrial-grade data product: curated L1 entity tables (Protein, RNA) with reproducible ETL, validation contracts, and QA artifacts.
+
+Quick summary
+- Primary entity: human Protein L1 table — `data/processed/protein_master_v6_clean.tsv` (v6 snapshot)
+- RNA L1 artifacts are published as release assets (`rna-l1-v1`) with `manifest.json` and QA reports
+- Validation tool: `tools/kg_validate_table.py`
+
+Badges
+- CI: `.github/workflows/data-qa.yml` runs data validation on pull requests and on pushes to `main`, and uploads the validation report as a workflow artifact
+
+Contents
+- `data/processed/` — curated TSV L1 tables
+- `pipelines/` — ETL code and contracts
+- `docs/` — schema, data dictionary, quality gates, and admin guides
+- `tools/` — validation and QA helpers
+
+Quickstart (validate current protein table)
+```bash
+# run the table validator and write a report
+python3 tools/kg_validate_table.py \
+  --contract pipelines/protein/contracts/protein_master_v6.json \
+  --table data/processed/protein_master_v6_clean.tsv \
+  --out build/validate/protein_master_v6_report.json
+
+# open the JSON report
+less build/validate/protein_master_v6_report.json
 ```
 
----
+CI / Non-interactive usage
+- CI runs must be non-interactive. Tools and scripts must support a CI mode, enabled by either a `--yes` / `--ci` flag or the `CI=true` environment variable.
+- To run locally in non-interactive mode: `CI=true ./scripts/run_full_build.sh --yes` (if present).
+- See `docs/GITHUB_ACTIONS_SETTINGS.md` for repository-level Actions configuration (Admins only).
 
-### 📁 辅助数据
+Data release process
+1. Run pipelines to produce L1 artifacts and `manifest.json` (row counts, checksums, git commit, build timestamp).
+2. Run all validation contracts and attach reports.
+3. Create a GitHub Release and upload artifacts + `manifest.json`.
 
-```
-data/processed/
-├── alphafold_quality.tsv       # AlphaFold 每残基质量
-├── protein_edges.tsv           # STRING 交互网络（约 88 万条）
-├── ptm_sites.tsv               # 翻译后修饰（约 23 万条）
-└── pathway_members.tsv         # 通路成员（约 12 万条）
-```
+Governance & contribution
+- See `CONTRIBUTING.md` for branching, validation, and release checklists.
+- Use PR templates and attach validation report artifacts for any change that modifies tables or contracts.
 
----
+What stays out of git
+- Large raw inputs and big output artifacts should be published as release assets or stored in object storage — do NOT commit >100MB files.
 
-### 🎯 设计原则
+License
+- MIT (see `LICENSE`)
 
-- ✅ **一级信息为主**：提取原始数据，不做推断
-- ✅ **以蛋白质为主体**：每个 UniProt ID 一行
-- ✅ **保留完整原文**：功能描述含证据代码和 PubMed 引用
-- ✅ **多源整合**：7 个主要生物数据库
+Contact
+- Repo owner: `@hazelian0619` — open an issue or PR for changes
 
 ---
 
-### 📋 项目状态
-
-**阶段**：✅ 完成  \
-**版本**：v6_clean  \
-**更新**：2025-10-27
-
-**数据源**：UniProt | AlphaFold | HGNC | STRING | GO | PDB  \
-**时效性**：截止 2025-10-26
-
----
-
-## 🧬 RNA 实体（L1, v1）
-
-RNA（miRNA + mRNA/transcript）属于 L1 实体表；输出体积较大（单文件 >100MB），所以：
-
-- 数据产物不直接 commit 进仓库（避免触发 GitHub 单文件限制、避免仓库膨胀）
-- 统一通过 **GitHub Releases** 发布，并附带可核验的 `manifest.json` 与 QA 报告
-
-- Release: https://github.com/hazelian0619/protian-entity/releases/tag/rna-l1-v1
-- 使用说明：`pipelines/rna/README.md`
-- 规范：`docs/rna/README.md`
+This README provides a concise starting point. For field-level schema and examples, see `docs/DATA_DICTIONARY.md` and `data/processed/README.md`.
@@ -0,0 +1,11 @@
+# 1025 项目 - 处理后数据说明
+
+本目录包含人类蛋白质知识图谱的核心数据集，整合了多个权威生物信息学数据库的信息。
+
+**数据更新时间**：2025-10-26
+**数据版本**：v6
+**物种**：Homo sapiens (人类，Taxonomy ID: 9606)
+
+---
+
+（原内容已保留为中文）
@@ -1,12 +1,11 @@
-# 1025 项目 - 处理后数据说明
+# Processed data — Protian Entity (Human Protein Knowledge Graph)
 
-## 概述
+This folder contains the final, curated data tables used as L1 entity products. Files are UTF-8 encoded TSVs and carry provenance information (source, fetch date, source_version).
 
-本目录包含人类蛋白质知识图谱的核心数据集，整合了多个权威生物信息学数据库的信息。
-
-**数据更新时间**：2025-10-26
-**数据版本**：v6
-**物种**：Homo sapiens (人类，Taxonomy ID: 9606)
+Summary
+- Data snapshot date: 2025-10-26
+- Version: v6
+- Species: Homo sapiens (Taxonomy ID: 9606)
 
 ---
 

@@ -0,0 +1,14 @@
+# 数据字典（Data Dictionary）
+
+## 概述
+
+本文档详细描述了protein_master_v6_clean.tsv主表的所有字段，包括数据类型、来源、说明和空值情况。
+
+**主表**：protein_master_v6_clean.tsv
+**行数**：19,135条
+**列数**：33列
+**更新日期**：2025-10-26
+
+---
+
+（原文已保留为中文）