Merged
4 changes: 2 additions & 2 deletions README.md
@@ -214,9 +214,9 @@ text = get_plain_text_fast(html_source)

## Pipeline

1. [HTML pre-dedup](jupyter/html-pre-dedup/main.ipynb)
1. [HTML pre-dedup](jupyter/html-pre-dedup/README.md)
2. [domain clustering](jupyter/domain_clustering/README.md)
3. [layout clustering](jupyter/layout-clustering/main.ipynb)
3. [layout clustering](jupyter/layout-clustering/README.md)
4. [typical layout node selection](jupyter/typical-html-select/main.ipynb)
5. [HTML node select by LLM](jupyter/html-node-select-llm/main.ipynb)
6. [html parse layout by layout](jupyter/html-parse-by-layout/main.ipynb)
66 changes: 63 additions & 3 deletions jupyter/html-pre-dedup/README.md
@@ -1,25 +1,85 @@
# Pre-dedup steps, run in the following order

## cc_dedup_fir.ipynb
## Workflow

![img.png](assets/img.png)

## Execution steps

### cc_dedup_fir.ipynb

Input parameters:

```
DUMPS: the dumps that the CC WARC files belong to
CC_WARC: S3 path of the CC WARC files, excluding the dump; the program runs in batches, one per dump
output_path: S3 output path
```

Output data structure:

```json
{
"track_id": "a6dcf951-42de-4ec6-a0d5-9af1ee73ab03",
"sub_path": "dump",
"hash_html": "0005dcd7fdce5a28efb9eec848b2caf95da971e1d58598e3ff52c3c4eb66882d"
}
```
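
A minimal sketch of how a record of this shape might be built. It assumes `hash_html` is the SHA-256 hex digest of the UTF-8 encoded HTML (the 64-character value above is consistent with that, but the README does not confirm it) and that `track_id` is a random UUID. `make_dedup_record` is a hypothetical helper, not a function from the notebook.

```python
import hashlib
import uuid
from typing import Optional


def make_dedup_record(html: str, sub_path: str, track_id: Optional[str] = None) -> dict:
    """Build a first-pass dedup record for one HTML page (hypothetical helper)."""
    # Assumption: hash_html is the SHA-256 hex digest of the UTF-8 encoded HTML.
    hash_html = hashlib.sha256(html.encode("utf-8")).hexdigest()
    return {
        # Assumption: a random UUID is generated when no track_id is supplied.
        "track_id": track_id or str(uuid.uuid4()),
        "sub_path": sub_path,
        "hash_html": hash_html,
    }


record = make_dedup_record("<html><body>hello</body></html>", sub_path="dump")
```

Identical HTML always yields the same `hash_html`, which is what lets the later steps detect duplicates without comparing page bodies.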

## cc_dedup_sec.ipynb
### cc_dedup_sec.ipynb

Input parameters:

```
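DUMPS: the dumps that the CC WARC files belong to
base_input_path: S3 path produced by step 1 (cc_dedup_fir.ipynb), excluding the dump; the program runs in batches, one per dump
This input path must start with s3a, otherwise the program raises an error
already_exist_id_path: path that stores the ids that have already been deduplicated
output_path: S3 output path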
```
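
Because `base_input_path` must start with `s3a`, it can help to check the path before launching the job. The sketch below is illustrative only: `require_s3a` and `to_s3a` are hypothetical helpers, and the bucket name is a placeholder.

```python
def require_s3a(path: str) -> str:
    """Return the path unchanged if it uses the s3a scheme, else raise ValueError."""
    if not path.startswith("s3a://"):
        raise ValueError(f"base_input_path must start with s3a://, got: {path!r}")
    return path


def to_s3a(path: str) -> str:
    """Rewrite an s3:// path to s3a://; any other path is returned unchanged."""
    if path.startswith("s3://"):
        return "s3a://" + path[len("s3://"):]
    return path


# Placeholder bucket and prefix, for illustration only.
base_input_path = require_s3a(to_s3a("s3://example-bucket/dedup/step1"))
```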

Output data structure:

```json
{
"track_id": "a6dcf951-42de-4ec6-a0d5-9af1ee73ab03",
"sub_path": "dump",
"hash_html": "0005dcd7fdce5a28efb9eec848b2caf95da971e1d58598e3ff52c3c4eb66882d"
}
```
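
The second pass can be pictured as keeping one record per `hash_html`. The sketch below is a plain-Python illustration, not the notebook's Spark code. The README does not say how the ids under `already_exist_id_path` are keyed, so the sketch assumes they can be mapped to a set of `hash_html` values seen in earlier runs. `dedup_by_hash` is a hypothetical name.

```python
from typing import Iterable, Iterator, Set


def dedup_by_hash(records: Iterable[dict], already_seen: Set[str]) -> Iterator[dict]:
    """Yield the first record for each hash_html that is not in already_seen."""
    seen = set(already_seen)  # copy so the caller's set is not mutated
    for rec in records:
        h = rec["hash_html"]
        if h in seen:
            continue  # duplicate within this batch or known from an earlier run
        seen.add(h)
        yield rec


rows = [
    {"track_id": "a", "sub_path": "dump", "hash_html": "h1"},
    {"track_id": "b", "sub_path": "dump", "hash_html": "h1"},
    {"track_id": "c", "sub_path": "dump", "hash_html": "h2"},
    {"track_id": "d", "sub_path": "dump", "hash_html": "h3"},
]
unique_rows = list(dedup_by_hash(rows, already_seen={"h3"}))
```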

## cc_dedup_thi.ipynb
### cc_dedup_thi.ipynb

Input parameters:

```
DUMPS: the dumps that the CC WARC files belong to
CC_WARC: S3 path of the CC WARC files, excluding the dump; the program runs in batches, one per dump
base_unique_path: S3 path produced by step 2 (cc_dedup_sec.ipynb)
output_path: S3 output path
```

Output data structure:

```json
{
"date": 1369152963,
"response_header": {
"Connection": "close",
"Content-Type": "text/html"
},
"track_id": "0008dd8c-4f4b-424a-af7b-ebd73fabc49a",
"remark": {
"warc_headers": {
"WARC-IP-Address": "186.192.82.88"
}
},
"html": "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" </html>",
"content_charset": "utf-8",
"content_length": 97605,
"url": "http://www.test.com/",
"status": 200,
"sub_path": "dump",
"raw_warc_path": "s3://xxx" // path of the CC source data
}
```
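
A rough picture of the third step, assuming it keeps the full WARC-derived record for every `track_id` that survived step 2. The notebook does this with Spark; the plain-Python sketch below only illustrates the selection logic, and `select_unique_pages` is a hypothetical name.

```python
from typing import Iterable, Iterator, Set


def select_unique_pages(warc_records: Iterable[dict], unique_ids: Set[str]) -> Iterator[dict]:
    """Yield only the records whose track_id is in the step-2 unique id set."""
    for rec in warc_records:
        if rec.get("track_id") in unique_ids:
            yield rec


pages = [
    {"track_id": "t1", "url": "http://www.test.com/", "status": 200},
    {"track_id": "t2", "url": "http://www.test.com/dup", "status": 200},
]
kept = list(select_unique_pages(pages, unique_ids={"t1"}))
```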
Binary file added jupyter/html-pre-dedup/assets/img.png
4 changes: 2 additions & 2 deletions jupyter/html-pre-dedup/cc_dedup_thr.ipynb
@@ -109,7 +109,7 @@
"source": [
"def parse_path_to_html(iter):\n",
" for fpath in iter:\n",
" for zz in read_s3_rows(fpath):\n",
" for zz in read_s3_rows(fpath, use_stream=True):\n",
" try:\n",
" detail_datas = json_loads(zz.value)\n",
" except:\n",
@@ -167,7 +167,7 @@
" StructField(\"track_id\", StringType(), True),\n",
"])\n",
"\n",
"dump_ods_df_with_struct = unique_id_df.withColumn(\"jsocn_strut\", from_json(unique_id_df.value, unique_schema))\n",
"dump_ods_df_with_struct = unique_id_df.withColumn(\"json_struct\", from_json(unique_id_df.value, unique_schema))\n",
"unique_id_v_df = dump_ods_df_with_struct.select(\"json_struct.*\")"
]
},
75 changes: 67 additions & 8 deletions jupyter/layout-clustering/README.md
@@ -1,10 +1,69 @@
# Clustering parameters
# Layout clustering

## Workflow

![img.png](assets/img.png)

## Input parameters

```
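DATA_SIZE_PER_BATCH: amount of data per batch

MAX_LAYOUTLIST_SIZE: maximum length of the cluster list
SIMILARITY_THRESHOLD: similarity threshold
TIMEOUT_SECONDS: clustering timeout
SIM_TIMEOUT_SECONDS: similarity computation timeout
MAX_OUTPUT_ROW_SIZE: size limit for an output row
MAX_OUTPUT_FILE_SIZE: size limit for an output file
NUM_PARTITIONS: Spark partition setting
WRITE_NUM_PARTITIONS: Spark partition setting for writing similarity output
RATE_MAP: filtering criteria for the clustering data

ERROR_PATH: path for error logs
INPUT_PATH: input data path for step 1 (choose domain)
CHOOSE_OUTPUT_PATH: output data path for step 1 (choose domain)
BASE_LAYOUT_OUTPUT_PATH: base output data path for step 2 (layout)
BASE_SIM_OUTPUT_PATH: base output data path for step 3 (sim)
BASE_INDEX_OUTPUT_PATH: base output data path for step 4 (index)
BASE_DOMAIN_PATH: storage path for intermediate valid domain data
BASE_BATCH_PATH: storage path for intermediate batch layout data

INPUT_PATH: input data path for choose domain
CHOOSE_DOMAIN_OUTPUT_PATH: output data path for choose domain

CLUSTER_LAYOUT_BASE_OUTPUT_PATH: base output data path for layout
BASE_DOMAIN_PATH: base path for intermediate valid domain data
BASE_BATCH_PATH: base path for intermediate batch layout data

LAYOUT_SIM_BASE_OUTPUT_PATH: base output data path for layout data

LAYOUT_INDEX_BASE_OUTPUT_PATH: base output data path for layout index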
```

## Output data structure

#### layout data

```json
{
"track_id": "31f5a241-f6d8-40f5-93c8-72b61c5dd790",
"html": "<!DOCTYPE html>\n<html>...</html>",
"url": "http://www.test.com/",
"layout_id": "www.test.com_0",
"max_layer_n": 6,
"url_host_name": "www.test.com",
"raw_warc_path": "s3://xxx"
}
```

#### layout index

```json
{
"layout_id": "www.test.com_0",
"url_host_name": "www.test.com",
"count": 50,
"files": [
{
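"filepath": "s3://xxx", // file path
"offset": 0, // offset of this layout_id's data within the file
"length": 50, // length of this layout_id's data within the file
"record_count": 50, // number of records for this layout_id stored in this filepath
"timestamp": "2025-06-09 20:04:04" // timestamp of the last update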
}
]
}
```
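
A small consistency check for a layout index entry can catch malformed output early. The sketch below rests on two assumptions inferred from the example above and not stated by the README: `layout_id` is `url_host_name` followed by `_` and a suffix, and `count` equals the sum of `record_count` over `files`. `check_layout_index` is a hypothetical helper.

```python
from typing import List


def check_layout_index(entry: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    host = entry["url_host_name"]
    # Assumption: layout_id has the form "<url_host_name>_<suffix>".
    if not entry["layout_id"].startswith(host + "_"):
        problems.append("layout_id does not start with url_host_name")
    # Assumption: count is the total number of records across all files.
    total = sum(f["record_count"] for f in entry["files"])
    if total != entry["count"]:
        problems.append(f"count is {entry['count']} but files hold {total} records")
    return problems


entry = {
    "layout_id": "www.test.com_0",
    "url_host_name": "www.test.com",
    "count": 50,
    "files": [{"filepath": "s3://xxx", "offset": 0, "length": 50, "record_count": 50}],
}
issues = check_layout_index(entry)
```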
Binary file added jupyter/layout-clustering/assets/img.png