intsig-textin/acge_text_embedding

pipeline_tag: sentence-similarity
tags: mteb, sentence-transformers, feature-extraction, sentence-similarity
model-index name: acge_text_embedding
model-index results:
task dataset metrics
type
STS
type name config split revision
C-MTEB/AFQMC
MTEB AFQMC
default
validation
b44c3b011063adb25877c13823db83bb193913c4
type value
cos_sim_pearson
54.03434872650919
type value
cos_sim_spearman
58.80730796688325
type value
euclidean_pearson
57.47231387497989
type value
euclidean_spearman
58.80775026351807
type value
manhattan_pearson
57.46332720141574
type value
manhattan_spearman
58.80196022940078
task dataset metrics
type
STS
type name config split revision
C-MTEB/ATEC
MTEB ATEC
default
test
0f319b1142f28d00e055a6770f3f726ae9b7d865
type value
cos_sim_pearson
53.52621290548175
type value
cos_sim_spearman
57.945227768312144
type value
euclidean_pearson
61.17041394151802
type value
euclidean_spearman
57.94553287835657
type value
manhattan_pearson
61.168327500057885
type value
manhattan_spearman
57.94477516925043
task dataset metrics
type
Classification
type name config split revision
mteb/amazon_reviews_multi
MTEB AmazonReviewsClassification (zh)
zh
test
1399c76144fd37290681b995c656ef9b2e06e26d
type value
accuracy
48.538000000000004
type value
f1
46.59920995594044
task dataset metrics
type
STS
type name config split revision
C-MTEB/BQ
MTEB BQ
default
test
e3dda5e115e487b39ec7e618c0c6a29137052a55
type value
cos_sim_pearson
68.27529991817154
type value
cos_sim_spearman
70.37095914176643
type value
euclidean_pearson
69.42690712802727
type value
euclidean_spearman
70.37017971889912
type value
manhattan_pearson
69.40264877917839
type value
manhattan_spearman
70.34786744049524
task dataset metrics
type
Clustering
type name config split revision
C-MTEB/CLSClusteringP2P
MTEB CLSClusteringP2P
default
test
4b6227591c6c1a73bc76b1055f3b7f3588e72476
type value
v_measure
47.08027536192709
task dataset metrics
type
Clustering
type name config split revision
C-MTEB/CLSClusteringS2S
MTEB CLSClusteringS2S
default
test
e458b3f5414b62b7f9f83499ac1f5497ae2e869f
type value
v_measure
44.0526024940363
task dataset metrics
type
Reranking
type name config split revision
C-MTEB/CMedQAv1-reranking
MTEB CMedQAv1
default
test
8d7f1e942507dac42dc58017c1a001c3717da7df
type value
map
88.65974993133156
type value
mrr
90.64761904761905
task dataset metrics
type
Reranking
type name config split revision
C-MTEB/CMedQAv2-reranking
MTEB CMedQAv2
default
test
23d186750531a14a0357ca22cd92d712fd512ea0
type value
map
88.90396838907245
type value
mrr
90.90932539682541
task dataset metrics
type
Retrieval
type name config split revision
C-MTEB/CmedqaRetrieval
MTEB CmedqaRetrieval
default
dev
cd540c506dae1cf9e9a59c3e06f42030d54e7301
type value
map_at_1
26.875
type value
map_at_10
39.995999999999995
type value
map_at_100
41.899
type value
map_at_1000
42.0
type value
map_at_3
35.414
type value
map_at_5
38.019
type value
mrr_at_1
40.635
type value
mrr_at_10
48.827
type value
mrr_at_100
49.805
type value
mrr_at_1000
49.845
type value
mrr_at_3
46.145
type value
mrr_at_5
47.693999999999996
type value
ndcg_at_1
40.635
type value
ndcg_at_10
46.78
type value
ndcg_at_100
53.986999999999995
type value
ndcg_at_1000
55.684
type value
ndcg_at_3
41.018
type value
ndcg_at_5
43.559
type value
precision_at_1
40.635
type value
precision_at_10
10.427999999999999
type value
precision_at_100
1.625
type value
precision_at_1000
0.184
type value
precision_at_3
23.139000000000003
type value
precision_at_5
17.004
type value
recall_at_1
26.875
type value
recall_at_10
57.887
type value
recall_at_100
87.408
type value
recall_at_1000
98.721
type value
recall_at_3
40.812
type value
recall_at_5
48.397
task dataset metrics
type
PairClassification
type name config split revision
C-MTEB/CMNLI
MTEB Cmnli
default
validation
41bc36f332156f7adc9e38f53777c959b2ae9766
type value
cos_sim_accuracy
83.43956704750451
type value
cos_sim_ap
90.49172854352659
type value
cos_sim_f1
84.28475486903963
type value
cos_sim_precision
80.84603822203135
type value
cos_sim_recall
88.02899228431144
type value
dot_accuracy
83.43956704750451
type value
dot_ap
90.46317132695233
type value
dot_f1
84.28794294628929
type value
dot_precision
80.51948051948052
type value
dot_recall
88.4264671498714
type value
euclidean_accuracy
83.43956704750451
type value
euclidean_ap
90.49171785256486
type value
euclidean_f1
84.28235820561584
type value
euclidean_precision
80.8022308022308
type value
euclidean_recall
88.07575403320084
type value
manhattan_accuracy
83.55983162958509
type value
manhattan_ap
90.48046779812815
type value
manhattan_f1
84.45354259069714
type value
manhattan_precision
82.21877767936226
type value
manhattan_recall
86.81318681318682
type value
max_accuracy
83.55983162958509
type value
max_ap
90.49172854352659
type value
max_f1
84.45354259069714
task dataset metrics
type
Retrieval
type name config split revision
C-MTEB/CovidRetrieval
MTEB CovidRetrieval
default
dev
1271c7809071a13532e05f25fb53511ffce77117
type value
map_at_1
68.54599999999999
type value
map_at_10
77.62400000000001
type value
map_at_100
77.886
type value
map_at_1000
77.89
type value
map_at_3
75.966
type value
map_at_5
76.995
type value
mrr_at_1
68.915
type value
mrr_at_10
77.703
type value
mrr_at_100
77.958
type value
mrr_at_1000
77.962
type value
mrr_at_3
76.08
type value
mrr_at_5
77.118
type value
ndcg_at_1
68.809
type value
ndcg_at_10
81.563
type value
ndcg_at_100
82.758
type value
ndcg_at_1000
82.864
type value
ndcg_at_3
78.29
type value
ndcg_at_5
80.113
type value
precision_at_1
68.809
type value
precision_at_10
9.463000000000001
type value
precision_at_100
1.001
type value
precision_at_1000
0.101
type value
precision_at_3
28.486
type value
precision_at_5
18.019
type value
recall_at_1
68.54599999999999
type value
recall_at_10
93.625
type value
recall_at_100
99.05199999999999
type value
recall_at_1000
99.895
type value
recall_at_3
84.879
type value
recall_at_5
89.252
task dataset metrics
type
Retrieval
type name config split revision
C-MTEB/DuRetrieval
MTEB DuRetrieval
default
dev
a1a333e290fe30b10f3f56498e3a0d911a693ced
type value
map_at_1
25.653
type value
map_at_10
79.105
type value
map_at_100
81.902
type value
map_at_1000
81.947
type value
map_at_3
54.54599999999999
type value
map_at_5
69.226
type value
mrr_at_1
89.35
type value
mrr_at_10
92.69
type value
mrr_at_100
92.77
type value
mrr_at_1000
92.774
type value
mrr_at_3
92.425
type value
mrr_at_5
92.575
type value
ndcg_at_1
89.35
type value
ndcg_at_10
86.55199999999999
type value
ndcg_at_100
89.35300000000001
type value
ndcg_at_1000
89.782
type value
ndcg_at_3
85.392
type value
ndcg_at_5
84.5
type value
precision_at_1
89.35
type value
precision_at_10
41.589999999999996
type value
precision_at_100
4.781
type value
precision_at_1000
0.488
type value
precision_at_3
76.683
type value
precision_at_5
65.06
type value
recall_at_1
25.653
type value
recall_at_10
87.64999999999999
type value
recall_at_100
96.858
type value
recall_at_1000
99.13300000000001
type value
recall_at_3
56.869
type value
recall_at_5
74.024
task dataset metrics
type
Retrieval
type name config split revision
C-MTEB/EcomRetrieval
MTEB EcomRetrieval
default
dev
687de13dc7294d6fd9be10c6945f9e8fec8166b9
type value
map_at_1
52.1
type value
map_at_10
62.629999999999995
type value
map_at_100
63.117000000000004
type value
map_at_1000
63.134
type value
map_at_3
60.267
type value
map_at_5
61.777
type value
mrr_at_1
52.1
type value
mrr_at_10
62.629999999999995
type value
mrr_at_100
63.117000000000004
type value
mrr_at_1000
63.134
type value
mrr_at_3
60.267
type value
mrr_at_5
61.777
type value
ndcg_at_1
52.1
type value
ndcg_at_10
67.596
type value
ndcg_at_100
69.95
type value
ndcg_at_1000
70.33500000000001
type value
ndcg_at_3
62.82600000000001
type value
ndcg_at_5
65.546
type value
precision_at_1
52.1
type value
precision_at_10
8.309999999999999
type value
precision_at_100
0.941
type value
precision_at_1000
0.097
type value
precision_at_3
23.400000000000002
type value
precision_at_5
15.36
type value
recall_at_1
52.1
type value
recall_at_10
83.1
type value
recall_at_100
94.1
type value
recall_at_1000
97.0
type value
recall_at_3
70.19999999999999
type value
recall_at_5
76.8
task dataset metrics
type
Classification
type name config split revision
C-MTEB/IFlyTek-classification
MTEB IFlyTek
default
validation
421605374b29664c5fc098418fe20ada9bd55f8a
type value
accuracy
51.773759138130046
type value
f1
40.341407912920054
task dataset metrics
type
Classification
type name config split revision
C-MTEB/JDReview-classification
MTEB JDReview
default
test
b7c64bd89eb87f8ded463478346f76731f07bf8b
type value
accuracy
86.69793621013133
type value
ap
55.46718958939327
type value
f1
81.48228915952436
task dataset metrics
type
STS
type name config split revision
C-MTEB/LCQMC
MTEB LCQMC
default
test
17f9b096f80380fce5ed12a9be8be7784b337daf
type value
cos_sim_pearson
71.1397780205448
type value
cos_sim_spearman
78.17368193033309
type value
euclidean_pearson
77.4849177602368
type value
euclidean_spearman
78.17369079663212
type value
manhattan_pearson
77.47344305182406
type value
manhattan_spearman
78.16454335155387
task dataset metrics
type
Reranking
type name config split revision
C-MTEB/Mmarco-reranking
MTEB MMarcoReranking
default
dev
8e0c766dbe9e16e1d221116a3f36795fbade07f6
type value
map
27.76160559006673
type value
mrr
28.02420634920635
task dataset metrics
type
Retrieval
type name config split revision
C-MTEB/MMarcoRetrieval
MTEB MMarcoRetrieval
default
dev
539bbde593d947e2a124ba72651aafc09eb33fc2
type value
map_at_1
65.661
type value
map_at_10
74.752
type value
map_at_100
75.091
type value
map_at_1000
75.104
type value
map_at_3
72.997
type value
map_at_5
74.119
type value
mrr_at_1
67.923
type value
mrr_at_10
75.376
type value
mrr_at_100
75.673
type value
mrr_at_1000
75.685
type value
mrr_at_3
73.856
type value
mrr_at_5
74.82799999999999
type value
ndcg_at_1
67.923
type value
ndcg_at_10
78.424
type value
ndcg_at_100
79.95100000000001
type value
ndcg_at_1000
80.265
type value
ndcg_at_3
75.101
type value
ndcg_at_5
76.992
type value
precision_at_1
67.923
type value
precision_at_10
9.474
type value
precision_at_100
1.023
type value
precision_at_1000
0.105
type value
precision_at_3
28.319
type value
precision_at_5
17.986
type value
recall_at_1
65.661
type value
recall_at_10
89.09899999999999
type value
recall_at_100
96.023
type value
recall_at_1000
98.455
type value
recall_at_3
80.314
type value
recall_at_5
84.81
task dataset metrics
type
Classification
type name config split revision
mteb/amazon_massive_intent
MTEB MassiveIntentClassification (zh-CN)
zh-CN
test
31efe3c427b0bae9c22cbb560b8f15491cc6bed7
type value
accuracy
75.86751849361131
type value
f1
73.04918450508
task dataset metrics
type
Classification
type name config split revision
mteb/amazon_massive_scenario
MTEB MassiveScenarioClassification (zh-CN)
zh-CN
test
7d571f92784cd94a019292a1f45445077d0ef634
type value
accuracy
78.4364492266308
type value
f1
78.120686034844
task dataset metrics
type
Retrieval
type name config split revision
C-MTEB/MedicalRetrieval
MTEB MedicalRetrieval
default
dev
2039188fb5800a9803ba5048df7b76e6fb151fc6
type value
map_at_1
55.00000000000001
type value
map_at_10
61.06399999999999
type value
map_at_100
61.622
type value
map_at_1000
61.663000000000004
type value
map_at_3
59.583
type value
map_at_5
60.373
type value
mrr_at_1
55.2
type value
mrr_at_10
61.168
type value
mrr_at_100
61.726000000000006
type value
mrr_at_1000
61.767
type value
mrr_at_3
59.683
type value
mrr_at_5
60.492999999999995
type value
ndcg_at_1
55.00000000000001
type value
ndcg_at_10
64.098
type value
ndcg_at_100
67.05
type value
ndcg_at_1000
68.262
type value
ndcg_at_3
61.00600000000001
type value
ndcg_at_5
62.439
type value
precision_at_1
55.00000000000001
type value
precision_at_10
7.37
type value
precision_at_100
0.881
type value
precision_at_1000
0.098
type value
precision_at_3
21.7
type value
precision_at_5
13.719999999999999
type value
recall_at_1
55.00000000000001
type value
recall_at_10
73.7
type value
recall_at_100
88.1
type value
recall_at_1000
97.8
type value
recall_at_3
65.10000000000001
type value
recall_at_5
68.60000000000001
task dataset metrics
type
Classification
type name config split revision
C-MTEB/MultilingualSentiment-classification
MTEB MultilingualSentiment
default
validation
46958b007a63fdbf239b7672c25d0bea67b5ea1a
type value
accuracy
77.52666666666667
type value
f1
77.49784731367215
task dataset metrics
type
PairClassification
type name config split revision
C-MTEB/OCNLI
MTEB Ocnli
default
validation
66e76a618a34d6d565d5538088562851e6daa7ec
type value
cos_sim_accuracy
81.10449377368705
type value
cos_sim_ap
85.17742765935606
type value
cos_sim_f1
83.00094966761633
type value
cos_sim_precision
75.40983606557377
type value
cos_sim_recall
92.29144667370645
type value
dot_accuracy
81.10449377368705
type value
dot_ap
85.17143850809614
type value
dot_f1
83.01707779886148
type value
dot_precision
75.36606373815677
type value
dot_recall
92.39704329461456
type value
euclidean_accuracy
81.10449377368705
type value
euclidean_ap
85.17856775343333
type value
euclidean_f1
83.00094966761633
type value
euclidean_precision
75.40983606557377
type value
euclidean_recall
92.29144667370645
type value
manhattan_accuracy
81.05035192203573
type value
manhattan_ap
85.14464459395809
type value
manhattan_f1
82.96155671570953
type value
manhattan_precision
75.3448275862069
type value
manhattan_recall
92.29144667370645
type value
max_accuracy
81.10449377368705
type value
max_ap
85.17856775343333
type value
max_f1
83.01707779886148
task dataset metrics
type
Classification
type name config split revision
C-MTEB/OnlineShopping-classification
MTEB OnlineShopping
default
test
e610f2ebd179a8fda30ae534c3878750a96db120
type value
accuracy
93.71000000000001
type value
ap
91.83202232349356
type value
f1
93.69900560334331
task dataset metrics
type
STS
type name config split revision
C-MTEB/PAWSX
MTEB PAWSX
default
test
9c6a90e430ac22b5779fb019a23e820b11a8b5e1
type value
cos_sim_pearson
39.175047651512415
type value
cos_sim_spearman
45.51434675777896
type value
euclidean_pearson
44.864110004132286
type value
euclidean_spearman
45.516433048896076
type value
manhattan_pearson
44.87153627706517
type value
manhattan_spearman
45.52862617925012
task dataset metrics
type
STS
type name config split revision
C-MTEB/QBQTC
MTEB QBQTC
default
test
790b0510dc52b1553e8c49f3d2afb48c0e5c48b7
type value
cos_sim_pearson
34.249579701429084
type value
cos_sim_spearman
37.30903127368978
type value
euclidean_pearson
35.129438425253355
type value
euclidean_spearman
37.308544018709085
type value
manhattan_pearson
35.08936153503652
type value
manhattan_spearman
37.25582901077839
task dataset metrics
type
STS
type name config split revision
mteb/sts22-crosslingual-sts
MTEB STS22 (zh)
zh
test
eea2b4fe26a775864c896887d910b76a8098ad3f
type value
cos_sim_pearson
61.29309637460004
type value
cos_sim_spearman
65.85136090376717
type value
euclidean_pearson
64.04783990953557
type value
euclidean_spearman
65.85036859610366
type value
manhattan_pearson
63.995852552712186
type value
manhattan_spearman
65.86508416749417
task dataset metrics
type
STS
type name config split revision
C-MTEB/STSB
MTEB STSB
default
test
0cde68302b3541bb8b3c340dc0644b0b745b3dc0
type value
cos_sim_pearson
81.5595940455587
type value
cos_sim_spearman
82.72654634579749
type value
euclidean_pearson
82.4892721061365
type value
euclidean_spearman
82.72678504228253
type value
manhattan_pearson
82.4770861422454
type value
manhattan_spearman
82.71137469783162
task dataset metrics
type
Reranking
type name config split revision
C-MTEB/T2Reranking
MTEB T2Reranking
default
dev
76631901a18387f85eaa53e5450019b87ad58ef9
type value
map
66.6159547610527
type value
mrr
76.35739406347057
task dataset metrics
type
Retrieval
type name config split revision
C-MTEB/T2Retrieval
MTEB T2Retrieval
default
dev
8731a845f1bf500a4f111cf1070785c793d10e64
type value
map_at_1
27.878999999999998
type value
map_at_10
77.517
type value
map_at_100
81.139
type value
map_at_1000
81.204
type value
map_at_3
54.728
type value
map_at_5
67.128
type value
mrr_at_1
90.509
type value
mrr_at_10
92.964
type value
mrr_at_100
93.045
type value
mrr_at_1000
93.048
type value
mrr_at_3
92.551
type value
mrr_at_5
92.81099999999999
type value
ndcg_at_1
90.509
type value
ndcg_at_10
85.075
type value
ndcg_at_100
88.656
type value
ndcg_at_1000
89.25699999999999
type value
ndcg_at_3
86.58200000000001
type value
ndcg_at_5
85.138
type value
precision_at_1
90.509
type value
precision_at_10
42.05
type value
precision_at_100
5.013999999999999
type value
precision_at_1000
0.516
type value
precision_at_3
75.551
type value
precision_at_5
63.239999999999995
type value
recall_at_1
27.878999999999998
type value
recall_at_10
83.941
type value
recall_at_100
95.568
type value
recall_at_1000
98.55000000000001
type value
recall_at_3
56.374
type value
recall_at_5
70.435
task dataset metrics
type
Classification
type name config split revision
C-MTEB/TNews-classification
MTEB TNews
default
validation
317f262bf1e6126357bbe89e875451e4b0938fe4
type value
accuracy
53.687
type value
f1
51.86911933364655
task dataset metrics
type
Clustering
type name config split revision
C-MTEB/ThuNewsClusteringP2P
MTEB ThuNewsClusteringP2P
default
test
5798586b105c0434e4f0fe5e767abe619442cf93
type value
v_measure
74.65887489872564
task dataset metrics
type
Clustering
type name config split revision
C-MTEB/ThuNewsClusteringS2S
MTEB ThuNewsClusteringS2S
default
test
8a8b2caeda43f39e13c4bc5bea0f8a667896e10d
type value
v_measure
69.00410995984436
task dataset metrics
type
Retrieval
type name config split revision
C-MTEB/VideoRetrieval
MTEB VideoRetrieval
default
dev
58c2597a5943a2ba48f4668c3b90d796283c5639
type value
map_at_1
59.4
type value
map_at_10
69.214
type value
map_at_100
69.72699999999999
type value
map_at_1000
69.743
type value
map_at_3
67.717
type value
map_at_5
68.782
type value
mrr_at_1
59.4
type value
mrr_at_10
69.214
type value
mrr_at_100
69.72699999999999
type value
mrr_at_1000
69.743
type value
mrr_at_3
67.717
type value
mrr_at_5
68.782
type value
ndcg_at_1
59.4
type value
ndcg_at_10
73.32300000000001
type value
ndcg_at_100
75.591
type value
ndcg_at_1000
75.98700000000001
type value
ndcg_at_3
70.339
type value
ndcg_at_5
72.246
type value
precision_at_1
59.4
type value
precision_at_10
8.59
type value
precision_at_100
0.96
type value
precision_at_1000
0.099
type value
precision_at_3
25.967000000000002
type value
precision_at_5
16.5
type value
recall_at_1
59.4
type value
recall_at_10
85.9
type value
recall_at_100
96.0
type value
recall_at_1000
99.1
type value
recall_at_3
77.9
type value
recall_at_5
82.5
task dataset metrics
type
Classification
type name config split revision
C-MTEB/waimai-classification
MTEB Waimai
default
test
339287def212450dcaa9df8c22bf93e9980c7023
type value
accuracy
88.53
type value
ap
73.56216166534062
type value
f1
87.06093694294485
acge model

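The acge model comes from the technical team at Intsig (合合信息). Its public technology trial platform is TextIn, and the open-source code is available on GitHub. Intsig is an industry-leading artificial intelligence and big data technology company. Through core technologies in intelligent text recognition and commercial big data, consumer and business products, and industry solutions, it provides innovative digital and intelligent services to enterprises and individual users worldwide.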

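For technical discussion, contact yanhui_he@intsig.net; for business cooperation, contact simon_liu@intsig.net. You can click the image and scan the QR code to join our WeChat community. If you would like to join Intsig and work on document parsing, document retrieval, or document pre-research, send your résumé to min_du@intsig.net, or add HR on WeChat directly to discuss the role in detail.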

acge is a general-purpose text embedding model that produces variable-dimension vectors. It uses Matryoshka Representation Learning, as illustrated in the figure:

matryoshka-small

The recommended embedding dimension is 1024 or 1792.

| Model Name | Model Size (GB) | Dimension | Sequence Length | Language | Need instruction for retrieval? |
|---|---|---|---|---|---|
| acge-text-embedding | 0.65 | [1024, 1792] | 1024 | Chinese | NO |

Metric

C-MTEB leaderboard (Chinese)

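Results differ slightly from run to run because of randomness in the data, the GPU used, and the inference data type. I ran the evaluation four times in total, on different GPUs (A10, A100) and with different data types. The results are in the result folder, and the run with the lowest score was taken as the final reported result. Following infgrad's suggestion, different input lengths were also tested; a Sequence Length of 512 gave the best result.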

| Model Name | GPU | tensor-type | Model Size (GB) | Dimension | Sequence Length | Average (35) | Classification (9) | Clustering (4) | Pair Classification (2) | Reranking (4) | Retrieval (8) | STS (8) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| acge_text_embedding | NVIDIA TESLA A10 | bfloat16 | 0.65 | 1792 | 1024 | 68.91 | 72.76 | 58.22 | 87.82 | 67.67 | 72.48 | 62.24 |
| acge_text_embedding | NVIDIA TESLA A100 | bfloat16 | 0.65 | 1792 | 1024 | 68.91 | 72.77 | 58.35 | 87.82 | 67.53 | 72.48 | 62.24 |
| acge_text_embedding | NVIDIA TESLA A100 | float16 | 0.65 | 1792 | 1024 | 68.99 | 72.76 | 58.68 | 87.84 | 67.89 | 72.49 | 62.24 |
| acge_text_embedding | NVIDIA TESLA A100 | float32 | 0.65 | 1792 | 1024 | 68.98 | 72.76 | 58.58 | 87.83 | 67.91 | 72.49 | 62.24 |
| acge_text_embedding | NVIDIA TESLA A100 | float16 | 0.65 | 1792 | 768 | 68.95 | 72.76 | 58.68 | 87.84 | 67.86 | 72.48 | 62.07 |
| acge_text_embedding | NVIDIA TESLA A100 | float16 | 0.65 | 1792 | 512 | 69.07 | 72.75 | 58.7 | 87.84 | 67.99 | 72.93 | 62.09 |
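
As a sanity check on the table, the Average (35) column appears to be the task-count-weighted mean of the six category averages. This is an inference from the numbers, not something the README states, and because the category averages are already rounded the match is only approximate. The sketch below recomputes the value for the first row (A10, bfloat16); `weighted_average` is a hypothetical helper name.

```python
# Category averages from the first table row, paired with the number of
# tasks in each category (the counts shown in the column headers).
row = {
    "Classification": (72.76, 9),
    "Clustering": (58.22, 4),
    "Pair Classification": (87.82, 2),
    "Reranking": (67.67, 4),
    "Retrieval": (72.48, 8),
    "STS": (62.24, 8),
}


def weighted_average(categories):
    """Task-count-weighted mean of per-category averages (assumed formula)."""
    total_tasks = sum(count for _, count in categories.values())
    weighted_sum = sum(score * count for score, count in categories.values())
    return weighted_sum / total_tasks


average = weighted_average(row)
print(round(average, 2))  # 68.91, matching the Average (35) column
```

The same calculation on the last row (Sequence Length 512) gives 69.07, which also agrees with the table.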

Reproduce our results

C-MTEB:

import torch
import argparse
import functools
import numpy as np  # required by the np.ndarray type annotations below
from C_MTEB.tasks import *
from typing import List, Dict
from sentence_transformers import SentenceTransformer
from mteb import MTEB, DRESModel


class RetrievalModel(DRESModel):
    def __init__(self, encoder, **kwargs):
        self.encoder = encoder

    def encode_queries(self, queries: List[str], **kwargs) -> np.ndarray:
        input_texts = ['{}'.format(q) for q in queries]
        return self._do_encode(input_texts)

    def encode_corpus(self, corpus: List[Dict[str, str]], **kwargs) -> np.ndarray:
        input_texts = ['{} {}'.format(doc.get('title', ''), doc['text']).strip() for doc in corpus]
        input_texts = ['{}'.format(t) for t in input_texts]
        return self._do_encode(input_texts)

    @torch.no_grad()
    def _do_encode(self, input_texts: List[str]) -> np.ndarray:
        return self.encoder.encode(
            sentences=input_texts,
            batch_size=512,
            normalize_embeddings=True,
            convert_to_numpy=True
        )


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name_or_path', default="acge_text_embedding", type=str)
    parser.add_argument('--task_type', default=None, type=str)
    parser.add_argument('--pooling_method', default='cls', type=str)
    parser.add_argument('--output_dir', default='zh_results',
                        type=str, help='output directory')
    parser.add_argument('--max_len', default=1024, type=int, help='max length')
    return parser.parse_args()


if __name__ == '__main__':
    args = get_args()
    encoder = SentenceTransformer(args.model_name_or_path).half()
    encoder.encode = functools.partial(encoder.encode, normalize_embeddings=True)
    encoder.max_seq_length = int(args.max_len)

    task_names = [t.description["name"] for t in MTEB(task_types=args.task_type,
                                                      task_langs=['zh', 'zh-CN']).tasks]
    TASKS_WITH_PROMPTS = ["T2Retrieval", "MMarcoRetrieval", "DuRetrieval", "CovidRetrieval", "CmedqaRetrieval",
                          "EcomRetrieval", "MedicalRetrieval", "VideoRetrieval"]
    for task in task_names:
        evaluation = MTEB(tasks=[task], task_langs=['zh', 'zh-CN'])
        if task in TASKS_WITH_PROMPTS:
            evaluation.run(RetrievalModel(encoder), output_folder=args.output_dir, overwrite_results=False)
        else:
            evaluation.run(encoder, output_folder=args.output_dir, overwrite_results=False)

Usage

acge Chinese model series

Usage with the sentence-transformers library:

from sentence_transformers import SentenceTransformer

sentences = ["数据1", "数据2"]
model = SentenceTransformer('acge_text_embedding')
print(model.max_seq_length)
embeddings_1 = model.encode(sentences, normalize_embeddings=True)
embeddings_2 = model.encode(sentences, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
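
Because `normalize_embeddings=True` returns unit-length vectors, the matrix product in the snippet above is a cosine-similarity matrix. The following minimal sketch illustrates that equivalence with NumPy only. The random arrays are stand-ins for model output (an assumption, so the example runs without downloading the model), and `l2_normalize` and `cosine_similarity_matrix` are hypothetical helper names.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a_norm = np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = np.linalg.norm(b, axis=1, keepdims=True)
    return (a @ b.T) / (a_norm * b_norm.T)


rng = np.random.default_rng(0)
raw = rng.normal(size=(2, 1792))  # stand-in for unnormalized embeddings
unit = l2_normalize(raw)          # what normalize_embeddings=True would give

dot = unit @ unit.T               # plain dot product of unit vectors
cos = cosine_similarity_matrix(raw, raw)

print(np.allclose(dot, cos))            # True
print(np.allclose(np.diag(dot), 1.0))   # True: each vector matches itself
```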

Usage with the sentence-transformers library, selecting a different dimension:

from sklearn.preprocessing import normalize
from sentence_transformers import SentenceTransformer

sentences = ["数据1", "数据2"]
model = SentenceTransformer('acge_text_embedding')
embeddings = model.encode(sentences, normalize_embeddings=False)
matryoshka_dim = 1024
embeddings = embeddings[..., :matryoshka_dim]  # Shrink the embedding dimensions
embeddings = normalize(embeddings, norm="l2", axis=1)
print(embeddings.shape)
# => (2, 1024)
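
The truncate-then-renormalize step can also be written without scikit-learn. The sketch below is a NumPy-only version under stated assumptions: `truncate_embeddings` is a hypothetical helper, and the random array stands in for the output of `model.encode(..., normalize_embeddings=False)`. Normalization happens after truncation because the leading slice of a unit vector is generally no longer unit length.

```python
import numpy as np


def truncate_embeddings(embeddings, dim):
    """Keep the first `dim` components and rescale each row to unit L2 norm."""
    if dim > embeddings.shape[-1]:
        raise ValueError("dim exceeds the embedding width")
    truncated = embeddings[..., :dim]
    norms = np.linalg.norm(truncated, axis=-1, keepdims=True)
    # Guard against division by zero for an all-zero row.
    return truncated / np.clip(norms, 1e-12, None)


rng = np.random.default_rng(0)
full = rng.normal(size=(2, 1792))  # stand-in for full-width embeddings
small = truncate_embeddings(full, 1024)

print(small.shape)  # (2, 1024)
print(np.allclose(np.linalg.norm(small, axis=1), 1.0))  # True
```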
