<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="description"
content="FormFactory is an interactive benchmarking suite for multimodal form-filling agents, comprising a web-based interface, a backend evaluation module, and a dataset.">
<meta name="keywords" content="FormFactory, form filling, GUI agents, MLLM, benchmark">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents</title>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-PYVRSFMDRL"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag() {
dataLayer.push(arguments);
}
gtag('js', new Date());
gtag('config', 'G-PYVRSFMDRL');
</script>
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
rel="stylesheet">
<link rel="stylesheet" href="./static/css/bulma.min.css">
<link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="./static/css/bulma-slider.min.css">
<link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
<link rel="stylesheet"
href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="./static/css/index.css">
<link rel="icon" href="./images/logo_ffai.svg">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="./static/js/fontawesome.all.min.js"></script>
<script src="./static/js/bulma-carousel.min.js"></script>
<script src="./static/js/bulma-slider.min.js"></script>
<script src="./static/js/index.js"></script>
</head>
<body>
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title">
FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents
</h1>
<div class="is-size-5 publication-authors">
<span class="author-block"><a href="https://www.libobo.site/">Bobo Li</a><sup>#</sup><sup>1</sup>,</span>
<span class="author-block"><a href="https://yooheng416.com/">Yuheng Wang</a><sup>#</sup><sup>2</sup>,</span>
<span class="author-block"><a href="https://haofei.vip/">Hao Fei</a><sup>1</sup>,</span>
<span class="author-block"><a href="https://person.zju.edu.cn/juncheng/">Juncheng Li</a><sup>3</sup>,</span>
<span class="author-block"><a href="https://jiwei0523.github.io/">Wei Ji</a><sup>4</sup>,</span>
<span class="author-block"><a href="https://www.comp.nus.edu.sg/~leeml/">Mong-Li Lee</a><sup>1</sup>,</span>
<span class="author-block"><a href="https://www.comp.nus.edu.sg/~whsu/">Wynne Hsu</a><sup>1</sup></span>
</div>
<div class="is-size-5 publication-authors">
<span class="author-block"><sup>1</sup>National University of Singapore,</span>
<span class="author-block"><sup>2</sup>Wuhan University,</span>
<span class="author-block"><sup>3</sup>Zhejiang University,</span>
<span class="author-block"><sup>4</sup>Nanjing University</span>
</div>
<div class="content has-text-centered" style="margin-bottom: 0.5em;">
<p style="font-size: 0.85em; color: #4a4a4a; margin-bottom: 0.25em;">
<sup>#</sup> indicates equal contribution.
</p>
<div class="publication-links" style="margin-top: 0.3em;">
<!-- PDF Link -->
<span class="link-block">
<a href="https://arxiv.org/pdf/2506.01520"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fas fa-file-pdf"></i>
</span>
<span>Paper</span>
</a>
</span>
<!-- arXiv Link -->
<span class="link-block">
<a href="https://arxiv.org/abs/2506.01520"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="ai ai-arxiv"></i>
</span>
<span>arXiv</span>
</a>
</span>
<!-- Video Link -->
<span class="link-block">
<a href="#demo-video"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-youtube"></i>
</span>
<span>Video</span>
</a>
</span>
<!-- Code Link -->
<span class="link-block">
<a href="https://github.com/formfactory-ai/formfactory"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Code</span>
</a>
</span>
<!-- Data Link -->
<span class="link-block">
<a href="https://github.com/formfactory-ai/formfactory"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="far fa-images"></i>
</span>
<span>Data</span>
</a>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<!-- Overview Figure -->
<div class="columns is-centered">
<div class="column is-full">
<div class="content has-text-justified">
<figure class="image" style="margin: 2rem auto;">
<img src="images/fig1v1_page_1.png" alt="FormFactory Overview" style="max-width: 95%; margin: 0 auto;">
<figcaption class="has-text-centered" style="margin-top: 1rem; color: #4a4a4a; font-size: 0.95em;">
Figure 1: Overview of the form-filling task and its challenges. Compared to general GUI tasks, form-filling involves more diverse applications and demands higher semantic understanding, layout flexibility, and interaction complexity.
</figcaption>
</figure>
</div>
</div>
</div>
<!-- Abstract. -->
<div class="columns is-centered has-text-centered" style="margin-top: 4rem;">
<div class="column is-four-fifths">
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>
Online form filling is a common yet labor-intensive task involving extensive keyboard and mouse interactions.
Despite the long-standing vision of automating this process with "one click," existing tools remain largely rule-based and lack generalizable, generative capabilities.
Recent advances in Multimodal Large Language Models (MLLMs) have enabled promising agents for GUI-related tasks in general-purpose scenarios.
However, they struggle with the unique challenges of form filling, such as flexible layouts and the difficulty of aligning textual instructions with on-screen fields.
To bridge this gap, we formally define the form-filling task and propose FormFactory—an interactive benchmarking suite comprising a web-based interface, backend evaluation module, and carefully constructed dataset.
Our benchmark covers diverse real-world scenarios, incorporates various field formats, and simulates high-fidelity form interactions.
We conduct a comprehensive evaluation of state-of-the-art MLLMs and observe that no model surpasses 5% accuracy, underscoring the inherent difficulty of the task.
These findings also reveal significant limitations in current models' visual layout reasoning and field-value alignment abilities.
We hope our benchmark can serve as a stepping stone for further research into robust, practical form-filling agents.
</p>
</div>
</div>
</div>
<!--/ Abstract. -->
<!-- Benchmark Construction -->
<div class="columns is-centered has-text-centered" style="margin-top: 4rem;">
<div class="column is-full">
<h2 class="title is-3">Benchmark Construction</h2>
<div class="content has-text-justified" style="max-width: 95%; margin: 0 auto 1rem auto;">
<p>
We developed a high-fidelity, browser-based interaction platform using Python and Flask to evaluate the performance of form-filling agents. The platform lets users complete form-filling tasks directly in a web interface, while the backend evaluates each submission in real time by automatically comparing the submitted values against gold-standard field annotations. The platform comprises 25 web forms spanning eight real-world domains, including academia, finance, healthcare, and information technology. It supports a diverse range of input types, such as text fields, dropdown menus, date pickers, checkboxes, file uploads, and numerical inputs.
</p>
</div>
<div class="content has-text-justified" style="max-width: 95%; margin: 0 auto 1rem auto;">
<p>
To simulate the complexity of real-world deployment environments, the platform incorporates a wide variety of page layouts, font styles, and color schemes. It also supports multi-page forms and modular field definitions, allowing for flexible configuration of diverse evaluation scenarios. This platform provides a realistic and compositional UI environment for large-scale, reproducible evaluation of both human annotators and multimodal models.
</p>
</div>
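<div class="content has-text-justified" style="max-width: 95%; margin: 0 auto 1rem auto;">
<p>
The backend comparison described above can be illustrated with a minimal sketch. This is not the platform's actual scorer: the function names, the field names, and the whitespace- and case-insensitive normalization are all assumptions made for illustration.
</p>
<pre>
```python
from typing import Dict


def normalize(value: str) -> str:
    """Collapse whitespace and ignore case (this normalization is an assumption)."""
    return " ".join(value.split()).casefold()


def score_submission(submitted: Dict[str, str], gold: Dict[str, str]) -> Dict[str, object]:
    """Compare submitted form values with gold-standard annotations, field by field."""
    per_field = {
        name: normalize(submitted.get(name, "")) == normalize(expected)
        for name, expected in gold.items()
    }
    correct = sum(per_field.values())
    accuracy = correct / len(gold) if gold else 0.0
    return {"per_field": per_field, "correct": correct, "accuracy": accuracy}


# Hypothetical gold annotation and agent submission for one form instance.
gold = {"full_name": "Ada Lovelace", "position": "Lecturer", "start_date": "2025-09-01"}
submitted = {"full_name": " ada lovelace ", "position": "Professor", "start_date": "2025-09-01"}

report = score_submission(submitted, gold)
print(report["correct"], "of", len(gold), "fields match")
```
</pre>
<p>
A missing field is treated as an empty string, so it counts as an error rather than raising an exception.
</p>
</div>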
<!-- Video -->
<div class="columns is-centered">
<div class="column is-full-width">
<div class="content has-text-centered">
<video id="demo-video" controls autoplay loop muted playsinline style="width: 100%; max-width: 800px; margin: 0 auto;">
<source src="video.mov" type="video/mp4">
Your browser does not support the video tag.
</video>
<p class="caption" style="margin-top: 1rem; color: #4a4a4a; font-size: 0.95em;">
Demo video of the interaction platform.
</p>
</div>
</div>
</div>
<div class="content has-text-justified" style="max-width: 95%; margin: 0 auto 1rem auto;">
<p>
Additionally, we constructed a dataset of 1,250 form-filling instances using a generative approach. Each instance consists of an input document or descriptive text paired with structured field-value annotations. The inputs include real or LLM-generated resumes, academic papers, leave requests, registration forms, and other free-form text formats. We first sample gold-standard field values from the form templates, and then prompt a large language model to generate natural language inputs that implicitly contain these values, thereby simulating realistic user behavior. For example, in job application scenarios, we generate detailed Markdown-formatted resumes, while in academic contexts, metadata is extracted from real research papers.
</p>
</div>
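<div class="content has-text-justified" style="max-width: 95%; margin: 0 auto 1rem auto;">
<p>
The two-step generation procedure (sample gold values, then prompt a language model) might look roughly like the sketch below. The template contents, field names, and prompt wording are hypothetical, and the call to the language model itself is deliberately left out.
</p>
<pre>
```python
import random
from typing import Dict, List

# Hypothetical form template: each field lists candidate gold values.
TEMPLATE: Dict[str, List[str]] = {
    "applicant_name": ["Mei Tan", "Jonas Weber", "Priya Nair"],
    "position": ["Assistant Professor", "Research Fellow"],
    "start_date": ["2025-08-01", "2026-01-15"],
}


def sample_gold_values(template: Dict[str, List[str]], rng: random.Random) -> Dict[str, str]:
    """Step 1: draw one gold value per field from the template."""
    return {field: rng.choice(options) for field, options in template.items()}


def build_generation_prompt(gold: Dict[str, str], document_type: str) -> str:
    """Step 2: ask a language model for a document that implicitly contains every gold value."""
    facts = "\n".join(f"- {field}: {value}" for field, value in gold.items())
    return (
        f"Write a realistic {document_type} in Markdown.\n"
        "Work the following facts into the text naturally, "
        "without listing them as form fields:\n"
        f"{facts}\n"
    )


rng = random.Random(0)  # fixed seed so the sampled instance is reproducible
gold = sample_gold_values(TEMPLATE, rng)
prompt = build_generation_prompt(gold, "resume")

# One dataset instance pairs the (to-be-generated) input text with its annotations.
instance = {"generation_prompt": prompt, "annotations": gold}
print(prompt)
```
</pre>
<p>
Sampling the annotations first means the gold labels are known by construction, so the generated text never has to be labeled after the fact.
</p>
</div>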
<div class="content has-text-justified" style="max-width: 95%; margin: 0 auto 1rem auto;">
<p>
The final dataset comprises 13,800 field-value pairs, encompassing a wide variety of field types and input modalities. It presents significant challenges for models in terms of language understanding and layout reasoning. Detailed statistics of the dataset are provided in Table 1.
</p>
</div>
<div class="content" style="overflow-x: auto;">
<style>
.stats-table {
width: 100%;
border-collapse: collapse;
margin: 0 0 20px 0;
font-size: 0.9em;
box-shadow: 0 2px 15px rgba(0,0,0,0.05);
border-radius: 8px;
overflow: hidden;
}
.stats-table th, .stats-table td {
padding: 12px;
border: 1px solid #f0f0f0;
text-align: center;
vertical-align: middle;
}
.stats-table th {
background-color: #f8f9fa;
color: #495057;
font-weight: 600;
text-transform: uppercase;
font-size: 0.85em;
letter-spacing: 0.5px;
}
.stats-table .category-cell {
background-color: #f1f3f5;
color: #495057;
font-weight: 700;
text-align: center;
vertical-align: middle;
padding: 15px;
font-size: 1.1em;
font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", Arial, sans-serif;
}
.stats-table td {
background-color: white;
color: #495057;
}
.stats-table tr:nth-child(even) td:not(.category-cell) {
background-color: #fbfbfd;
}
.stats-table tr:hover td:not(.category-cell) {
background-color: #f8f9fa;
}
.table-container {
max-width: 100%;
overflow-x: auto;
border-radius: 8px;
margin: 20px 0;
}
.stats-table td:not(.category-cell) {
transition: background-color 0.3s ease;
}
.stats-table td:not(:empty):not(.category-cell):not(.numeric-cell) {
color: #2ecc71;
font-weight: bold;
}
.numeric-cell {
font-family: 'Monaco', monospace;
color: #6c757d;
}
/* Colors for the different Form Type categories */
.form-type-academic {
color: #2ecc71 !important; /* soft green */
font-weight: 500;
}
.form-type-business {
color: #e74c3c !important; /* soft red */
font-weight: 500;
}
.form-type-arts {
color: #9b59b6 !important; /* soft purple */
font-weight: 500;
}
.form-type-tech {
color: #3498db !important; /* soft blue */
font-weight: 500;
}
.form-type-finance {
color: #f1c40f !important; /* soft yellow */
font-weight: 500;
}
.form-type-healthcare {
color: #1abc9c !important; /* soft teal */
font-weight: 500;
}
.form-type-legal {
color: #e67e22 !important; /* soft orange */
font-weight: 500;
}
.form-type-construction {
color: #34495e !important; /* soft dark blue */
font-weight: 500;
}
/* Category background colors */
.category-academic {
background-color: rgba(46, 204, 113, 0.1) !important;
color: #27ae60 !important;
}
.category-business {
background-color: rgba(231, 76, 60, 0.1) !important;
color: #c0392b !important;
}
.category-arts {
background-color: rgba(155, 89, 182, 0.1) !important;
color: #8e44ad !important;
}
.category-tech {
background-color: rgba(52, 152, 219, 0.1) !important;
color: #2980b9 !important;
}
.category-finance {
background-color: rgba(241, 196, 15, 0.1) !important;
color: #d4ac0d !important;
}
.category-healthcare {
background-color: rgba(26, 188, 156, 0.1) !important;
color: #16a085 !important;
}
.category-legal {
background-color: rgba(230, 126, 34, 0.1) !important;
color: #d35400 !important;
}
.category-construction {
background-color: rgba(52, 73, 94, 0.1) !important;
color: #2c3e50 !important;
}
</style>
<div class="table-container">
<table class="stats-table">
<thead>
<tr>
<th>Category</th>
<th>Form Type</th>
<th>Fields</th>
<th>Samples</th>
<th>Total Fields</th>
<th>Field Types</th>
<th>Text</th>
<th>Long Text</th>
<th>Number</th>
<th>Date</th>
<th>Selection</th>
<th>Time</th>
<th>File</th>
<th>URL</th>
<th>Required</th>
</tr>
</thead>
<tbody>
<!-- Academic & Research -->
<tr>
<td class="category-cell category-academic" rowspan="5">Academic & Research</td>
<td class="form-type-academic">Job Application for University Positions</td>
<td class="numeric-cell">4</td><td class="numeric-cell">50</td><td class="numeric-cell">200</td><td class="numeric-cell">2</td>
<td></td><td></td><td></td><td>✓</td><td></td><td></td><td></td><td></td><td></td>
</tr>
<tr>
<td class="form-type-academic">Grant or Research Funding Application</td>
<td class="numeric-cell">6</td><td class="numeric-cell">50</td><td class="numeric-cell">300</td><td class="numeric-cell">5</td>
<td>✓</td><td>✓</td><td></td><td></td><td>✓</td><td></td><td>✓</td><td></td><td></td>
</tr>
<tr>
<td class="form-type-academic">Paper Submission Form</td>
<td class="numeric-cell">7</td><td class="numeric-cell">50</td><td class="numeric-cell">300</td><td class="numeric-cell">3</td>
<td></td><td></td><td>✓</td><td></td><td>✓</td><td></td><td></td><td></td><td></td>
</tr>
<tr>
<td class="form-type-academic">Student Course Registration Form</td>
<td class="numeric-cell">8</td><td class="numeric-cell">50</td><td class="numeric-cell">400</td><td class="numeric-cell">4</td>
<td></td><td></td><td>✓</td><td>✓</td><td></td><td>✓</td><td></td><td></td><td></td>
</tr>
<tr>
<td class="form-type-academic">Scholarship Application for Students</td>
<td class="numeric-cell">16</td><td class="numeric-cell">50</td><td class="numeric-cell">800</td><td class="numeric-cell">4</td>
<td></td><td></td><td>✓</td><td>✓</td><td>✓</td><td></td><td></td><td></td><td>✓</td>
</tr>
<!-- Professional & Business -->
<tr>
<td class="category-cell category-business" rowspan="4">Professional & Business</td>
<td class="form-type-business">Startup Funding Application</td>
<td class="numeric-cell">18</td><td class="numeric-cell">50</td><td class="numeric-cell">900</td><td class="numeric-cell">6</td>
<td>✓</td><td></td><td>✓</td><td>✓</td><td>✓</td><td></td><td></td><td>✓</td><td>✓</td>
</tr>
<tr>
<td class="form-type-business">Real Estate Rental Application</td>
<td class="numeric-cell">22</td><td class="numeric-cell">50</td><td class="numeric-cell">1,100</td><td class="numeric-cell">6</td>
<td>✓</td><td></td><td>✓</td><td>✓</td><td>✓</td><td></td><td></td><td>✓</td><td>✓</td>
</tr>
<tr>
<td class="form-type-business">Educational Workshop Registration</td>
<td class="numeric-cell">17</td><td class="numeric-cell">50</td><td class="numeric-cell">850</td><td class="numeric-cell">4</td>
<td>✓</td><td></td><td>✓</td><td>✓</td><td></td><td></td><td></td><td></td><td>✓</td>
</tr>
<tr>
<td class="form-type-business">Association Membership Application</td>
<td class="numeric-cell">20</td><td class="numeric-cell">50</td><td class="numeric-cell">1,000</td><td class="numeric-cell">6</td>
<td>✓</td><td></td><td>✓</td><td>✓</td><td>✓</td><td></td><td>✓</td><td></td><td>✓</td>
</tr>
<!-- Arts & Creative -->
<tr>
<td class="category-cell category-arts" rowspan="3">Arts & Creative</td>
<td class="form-type-arts">Art Exhibition Submission Form</td>
<td class="numeric-cell">11</td><td class="numeric-cell">50</td><td class="numeric-cell">550</td><td class="numeric-cell">6</td>
<td></td><td></td><td>✓</td><td>✓</td><td>✓</td><td></td><td>✓</td><td>✓</td><td>✓</td>
</tr>
<tr>
<td class="form-type-arts">Literary Magazine Submission Form</td>
<td class="numeric-cell">11</td><td class="numeric-cell">50</td><td class="numeric-cell">550</td><td class="numeric-cell">5</td>
<td></td><td></td><td>✓</td><td>✓</td><td>✓</td><td></td><td>✓</td><td></td><td>✓</td>
</tr>
<tr>
<td class="form-type-arts">Conference Speaker Application Form</td>
<td class="numeric-cell">14</td><td class="numeric-cell">50</td><td class="numeric-cell">700</td><td class="numeric-cell">6</td>
<td></td><td>✓</td><td>✓</td><td>✓</td><td>✓</td><td></td><td>✓</td><td></td><td>✓</td>
</tr>
<!-- Technology & Software -->
<tr>
<td class="category-cell category-tech" rowspan="2">Technology & Software</td>
<td class="form-type-tech">Bug Reporting Form</td>
<td class="numeric-cell">10</td><td class="numeric-cell">50</td><td class="numeric-cell">500</td><td class="numeric-cell">4</td>
<td></td><td></td><td>✓</td><td>✓</td><td>✓</td><td></td><td></td><td></td><td>✓</td>
</tr>
<tr>
<td class="form-type-tech">IT Support Request Form</td>
<td class="numeric-cell">11</td><td class="numeric-cell">50</td><td class="numeric-cell">550</td><td class="numeric-cell">5</td>
<td></td><td></td><td>✓</td><td>✓</td><td>✓</td><td></td><td></td><td>✓</td><td>✓</td>
</tr>
<!-- Finance & Banking -->
<tr>
<td class="category-cell category-finance" rowspan="3">Finance & Banking</td>
<td class="form-type-finance">Personal Loan Application Form</td>
<td class="numeric-cell">7</td><td class="numeric-cell">50</td><td class="numeric-cell">350</td><td class="numeric-cell">3</td>
<td></td><td></td><td>✓</td><td></td><td></td><td></td><td></td><td>✓</td><td></td>
</tr>
<tr>
<td class="form-type-finance">Bank Account Opening Form</td>
<td class="numeric-cell">5</td><td class="numeric-cell">50</td><td class="numeric-cell">250</td><td class="numeric-cell">3</td>
<td>✓</td><td></td><td>✓</td><td></td><td></td><td></td><td></td><td></td><td></td>
</tr>
<tr>
<td class="form-type-finance">Financial Planning Consultation Form</td>
<td class="numeric-cell">6</td><td class="numeric-cell">50</td><td class="numeric-cell">300</td><td class="numeric-cell">4</td>
<td>✓</td><td></td><td>✓</td><td>✓</td><td></td><td></td><td></td><td></td><td></td>
</tr>
<!-- Healthcare & Medical -->
<tr>
<td class="category-cell category-healthcare" rowspan="3">Healthcare & Medical</td>
<td class="form-type-healthcare">Patient Consent for Surgery</td>
<td class="numeric-cell">8</td><td class="numeric-cell">50</td><td class="numeric-cell">400</td><td class="numeric-cell">3</td>
<td>✓</td><td></td><td></td><td></td><td></td><td></td><td>✓</td><td></td><td></td>
</tr>
<tr>
<td class="form-type-healthcare">Medical Research Study Enrollment</td>
<td class="numeric-cell">8</td><td class="numeric-cell">50</td><td class="numeric-cell">400</td><td class="numeric-cell">4</td>
<td></td><td></td><td>✓</td><td>✓</td><td></td><td></td><td></td><td>✓</td><td></td>
</tr>
<tr>
<td class="form-type-healthcare">Health Insurance Claim Form</td>
<td class="numeric-cell">10</td><td class="numeric-cell">50</td><td class="numeric-cell">400</td><td class="numeric-cell">5</td>
<td>✓</td><td></td><td>✓</td><td>✓</td><td>✓</td><td></td><td></td><td></td><td>✓</td>
</tr>
<!-- Legal & Compliance -->
<tr>
<td class="category-cell category-legal" rowspan="3">Legal & Compliance</td>
<td class="form-type-legal">NDA Submission Form</td>
<td class="numeric-cell">9</td><td class="numeric-cell">50</td><td class="numeric-cell">450</td><td class="numeric-cell">6</td>
<td>✓</td><td>✓</td><td></td><td>✓</td><td></td><td></td><td>✓</td><td>✓</td><td></td>
</tr>
<tr>
<td class="form-type-legal">Background Check Auth. Form</td>
<td class="numeric-cell">11</td><td class="numeric-cell">50</td><td class="numeric-cell">550</td><td class="numeric-cell">4</td>
<td>✓</td><td></td><td></td><td>✓</td><td></td><td></td><td>✓</td><td></td><td></td>
</tr>
<tr>
<td class="form-type-legal">Contractor Onboarding Form</td>
<td class="numeric-cell">14</td><td class="numeric-cell">50</td><td class="numeric-cell">700</td><td class="numeric-cell">6</td>
<td>✓</td><td></td><td>✓</td><td>✓</td><td>✓</td><td></td><td>✓</td><td></td><td>✓</td>
</tr>
<!-- Construction & Manufacturing -->
<tr>
<td class="category-cell category-construction" rowspan="2">Construction & Manufacturing</td>
<td class="form-type-construction">Project Bid Submission Form</td>
<td class="numeric-cell">13</td><td class="numeric-cell">50</td><td class="numeric-cell">650</td><td class="numeric-cell">5</td>
<td>✓</td><td></td><td></td><td>✓</td><td>✓</td><td></td><td></td><td>✓</td><td>✓</td>
</tr>
<tr>
<td class="form-type-construction">Manufacturing Order Form</td>
<td class="numeric-cell">13</td><td class="numeric-cell">50</td><td class="numeric-cell">650</td><td class="numeric-cell">5</td>
<td>✓</td><td></td><td>✓</td><td>✓</td><td>✓</td><td></td><td></td><td></td><td>✓</td>
</tr>
<!-- Overall -->
<tr>
<td class="category-cell" colspan="2">Overall</td>
<td class="numeric-cell">279</td><td class="numeric-cell">1,250</td><td class="numeric-cell">13,800</td><td class="numeric-cell">9</td>
<td>✓</td><td>✓</td><td>✓</td><td>✓</td><td>✓</td><td>✓</td><td>✓</td><td>✓</td><td>✓</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
<div class="content has-text-centered">
<p class="table-caption" style="text-align: center; font-style: italic; margin-top: 1rem;">
Table 1: Form field statistics across domains.
"Total Fields" refers to the total number of field-value pairs for each form type.
<span style="color: #2ecc71;">✓</span> indicates the presence of a field type.
</p>
</div>
<!-- LLM-Driven Form Filling -->
<div class="columns is-centered has-text-centered" style="margin-top: 4rem;">
<div class="column is-full">
<h2 class="title is-3">Lightweight MLLM-driven Framework</h2>
<div class="content has-text-justified" style="max-width: 95%; margin: 0 auto 1rem auto;">
<p>
To enable automated execution and evaluation of form-filling tasks, we develop a lightweight MLLM-driven framework, shown in Figure 2. The system has three main components: a web-based form frontend, a backend scorer, and an agent execution module. Given a form and an input document (e.g., a resume), the MLLM generates a sequence of GUI actions, such as Click(x, y) and Type(text), to fill the form. These actions are executed automatically using tools like PyAutoGUI. After submission, the backend scorer compares the filled entries with ground-truth annotations and produces a detailed evaluation report, including field accuracy, action match rate, and overall task score. This framework supports scalable and fine-grained analysis of model performance in realistic form-filling tasks.
</p>
</div>
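<div class="content has-text-justified" style="max-width: 95%; margin: 0 auto 1rem auto;">
<p>
As a rough illustration of the agent execution module, the sketch below parses model output into Click and Type actions and replays them through injected callables. The one-action-per-line text format, the class names, and the function names are assumptions; the framework's real action grammar may differ.
</p>
<pre>
```python
import re
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Click:
    x: int
    y: int


@dataclass
class TypeText:
    text: str


Action = Union[Click, TypeText]

# Assumed textual format of the model output: one action per line.
CLICK_RE = re.compile(r"^Click\(\s*(\d+)\s*,\s*(\d+)\s*\)$")
TYPE_RE = re.compile(r"^Type\((.*)\)$")


def parse_actions(model_output: str) -> List[Action]:
    """Turn lines such as 'Click(412, 230)' or 'Type("Ada")' into action objects."""
    actions: List[Action] = []
    for raw_line in model_output.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        click = CLICK_RE.match(line)
        if click:
            actions.append(Click(int(click.group(1)), int(click.group(2))))
            continue
        typed = TYPE_RE.match(line)
        if typed:
            # Drop optional surrounding quotes around the typed text.
            actions.append(TypeText(typed.group(1).strip().strip("\"'")))
            continue
        raise ValueError(f"Unrecognized action: {line!r}")
    return actions


def execute(actions: List[Action],
            click_fn: Callable[[int, int], None],
            type_fn: Callable[[str], None]) -> None:
    """Replay actions in order through the supplied click and typing callables."""
    for action in actions:
        if isinstance(action, Click):
            click_fn(action.x, action.y)
        else:
            type_fn(action.text)


# Dry run: record the actions instead of driving a real mouse and keyboard.
log = []
actions = parse_actions('Click(412, 230)\nType("Ada Lovelace")\nClick(412, 310)\nType(Lecturer)')
execute(actions,
        lambda x, y: log.append(("click", x, y)),
        lambda text: log.append(("type", text)))
print(log)
```
</pre>
<p>
On a machine with a display, <code>pyautogui.click</code> and <code>pyautogui.write</code> could be passed as <code>click_fn</code> and <code>type_fn</code>; keeping them injectable lets the parser be tested without moving the real cursor.
</p>
</div>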
<div class="content has-text-justified">
<figure class="image" style="margin: 2rem auto;">
<img src="images/systemoverview_page_1.png" alt="LLM-Driven Form Filling Overview" style="max-width: 95%; margin: 0 auto;">
<figcaption class="has-text-centered" style="margin-top: 1rem; color: #4a4a4a; font-size: 0.95em;">
Figure 2: Overview of the form-filling task system. The platform takes a form and a resume as input, prompting an MLLM to generate GUI actions (Click, Type) for form completion. Execution and scoring modules evaluate task performance.
</figcaption>
</figure>
</div>
</div>
</div>
<!-- Model Performance Evaluation -->
<section class="section" style="padding-top: 2rem;">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column is-full">
<h2 class="title is-3 has-text-centered" style="margin-bottom: 2rem;">Model Performance</h2>
<div class="content has-text-justified">
<p style="margin-bottom: 1rem;">
We evaluate several state-of-the-art MLLMs (e.g., GPT-4o, Gemini 2.5 Pro, Claude 3.7 Sonnet) via their public APIs without task-specific fine-tuning, using an automated browser interface implemented with PyAutoGUI on a Windows platform. This setup enables fair assessment of the models' inherent capabilities in visual grounding, spatial reasoning, and field-value alignment. The evaluation includes both Atomic tasks (single field types) and Episodic tasks (end-to-end form filling), measuring fine-grained interaction and multi-step reasoning performance. We report two metrics: Click (UI interaction accuracy) and Value (field content accuracy), with BLEU used for generative Description fields.
</p>
</div>
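<div class="content has-text-justified">
<p>
To make the two metrics concrete, here is a minimal sketch of how Click and Value scores could be computed. It assumes a click counts as correct when it lands inside the target field's bounding box and that Value uses exact string match after trimming; both rules, along with the <code>Box</code> class and the coordinates, are illustrative assumptions rather than the benchmark's documented scoring.
</p>
<pre>
```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Screen-space bounding box of a form field in pixels (hypothetical layout data)."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int) -> bool:
        return self.right >= x >= self.left and self.bottom >= y >= self.top


def click_accuracy(clicks: List[Tuple[int, int]], targets: List[Box]) -> float:
    """Percentage of clicks that land inside their target field's box."""
    if not targets:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(clicks, targets))
    return 100.0 * hits / len(targets)


def value_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Percentage of fields whose predicted value matches the gold value exactly."""
    if not gold:
        return 0.0
    matches = sum(p.strip() == g.strip() for p, g in zip(predicted, gold))
    return 100.0 * matches / len(gold)


# A narrow single-line text field and a taller description area.
targets = [Box(100, 200, 500, 230), Box(100, 260, 500, 420)]
clicks = [(520, 215), (300, 400)]  # first click misses by 20 px, second lands inside

click_score = click_accuracy(clicks, targets)
value_score = value_accuracy(["Ada Lovelace", "Short bio"], ["Ada Lovelace", "A short bio"])
print(click_score, value_score)
```
</pre>
<p>
Under this hit-test assumption, a larger target box accepts less precise clicks, and the two scores are independent: a model can produce the right value while clicking in the wrong place.
</p>
</div>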
</div>
</div>
<!-- Original model evaluation table content -->
<div class="columns is-centered has-text-centered">
<div class="column is-full">
<style>
.model-table {
width: 100%;
border-collapse: separate;
border-spacing: 0;
margin: 10px 0;
font-size: 0.9em;
box-shadow: 0 4px 20px rgba(0,0,0,0.05);
border-radius: 12px;
overflow: hidden;
background: white;
}
.model-table th, .model-table td {
padding: 16px 12px;
text-align: center;
border-bottom: 1px solid #eef2f7;
font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", Arial, sans-serif;
}
.model-table th {
background-color: #f8fafc;
color: #475569;
font-weight: 600;
font-size: 0.85em;
letter-spacing: 0.5px;
text-transform: uppercase;
border-bottom: 2px solid #e2e8f0;
}
.model-table tr:nth-child(even) {
background-color: #fafbff;
}
.model-table tr:hover {
background-color: #f8fafc;
transition: background-color 0.2s ease;
}
.model-table .model-name {
text-align: left;
font-weight: 600;
color: #334155;
padding-left: 20px;
padding-right: 20px;
width: 220px;
white-space: nowrap;
}
.model-table .group-header {
background-color: #f1f5f9;
}
.model-table .group-header th {
padding: 14px 12px;
font-weight: 700;
color: #334155;
font-size: 0.9em;
}
.metric-header {
font-size: 0.8em;
color: #64748b;
font-weight: 500;
}
.model-table td:not(.model-name) {
font-family: 'Monaco', monospace;
color: #475569;
width: 60px;
}
.table-caption {
color: #475569;
font-size: 0.95em;
line-height: 1.6;
margin: 1.5rem auto;
max-width: 95%;
padding: 0 1rem;
text-align: left;
}
.table-container {
margin: 2rem 0;
padding: 0 1rem;
}
</style>
<div class="table-container">
<table class="model-table">
<thead>
<tr class="group-header">
<th rowspan="2" style="border-right: 1px solid #e2e8f0;">Model</th>
<th colspan="2">String</th>
<th colspan="2">Drop-down List</th>
<th colspan="2">Checkbox</th>
<th colspan="2">Radio Button</th>
<th colspan="2">Description</th>
<th colspan="2">Date</th>
<th colspan="2">Check</th>
<th colspan="2">Episodic</th>
</tr>
<tr>
<th class="metric-header">Click</th>
<th class="metric-header">Value</th>
<th class="metric-header">Click</th>
<th class="metric-header">Value</th>
<th class="metric-header">Click</th>
<th class="metric-header">Value</th>
<th class="metric-header">Click</th>
<th class="metric-header">Value</th>
<th class="metric-header">Click</th>
<th class="metric-header">Value</th>
<th class="metric-header">Click</th>
<th class="metric-header">Value</th>
<th class="metric-header">Click</th>
<th class="metric-header">Value</th>
<th class="metric-header">Click</th>
<th class="metric-header">Value</th>
</tr>
</thead>
<tbody>
<tr>
<td class="model-name">GPT-4o</td>
<td>2.2</td><td>17.5</td>
<td>0.0</td><td>30.7</td>
<td>0.0</td><td>31.3</td>
<td>0.0</td><td>10.0</td>
<td>8.8</td><td>0.84</td>
<td>0.0</td><td>2.8</td>
<td>0.0</td><td>9.8</td>
<td>0.9</td><td>11.3</td>
</tr>
<tr>
<td class="model-name">Gemini 2.5 Pro</td>
<td>0.9</td><td>98.7</td>
<td>0.0</td><td>99.0</td>
<td>0.0</td><td>76.1</td>
<td>0.0</td><td>52.6</td>
<td>8.1</td><td>0.72</td>
<td>0.0</td><td>99.7</td>
<td>0.0</td><td>79.8</td>
<td>0.4</td><td>70.7</td>
</tr>
<tr>
<td class="model-name">Claude 3.7 Sonnet</td>
<td>0.0</td><td>95.2</td>
<td>0.0</td><td>66.2</td>
<td>0.0</td><td>72.0</td>
<td>0.0</td><td>97.9</td>
<td>0.0</td><td>0.70</td>
<td>0.0</td><td>99.1</td>
<td>0.0</td><td>55.7</td>
<td>0.0</td><td>58.0</td>
</tr>
<tr>
<td class="model-name">Qwen-VL-Max</td>
<td>4.6</td><td>97.1</td>
<td>1.7</td><td>98.9</td>
<td>0.0</td><td>91.8</td>
<td>0.0</td><td>99.0</td>
<td>11.1</td><td>0.74</td>
<td>0.0</td><td>98.6</td>
<td>0.0</td><td>71.4</td>
<td>1.1</td><td>72.7</td>
</tr>
<tr>
<td class="model-name">Grok 3</td>
<td>0.0</td><td>96.2</td>
<td>0.0</td><td>91.2</td>
<td>0.0</td><td>92.4</td>
<td>0.0</td><td>98.1</td>
<td>5.9</td><td>0.71</td>
<td>0.0</td><td>97.9</td>
<td>0.0</td><td>75.5</td>
<td>0.0</td><td>70.7</td>
</tr>
<tr>
<td class="model-name">Doubao-vision-pro-32k</td>
<td>0.0</td><td>94.2</td>
<td>0.0</td><td>89.7</td>
<td>0.0</td><td>38.6</td>
<td>0.0</td><td>96.9</td>
<td>0.0</td><td>0.51</td>
<td>0.0</td><td>92.1</td>
<td>0.0</td><td>69.9</td>
<td>0.0</td><td>64.7</td>
</tr>
</tbody>
</table>
</div>
<p class="table-caption" style="text-align: center; font-style: italic; margin-top: 1rem;">
Table 2: Atomic-level and episodic-level evaluation of MLLMs across field types, using the Click (Clk.) and Value (Val.) metrics. GPT-4o often refuses to execute the task despite its strong capabilities, which lowers its atomic scores. The relatively high Click accuracy on Description fields stems from their large input area, which tolerates less precise clicks. Episodic results measure end-to-end form completion accuracy.
</p>
<div class="content has-text-justified">
<p style="margin-bottom: 1rem;">
Existing MLLMs still struggle to complete form-filling tasks reliably. Improving spatial reasoning and field alignment remains critical for enabling practical, GUI-driven agents in office automation scenarios.
</p>
</div>
</div>
</div>
<!-- Example Forms -->
<div class="columns is-centered">
<div class="column is-full">
<div class="content has-text-justified">
<p style="margin-bottom: 2rem;">
To better understand model performance beyond value prediction, we also analyze click accuracy on forms with varying numbers of fields. Click accuracy declines as the number of fields increases, a trend largely driven by higher visual complexity and denser layouts. Whereas value accuracy is reported for all field types, we report click accuracy only for String and Description fields, because click accuracy on other types such as checkboxes and drop-down lists stays consistently close to zero. Even with this filtering, click accuracy remains low for all models, which highlights the difficulty of precise pixel-level interaction and spatial grounding, even for simple text inputs.
</p>
<p style="margin-bottom: 2rem;">
We further explore how field count and field type interact, since they often co-occur in real-world scenarios and jointly contribute to task difficulty. To capture this, we conduct a 3D analysis that examines model performance across both variables. Models perform best when both field count and type complexity are low and worst when both are high, which suggests a strong compounding effect. This pattern supports the design of our benchmark, which intentionally includes forms with both numerous and diverse fields to stress-test the spatial reasoning, alignment, and grounding capabilities of current MLLMs.
</p>
<style>
.form-image-container .column {
padding: 0.5rem; /* reduce column spacing */
}
.form-image-container .columns {
margin-top: 1.5rem !important; /* reduce row spacing */
}
.form-image-container .image img {
max-width: 90%; /* shrink the image width slightly */
margin: 0 auto; /* center the image */
display: block;
}
.form-image-container figcaption {
margin-top: 0.5rem !important; /* reduce the caption's top margin */
font-size: 0.9em; /* slightly reduce the caption font size */
}
</style>
<div class="form-image-container">
<!-- First Row -->
<div class="columns is-centered">
<div class="column">
<figure class="image">
<img src="images/3_page_1.png" alt="Value accuracy across varying field counts">
<figcaption class="has-text-centered">
Figure 3: Value accuracy across varying field counts (smoothed with a window size of 3)
</figcaption>
</figure>
</div>
<div class="column">
<figure class="image">
<img src="images/4_page_1.png" alt="Value accuracy across varying field types">
<figcaption class="has-text-centered">
Figure 4: Value accuracy across varying field types
</figcaption>
</figure>
</div>
</div>
<!-- Second Row -->
<div class="columns is-centered">
<div class="column">
<figure class="image">
<img src="images/5_page_1.png" alt="Click accuracy across varying field counts">
<figcaption class="has-text-centered">
Figure 5: Click accuracy across varying field counts (smoothed with a window size of 3)
</figcaption>
</figure>
</div>
<div class="column">
<figure class="image">
<img src="images/6_page_1.png" alt="Click accuracy across varying field types">
<figcaption class="has-text-centered">
Figure 6: Click accuracy across varying field types
</figcaption>
</figure>
</div>
</div>
<div class="columns is-centered">
<div class="column">
<figure class="image">
<img src="images/7_page_1.png" alt="Click accuracy under joint variation of field count and field types for Claude 3.7 Sonnet">
<figcaption class="has-text-centered">
Figure 7: Click accuracy under joint variation of field count and field types for Claude 3.7 Sonnet
</figcaption>
</figure>
</div>
<div class="column">
<figure class="image">
<img src="images/8_page_1.png" alt="Click accuracy under joint variation of field count and field types for Qwen-VL-Max">
<figcaption class="has-text-centered">
Figure 8: Click accuracy under joint variation of field count and field types for Qwen-VL-Max
</figcaption>
</figure>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
</div>
</section>
<section class="section" id="BibTeX">
<div class="container is-max-desktop content">
<h2 class="title">BibTeX</h2>
<pre><code>@misc{li2025formfactory,
title={FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents},
author={Bobo Li and Yuheng Wang and Hao Fei and Juncheng Li and Wei Ji and Mong-Li Lee and Wynne Hsu},
year={2025},
eprint={2506.01520},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.01520},
}</code></pre>
</div>
</section>
<footer class="footer">
<div class="columns is-centered">
<div class="column is-8">
<div class="content has-text-centered">
<p>
This webpage is based on the <a
href="https://github.com/nerfies/nerfies.github.io">Nerfies</a> project page template.
</p>
</div>
</div>
</div>
</div>
</footer>
</main>
</body>
</html>