crawler.aro
(* ============================================================
   ARO Web Crawler - Crawl Logic
   Handles the CrawlPage event: fetches the URL, converts the HTML
   to markdown, and emits SavePage and ExtractLinks events.
   ============================================================ *)
(Crawl Page: CrawlPage Handler) {
    (* Typed event extraction - validates against CrawlPageEvent schema *)
    Extract the <event-data: CrawlPageEvent> from the <event>.

    Log "Crawling: ${<event-data: url>}" to the <console> when <env: DEBUG> == "1".

    (* Fetch the page *)
    Request the <response> from the <event-data: url>.
    Extract the <html> from the <response: body>.

    (* Extract markdown content from HTML using ParseHtml action *)
    ParseHtml the <markdown-result: markdown> from the <html>.
    Extract the <title> from the <markdown-result: title>.
    Extract the <markdown-content> from the <markdown-result: markdown>.

    (* Save the markdown content to file *)
    Emit a <SavePage: event> with { url: <event-data: url>, title: <title>, content: <markdown-content>, base: <event-data: base> }.

    (* Extract links from the HTML *)
    Emit a <ExtractLinks: event> with { url: <event-data: url>, html: <html>, base: <event-data: base> }.

    Return an <OK: status> for the <crawl>.
}
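
(* Usage sketch (an assumption, not taken from the rest of the crawler):
   another feature set could trigger this handler by emitting a CrawlPage
   event carrying the two fields read above, url and base. The example
   URL is hypothetical, and the statement simply mirrors the Emit form
   already used in this file:

   Emit a <CrawlPage: event> with { url: "https://example.com/docs", base: "https://example.com" }.
*)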