1 change: 1 addition & 0 deletions .gitignore
@@ -125,3 +125,4 @@ testrun.log
 /Examples/AuditLogDemo/AuditLogDemo
 /Examples/WebSocketDemo/WebSocketDemo
 /Examples/StateMachine/StateMachine
+core
146 changes: 73 additions & 73 deletions Book/AROByExample/AppendixA-CompleteCode.md
@@ -16,27 +16,27 @@ The application entry point. Reads the starting URL and kicks off the crawl.
 ============================================================ *)
 
 (Application-Start: Web Crawler) {
-<Log> "Starting Web Crawler..." to the <console>.
+Log "Starting Web Crawler..." to the <console>.
 
 (* Read starting URL from environment *)
-<Extract> the <start-url> from the <env: CRAWL_URL>.
+Extract the <start-url> from the <env: CRAWL_URL>.
 
-<Log> "Starting URL: ${<start-url>}" to the <console>.
+Log "Starting URL: ${<start-url>}" to the <console>.
 
 (* Create output directory *)
-<Create> the <output-path> with "./output".
-<Make> the <output-dir> to the <directory: output-path>.
-<Log> "Output directory created" to the <console>.
+Create the <output-path> with "./output".
+Make the <output-dir> to the <directory: output-path>.
+Log "Output directory created" to the <console>.
 
 (* Queue initial URL - Emit blocks until the entire crawl chain completes *)
-<Emit> a <QueueUrl: event> with { url: <start-url>, base: <start-url> }.
+Emit a <QueueUrl: event> with { url: <start-url>, base: <start-url> }.
 
-<Return> an <OK: status> for the <startup>.
+Return an <OK: status> for the <startup>.
 }
 
 (Application-End: Success) {
-<Log> "🥁 Web Crawler completed!" to the <console>.
-<Return> an <OK: status> for the <shutdown>.
+Log "🥁 Web Crawler completed!" to the <console>.
+Return an <OK: status> for the <shutdown>.
 }
 ```

@@ -56,27 +56,27 @@ The core crawling logic. Fetches pages and triggers downstream processing.
 
 (Crawl Page: CrawlPage Handler) {
 (* Extract from event data *)
-<Extract> the <event-data> from the <event: data>.
-<Extract> the <url> from the <event-data: url>.
-<Extract> the <base-domain> from the <event-data: base>.
+Extract the <event-data> from the <event: data>.
+Extract the <url> from the <event-data: url>.
+Extract the <base-domain> from the <event-data: base>.
 
-<Log> "Crawling: ${<url>}" to the <console>.
+Log "Crawling: ${<url>}" to the <console>.
 
 (* Fetch the page *)
-<Request> the <html> from the <url>.
+Request the <html> from the <url>.
 
 (* Extract markdown content from HTML using ParseHtml action *)
-<ParseHtml> the <markdown-result: markdown> from the <html>.
-<Extract> the <title> from the <markdown-result: title>.
-<Extract> the <markdown-content> from the <markdown-result: markdown>.
+ParseHtml the <markdown-result: markdown> from the <html>.
+Extract the <title> from the <markdown-result: title>.
+Extract the <markdown-content> from the <markdown-result: markdown>.
 
 (* Save the markdown content to file *)
-<Emit> a <SavePage: event> with { url: <url>, title: <title>, content: <markdown-content>, base: <base-domain> }.
+Emit a <SavePage: event> with { url: <url>, title: <title>, content: <markdown-content>, base: <base-domain> }.
 
 (* Extract links from the HTML *)
-<Emit> a <ExtractLinks: event> with { url: <url>, html: <html>, base: <base-domain> }.
+Emit a <ExtractLinks: event> with { url: <url>, html: <html>, base: <base-domain> }.
 
-<Return> an <OK: status> for the <crawl>.
+Return an <OK: status> for the <crawl>.
 }
 ```

@@ -95,102 +95,102 @@ Link extraction, normalization, filtering, and queuing.
 
 (Extract Links: ExtractLinks Handler) {
 (* Extract from event data structure *)
-<Extract> the <event-data> from the <event: data>.
-<Extract> the <html> from the <event-data: html>.
-<Extract> the <source-url> from the <event-data: url>.
-<Extract> the <base-domain> from the <event-data: base>.
+Extract the <event-data> from the <event: data>.
+Extract the <html> from the <event-data: html>.
+Extract the <source-url> from the <event-data: url>.
+Extract the <base-domain> from the <event-data: base>.
 
 (* Use ParseHtml action to extract all href attributes from anchor tags *)
-<ParseHtml> the <links: links> from the <html>.
+ParseHtml the <links: links> from the <html>.
 
 (* Process links in parallel - repository Actor ensures atomic dedup *)
 parallel for each <raw-url> in <links> {
-<Emit> a <NormalizeUrl: event> with {
+Emit a <NormalizeUrl: event> with {
 raw: <raw-url>,
 source: <source-url>,
 base: <base-domain>
 }.
 }
 
-<Return> an <OK: status> for the <extraction>.
+Return an <OK: status> for the <extraction>.
 }
 
 (Normalize URL: NormalizeUrl Handler) {
 (* Extract from event data structure *)
-<Extract> the <event-data> from the <event: data>.
-<Extract> the <raw-url> from the <event-data: raw>.
-<Extract> the <source-url> from the <event-data: source>.
-<Extract> the <base-domain> from the <event-data: base>.
+Extract the <event-data> from the <event: data>.
+Extract the <raw-url> from the <event-data: raw>.
+Extract the <source-url> from the <event-data: source>.
+Extract the <base-domain> from the <event-data: base>.
 
 (* Determine URL type and normalize *)
 match <raw-url> {
 case /^https?:\/\// {
 (* Already absolute URL - strip fragment and trailing slash *)
-<Split> the <frag-parts> from the <raw-url> by /#/.
-<Extract> the <no-fragment: first> from the <frag-parts>.
-<Split> the <slash-parts> from the <no-fragment> by /\/+$/.
-<Extract> the <clean-url: first> from the <slash-parts>.
-<Emit> a <FilterUrl: event> with { url: <clean-url>, base: <base-domain> }.
+Split the <frag-parts> from the <raw-url> by /#/.
+Extract the <no-fragment: first> from the <frag-parts>.
+Split the <slash-parts> from the <no-fragment> by /\/+$/.
+Extract the <clean-url: first> from the <slash-parts>.
+Emit a <FilterUrl: event> with { url: <clean-url>, base: <base-domain> }.
 }
 case /^\/$/ {
 (* Just "/" means root - use base domain as-is (no trailing slash) *)
-<Emit> a <FilterUrl: event> with { url: <base-domain>, base: <base-domain> }.
+Emit a <FilterUrl: event> with { url: <base-domain>, base: <base-domain> }.
 }
 case /^\// {
 (* Root-relative URL: prepend base domain, strip fragment and trailing slash *)
-<Create> the <joined-url> with "${<base-domain>}${<raw-url>}".
-<Split> the <frag-parts> from the <joined-url> by /#/.
-<Extract> the <no-fragment: first> from the <frag-parts>.
-<Split> the <slash-parts> from the <no-fragment> by /\/+$/.
-<Extract> the <clean-url: first> from the <slash-parts>.
-<Emit> a <FilterUrl: event> with { url: <clean-url>, base: <base-domain> }.
+Create the <joined-url> with "${<base-domain>}${<raw-url>}".
+Split the <frag-parts> from the <joined-url> by /#/.
+Extract the <no-fragment: first> from the <frag-parts>.
+Split the <slash-parts> from the <no-fragment> by /\/+$/.
+Extract the <clean-url: first> from the <slash-parts>.
+Emit a <FilterUrl: event> with { url: <clean-url>, base: <base-domain> }.
 }
 case /^(#|mailto:|javascript:|tel:|data:)/ {
 (* Skip fragments and special URLs *)
 }
 }
 
-<Return> an <OK: status> for the <normalization>.
+Return an <OK: status> for the <normalization>.
 }
 
 (Filter URL: FilterUrl Handler) {
 (* Extract from event data structure *)
-<Extract> the <event-data> from the <event: data>.
-<Extract> the <url> from the <event-data: url>.
-<Extract> the <base-domain> from the <event-data: base>.
+Extract the <event-data> from the <event: data>.
+Extract the <url> from the <event-data: url>.
+Extract the <base-domain> from the <event-data: base>.
 
 (* Filter URLs that belong to the same domain as base-domain *)
-<Emit> a <QueueUrl: event> with { url: <url>, base: <base-domain> } when <url> contains <base-domain>.
+Emit a <QueueUrl: event> with { url: <url>, base: <base-domain> } when <url> contains <base-domain>.
 
-<Return> an <OK: status> for the <filter>.
+Return an <OK: status> for the <filter>.
 }
 
 (Queue URL: QueueUrl Handler) {
 (* Extract from event data structure *)
-<Extract> the <event-data> from the <event: data>.
-<Extract> the <url> from the <event-data: url>.
-<Extract> the <base-domain> from the <event-data: base>.
+Extract the <event-data> from the <event: data>.
+Extract the <url> from the <event-data: url>.
+Extract the <base-domain> from the <event-data: base>.
 
 (* Generate deterministic id from URL hash for deduplication *)
-<Compute> the <url-id: hash> from the <url>.
+Compute the <url-id: hash> from the <url>.
 
 (* Store with id - repository deduplicates by id, observer only fires for new entries *)
-<Create> the <crawl-request> with { id: <url-id>, url: <url>, base: <base-domain> }.
-<Store> the <crawl-request> into the <crawled-repository>.
+Create the <crawl-request> with { id: <url-id>, url: <url>, base: <base-domain> }.
+Store the <crawl-request> into the <crawled-repository>.
 
-<Return> an <OK: status> for the <queue>.
+Return an <OK: status> for the <queue>.
 }
 
 (Trigger Crawl: crawled-repository Observer) {
 (* React to new entries in the repository *)
-<Extract> the <crawl-request> from the <event: newValue>.
-<Extract> the <url> from the <crawl-request: url>.
-<Extract> the <base-domain> from the <crawl-request: base>.
+Extract the <crawl-request> from the <event: newValue>.
+Extract the <url> from the <crawl-request: url>.
+Extract the <base-domain> from the <crawl-request: base>.
 
-<Log> "Queued: ${<url>}" to the <console>.
-<Emit> a <CrawlPage: event> with { url: <url>, base: <base-domain> }.
+Log "Queued: ${<url>}" to the <console>.
+Emit a <CrawlPage: event> with { url: <url>, base: <base-domain> }.
 
-<Return> an <OK: status> for the <observer>.
+Return an <OK: status> for the <observer>.
 }
 ```

@@ -212,20 +212,20 @@ File storage handler.
 
 (Save Page: SavePage Handler) {
 (* Extract event data *)
-<Extract> the <event-data> from the <event: data>.
-<Extract> the <url> from the <event-data: url>.
-<Extract> the <title> from the <event-data: title>.
-<Extract> the <content> from the <event-data: content>.
+Extract the <event-data> from the <event: data>.
+Extract the <url> from the <event-data: url>.
+Extract the <title> from the <event-data: title>.
+Extract the <content> from the <event-data: content>.
 
 (* Generate a hash of the URL for the filename.
 Hashes are unique and filesystem-safe.
 The actual URL is preserved in the file content. *)
-<Compute> the <url-hash: hash> from the <url>.
+Compute the <url-hash: hash> from the <url>.
 
 (* Build the file path with string interpolation *)
-<Create> the <file-path> with "./output/${<url-hash>}.md".
+Create the <file-path> with "./output/${<url-hash>}.md".
 
-<Log> "Saving: ${<url>} to ${<file-path>}" to the <console>.
+Log "Saving: ${<url>} to ${<file-path>}" to the <console>.
 
 (* Format the Markdown file with metadata.
 \n creates newlines.
@@ -234,13 +234,13 @@ File storage handler.
 - Source URL for reference
 - Separator
 - Actual content *)
-<Create> the <file-content> with "# ${<title>}\n\n**Source:** ${<url>}\n\n---\n\n${<content>}".
+Create the <file-content> with "# ${<title>}\n\n**Source:** ${<url>}\n\n---\n\n${<content>}".
 
 (* Write the content to the file.
 The 'file:' specifier indicates the target is a file path. *)
-<Write> the <file-content> to the <file: file-path>.
+Write the <file-content> to the <file: file-path>.
 
-<Return> an <OK: status> for the <save>.
+Return an <OK: status> for the <save>.
 }
 ```
