WIP on much faster word->symbols conversion. Not hooked in - but benches exist #57

mikeando · 2021-07-30T13:44:15Z

This is dredged up from some old commits -- need to work out what this is doing and why.

…hes exist

mikeando · 2021-07-30T14:44:16Z

It looks like the piece to look at is the symbolize_new3 function.

For symbolizing a sequence of N "a"s using the two symbols 'a' and 'aa' we find:

N (number of "a"s)	old	new3
40	-	6.1104 us
20	-	2.8378 us
10	35.102 us	1.4325 us
5	3.3971 us	0.72448 us

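The key point is that the new method is roughly linear (its time about doubles each time N doubles), while the old is not: going from N=5 to N=10 makes the old method about ten times slower.

The new version is also faster than the old even for small inputs.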

It works by using a dynamic programming / path-finding style approach.
(Which will need a lot of comments to explain what it is doing)
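
A minimal sketch of what a dynamic programming / path-finding style symbolisation can look like. This is not the actual symbolize_new3 code from this PR. It assumes a word is a byte slice and every symbol is a byte slice; `Edge`, `build_lattice` and `count_segmentations` are hypothetical names introduced only for illustration. Rather than listing every segmentation, it records each symbol match as an edge between two positions and keeps only the edges that lie on a complete path from the start of the word to its end:

```rust
/// One edge of the segmentation lattice: symbol number `symbol`
/// covers `word[start..end]`. (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, PartialEq, Eq)]
struct Edge {
    start: usize,
    end: usize,
    symbol: usize,
}

/// Build the lattice of symbol matches for `word`, keeping only edges that
/// lie on some complete start-to-end path. Empty symbols are ignored.
fn build_lattice(word: &[u8], symbols: &[&[u8]]) -> Vec<Edge> {
    let n = word.len();

    // Forward pass: reachable[i] is true if word[..i] can be fully covered.
    let mut reachable = vec![false; n + 1];
    reachable[0] = true;
    let mut candidates = Vec::new();
    for i in 0..n {
        if !reachable[i] {
            continue;
        }
        for (k, s) in symbols.iter().enumerate() {
            if !s.is_empty() && word[i..].starts_with(s) {
                reachable[i + s.len()] = true;
                candidates.push(Edge { start: i, end: i + s.len(), symbol: k });
            }
        }
    }

    // Backward pass: completes[i] is true if word[i..] can be fully covered.
    // Edges are ordered by start, so walking them in reverse visits every
    // edge after all edges that begin at its end position.
    let mut completes = vec![false; n + 1];
    completes[n] = true;
    for e in candidates.iter().rev() {
        if completes[e.end] {
            completes[e.start] = true;
        }
    }

    // Drop edges that lead into a dead end.
    candidates.retain(|e| completes[e.end]);
    candidates
}

/// Count the distinct segmentations represented by the lattice without
/// enumerating them. `n` is the length of the word.
fn count_segmentations(n: usize, edges: &[Edge]) -> u64 {
    let mut ways = vec![0u64; n + 1];
    ways[0] = 1;
    // Edges are ordered by start, so ways[e.start] is final when it is read.
    for e in edges {
        ways[e.end] += ways[e.start];
    }
    ways[n]
}
```

For five "a"s with the symbols 'a' and 'aa' this keeps 9 edges while representing 8 distinct segmentations. The number of edges grows linearly with N, whereas the number of segmentations grows like the Fibonacci numbers, which is one plausible reason an approach that enumerates every segmentation would scale badly on this input.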

mikeando · 2021-07-30T14:45:43Z

Running on a more realistic data set the new method takes 2.0481 ms vs the old 9.6705 ms.

The data is Moby_Names_M_lc.txt, keeping the following as the symbols:

all characters with a count higher than 40
20 best trigrams
20 best trigrams

Then we symbolise all 3897 words.

Copilot

Pull Request Overview

This PR introduces a new, faster word-to-symbols conversion implementation alongside comprehensive benchmarking infrastructure. The implementation appears to be experimental work ("WIP" and "Not hooked in") exploring performance optimizations for symbolization algorithms.

* Adds a complete new symbolization module with multiple algorithmic variants
* Removes unnecessary reference operators in existing symbol operations
* Implements benchmarking infrastructure to compare old vs new approaches

Reviewed Changes

Copilot reviewed 6 out of 10 changed files in this pull request and generated 3 comments.

File	Description
markovian-core/src/symbolize_new.rs	New module with experimental symbolization algorithms and extensive test coverage
markovian-core/src/symbol.rs	Removes redundant reference operators in slice method calls
markovian-core/src/renderer.rs	Removes redundant reference operators in slice method calls
markovian-core/src/lib.rs	Adds new symbolize_new module to public interface
markovian-core/benches/my_benchmark.rs	Comprehensive benchmarking suite comparing old and new symbolization approaches
markovian-core/Cargo.toml	Adds criterion benchmarking dependency and configuration

Copilot · 2025-10-03T01:30:43Z

markovian-core/src/symbolize_new.rs

+                if s.len() > 1 {
+                    for j in 0..s.len() - 1 {
+                        symbol_index
+                            .entry(&s[0..j])


The slice range &s[0..j] should be &s[0..=j] to include the element at index j, matching the pattern used elsewhere in the codebase for prefix generation.

Suggested change

-                            .entry(&s[0..j])
+                            .entry(&s[0..=j])
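
To make the difference between the two ranges concrete, here is a small stand-alone sketch. The helper names `prefixes_exclusive` and `prefixes_inclusive` are hypothetical, and the code only mirrors the loop shape from the excerpt above, not the surrounding `symbol_index` logic:

```rust
/// Prefixes produced by the loop as written in the excerpt: `&s[0..j]`.
fn prefixes_exclusive(s: &[u8]) -> Vec<&[u8]> {
    let mut out = Vec::new();
    if s.len() > 1 {
        for j in 0..s.len() - 1 {
            out.push(&s[0..j]); // lengths 0 ..= len - 2, starting with the empty prefix
        }
    }
    out
}

/// Prefixes produced by the suggested change: `&s[0..=j]`.
fn prefixes_inclusive(s: &[u8]) -> Vec<&[u8]> {
    let mut out = Vec::new();
    if s.len() > 1 {
        for j in 0..s.len() - 1 {
            out.push(&s[0..=j]); // lengths 1 ..= len - 1, all proper non-empty prefixes
        }
    }
    out
}
```

For the input `b"abc"` the first version yields the empty prefix and `a`, while the second yields `a` and `ab`. Which of the two is intended depends on whether the empty prefix is meant to be indexed, and the excerpt alone does not show that.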

Copilot · 2025-10-03T01:30:44Z

markovian-core/src/symbolize_new.rs

+            let mut ff: FwdState = f.state();
+            ff.set_input_length(4).unwrap();
+
+            // We've set the length to 3 so we should be able to add to nodes less than or equal to 3


Comment is incorrect - the test actually sets input length to 4, not 3. The comment should read 'We've set the length to 4 so we should be able to add to nodes less than or equal to 4'.

Suggested change

-            // We've set the length to 3 so we should be able to add to nodes less than or equal to 3
+            // We've set the length to 4 so we should be able to add to nodes less than or equal to 4

WIP on much faster word->symbols conversion. Not hooked in - but benc…

9f22c8d

…hes exist

mikeando added 3 commits July 31, 2021 09:50

Fix clippy issues for new rust

03314a7

* These are mostly due to needlessly borrowing things that are already references

new_symbolise: fix new rust clippy issues.

10aeb4c

Add symbolise_new3 to more of the benchmarks

a44c655

mikeando requested a review from Copilot October 3, 2025 01:30

Copilot AI reviewed Oct 3, 2025
