Skip to content

Conversation

@mikeando
Copy link
Owner

This is dredged up from some old commits -- need to work out what this is doing and why.

@mikeando
Copy link
Owner Author

It looks like the piece to look at is the symbolize_new3 function.

For symbolizing a sequence of N "a"s using two symbols 'a' and 'aa' we find

n As old new3
40 6.1104 us
20 2.8378 us
10 35.102 us 1.4325 us
5 3.3971 us 0.72448 us

The key point is that the new method is roughly linear - while the old is not.

And the new version is faster than the old even for small inputs.

It works by using a dynamic programming / path-finding style approach.
(Which will need a lot of comments to explain what it is doing)

@mikeando
Copy link
Owner Author

mikeando commented Jul 30, 2021

Running on a more realistic data set the new method takes 2.0481 ms vs the old 9.6705 ms.

Data is Moby_Names_M_lc.txt - keeping as the symbols :

  • all characters with a count higher than 40
  • 20 best trigrams
  • 20 best trigrams

Then we symbolise all 3897 words.

@mikeando mikeando requested a review from Copilot October 3, 2025 01:30
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a new, faster word-to-symbols conversion implementation alongside comprehensive benchmarking infrastructure. The implementation appears to be experimental work ("WIP" and "Not hooked in") exploring performance optimizations for symbolization algorithms.

  • Adds a complete new symbolization module with multiple algorithmic variants
  • Removes unnecessary reference operators in existing symbol operations
  • Implements benchmarking infrastructure to compare old vs new approaches

Reviewed Changes

Copilot reviewed 6 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
markovian-core/src/symbolize_new.rs New module with experimental symbolization algorithms and extensive test coverage
markovian-core/src/symbol.rs Removes redundant reference operators in slice method calls
markovian-core/src/renderer.rs Removes redundant reference operators in slice method calls
markovian-core/src/lib.rs Adds new symbolize_new module to public interface
markovian-core/benches/my_benchmark.rs Comprehensive benchmarking suite comparing old and new symbolization approaches
markovian-core/Cargo.toml Adds criterion benchmarking dependency and configuration

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

if s.len() > 1 {
for j in 0..s.len() - 1 {
symbol_index
.entry(&s[0..j])
Copy link

Copilot AI Oct 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The slice range &s[0..j] should be &s[0..=j] to include the element at index j, matching the pattern used elsewhere in the codebase for prefix generation.

Suggested change
.entry(&s[0..j])
.entry(&s[0..=j])

Copilot uses AI. Check for mistakes.
let mut ff: FwdState = f.state();
ff.set_input_length(4).unwrap();

// We've set the length to 3 so we should be able to add to nodes less than or equal to 3
Copy link

Copilot AI Oct 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment is incorrect - the test actually sets input length to 4, not 3. The comment should read 'We've set the length to 4 so we should be able to add to nodes less than or equal to 4'.

Copilot uses AI. Check for mistakes.
let mut ff: FwdState = f.state();
ff.set_input_length(4).unwrap();

// We've set the length to 3 so we should be able to add to nodes less than or equal to 3
Copy link

Copilot AI Oct 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment is incorrect - the test actually sets input length to 4, not 3. The comment should read 'We've set the length to 4 so we should be able to add to nodes less than or equal to 4'.

Suggested change
// We've set the length to 3 so we should be able to add to nodes less than or equal to 3
// We've set the length to 4 so we should be able to add to nodes less than or equal to 4

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants