
Conversation

@tristanls
Contributor

@tristanls tristanls commented Aug 7, 2025

For easiest viewing on GitHub, use the "View file" option.

@tristanls tristanls changed the title rfc: start of draft rfc: Cortical Messaging Protocol (CMP) Aug 7, 2025
@tristanls tristanls self-assigned this Aug 7, 2025
Contributor

@vkakerbeck vkakerbeck left a comment


Nice start on making this a bit more concrete. I just left one comment for now. A high-level comment: while there is a very strict order to this right now, I don't think that is a conceptual requirement. In the brain, for example, we can have various delays, and neurons continue firing for certain periods and don't always wait for all inputs to arrive first. Voting would just be associative connections between L2/3 neurons that can reinforce each other as soon as there is an active representation there, not just in discrete steps after a column has received new input and votes.


#### 1.1.2 Learning Modules Receive Before Vote

Learning Modules MUST receive all Cortical Messages addressed to them before generating any Vote Cortical Messages for other Learning Modules.
Contributor


This currently only applies to messages that originate from SM or LM classification outputs (in CC this would be the L4/6 feed forward input). The LMs generate a vote output before they receive votes (which are also a CMP message). Those would be the L2/3 lateral connections in the CC.

Contributor Author


👍. If we retain this constraint, I'll clarify with an update to "... MUST receive all Observation Cortical Messages...".

@jeremyshoemaker
Contributor

while there is a very strict order to this right now, I don't think that this is a conceptual requirement. In the brain for example, we can have various delays and neurons continue firing for certain periods and don't always wait for all inputs to arrive first. Voting would just be associative connections between L2/3 neurons that can reinforce each other as soon as there is an active representation there, not just in discrete steps after a column has received new input and votes.

This sounds similar to how I've been thinking about it. I'm not sure how much lockstep ordering there needs to be for it to work successfully. As I was saying to @tristanls yesterday in Slack, I'd be curious to build a prototype that doesn't have such a strict ordering (with the modules being able to communicate and respond to messages at any time) and see if we get similar results. Because in my mind, especially when we start to think about scaling to thousands of learning modules, it makes less sense to continue to try to run them all in lockstep.

@tristanls
Contributor Author

tristanls commented Aug 8, 2025

@vkakerbeck @jeremyshoemaker OK, so the enumerated constraints seem too much.

Based on y'all's comments, I would then suggest these three constraints:


#### 1.1.1 Message Propagation Delay

The Cortical Message propagation delay between the same sender and the same receiver MUST be constant for all steps.

#### 1.1.2 Module Processing Duration

Each Module MUST take constant time to receive, process, and emit Cortical Messages.

#### 1.1.3 Module Processing Completeness

Each Module MUST process all received Cortical Messages.


How do you feel about these constraints?
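To make the intent concrete, here is a toy virtual-time sketch of the three constraints. All names (`Network`, `EchoModule`, `process`, `step`) are illustrative stand-ins, not Monty's API:

```python
from collections import defaultdict

class EchoModule:
    """Toy module: records everything it receives and forwards it on."""
    def __init__(self, name, target=None):
        self.name, self.target = name, target
        self.received = []

    def process(self, messages):
        # 1.1.3 Processing Completeness: every delivered message is
        # consumed; none may be dropped.
        self.received.extend(messages)
        if self.target is None:
            return []
        return [(self.target, m) for m in messages]

class Network:
    def __init__(self, delays):
        # 1.1.1 Propagation Delay: the delay for each (sender, receiver)
        # pair is fixed up front and never varies between steps.
        self.delays = delays
        self.in_transit = []  # (deliver_at_step, receiver_name, payload)

    def step(self, t, modules):
        inbox = defaultdict(list)
        still_in_transit = []
        for deliver_at, receiver, payload in self.in_transit:
            if deliver_at == t:
                inbox[receiver].append(payload)
            else:
                still_in_transit.append((deliver_at, receiver, payload))
        self.in_transit = still_in_transit
        # 1.1.2 Processing Duration: every module is invoked exactly once
        # per step, so receive/process/emit takes one virtual-time tick.
        for m in modules.values():
            for receiver, payload in m.process(inbox[m.name]):
                delay = self.delays[(m.name, receiver)]
                self.in_transit.append((t + delay, receiver, payload))

a = EchoModule("A", target="B")
b = EchoModule("B")
net = Network(delays={("A", "B"): 1})
net.in_transit = [(0, "A", "percept")]
for t in range(3):
    net.step(t, {"A": a, "B": b})
```

With a fixed delay of 1, the percept delivered to A at step 0 arrives at B at step 1, and both modules process everything they receive.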

@jeremyshoemaker
Contributor

Based on y'all's comments, I would then suggest these three constraints:
...
How do you feel about these constraints?

I have thoughts about all three of these, but I'll first ask what is driving each of them? I'm not sure what the why is and maybe that would help with the parts I'm struggling with.

One way I think about systems is to consider what would happen if each of the parts was a separate process or even on separate machines. It makes the unreliability and latency issues jump to the foreground.

All the constant time constraints, even if they're abstract constant time and not wall clock constant time, what are they for?

As for the third one, how would we know if a module didn't process all its messages? What would be the consequence of that? Which part of the system is the observer of this?

@tristanls
Contributor Author

tristanls commented Aug 8, 2025

I'll first ask what is driving each of them?

We are designing/describing a protocol based on how we think cortical columns work. To me, the delay, duration, and completeness constraints represent the constraints of cortical columns and messaging between them.

@tristanls
Contributor Author

how would we know if a module didn't process all its messages?

@jeremyshoemaker, I do not understand this specific question in the context of defining a protocol. We are specifying a protocol. If the protocol states that something MUST happen, then it is up to the implementer to ensure this. Otherwise, they did not implement the protocol.

To answer the other aspects. I would think that the consequences of not processing all messages would be analogous to chemistry and physics not working in a cortical column. Given that things are connected, how could a cortical column not incorporate the signals it receives?

@tristanls tristanls changed the title rfc: Cortical Messaging Protocol (CMP) rfc: Cortical Messaging Protocol (CMP) v1 Aug 8, 2025
@jeremyshoemaker
Contributor

@jeremyshoemaker, I do not understand this specific question in the context of defining a protocol. We are specifying a protocol. If the protocol states that something MUST happen, then it is up to the implementer to ensure this. Otherwise, they did not implement the protocol.

I think most protocols define what error states they have and how to behave in those states.

To answer the other aspects. I would think that the consequences of not processing all messages would be analogous to chemistry and physics not working in a cortical column. Given that things are connected, how could a cortical column not incorporate the signals it receives?

Simple brain damage could cause it without having to resort to breaking chemistry or physics. Breaking or degrading of the synapses would do it.

I guess where I'm coming from with this line of questioning is: what things can break, and how would something that is running the protocol detect and recover from those sorts of problems? Assume that not all the parts are implemented by the same person or team. If we're building shared components that use CMP, then those components need to be able to handle when other components aren't working correctly.

@scottcanoe
Contributor

scottcanoe commented Aug 11, 2025

@tristanls Is this document the place to define how flags like "motor_only" are meant to work?

@tristanls
Contributor Author

@scottcanoe, I don't think so. As far as I understand, "motor_only" is a message routing implementation detail of how we chose to do messaging, but I don't think it is a requirement of the protocol itself.

#### 2.1.7 Motor Modules

During the Motor Modules phase the Cortical Messages from Goal State Selectors are delivered to the Motor Modules. The Motor Modules process the Cortical Messages and output actions. The phase ends when all actions are created.
During the Motor Modules phase the Cortical Messages from Goal State Selectors are delivered to the Motor Modules. The Motor Modules process the Cortical Messages, observations, and proprioceptive state and generate zero, one, or more actions. The phase ends when all actions are created.
Contributor


This is currently the case (although it happens in the dataloader/motor system, since we don't have motor modules yet), but with the changes to SMs to be able to send goal states, I would like to get to the point where any sensory processing happens in the SM and none in the motor system (except potentially some fast reflex loops for security purposes).

Contributor

@nielsleadholm nielsleadholm left a comment


Nice, thanks for starting to formalize this! Just added some thoughts.


Communication within a Thousand Brains System is conveyed via variable-length Cortical Messages.

CMP supports unicast and multicast delivery of data.
Contributor


suggestion: I was not familiar with this terminology, I think adding something like "(i.e., one-to-one communication)... (i.e., one-to-many communication)" would be helpful.


CMP supports unicast and multicast delivery of data.

The data flow in CMP is unidirectional, from the sender to one or more receivers.
Contributor


suggestion: For clarity, it could be worth adding "In practice, other CMP messages can be sent resulting in bi-directional communication between two modules."


The data flow in CMP is unidirectional, from the sender to one or more receivers.

CMP is a connectionless protocol that maintains message boundaries with no connection setup or feature negotiation.
Contributor


suggestion: I was able to understand this reading online, but it would be helpful to unpack this with a sentence or two for a more naive reader (like myself, or someone in the community).


#### 1.1.2 Module Processing Duration

Each Module MUST take constant time to receive, process, and emit Cortical Messages.
Contributor


note: While I agree with receive, a module might spend more time processing its inputs before it emits a cortical message. Until it e.g. reaches a state of high confidence, it might not emit any Cortical Messages.

Contributor Author


🤔, yes, good point.

Here's what I'm after with the processing duration constraint...

We are building an analogy to the cortical column. I understand (please correct me if I'm wrong; my knowledge here is limited) that a cortical column will spend (to some approximation) a fixed amount of time receiving and processing synaptic input. As you pointed out, just because it received and processed input does not mean it will spike (or emit Cortical Messages).

🤔, would this be better?

Each Module MUST take constant time to receive and process Cortical Messages. During that time, each module MAY also emit Cortical Messages.
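A toy sketch of that revised wording (`ThresholdModule` and its names are hypothetical, not Monty's API): the module is invoked for exactly one virtual-time tick per step and always processes its input, but emitting is a MAY, not a MUST:

```python
class ThresholdModule:
    """Illustrative module that only emits once it is confident enough."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.evidence = 0

    def process(self, messages):
        self.evidence += len(messages)      # MUST receive and process input
        if self.evidence < self.threshold:  # MAY stay silent this step
            return []
        return [("vote", self.evidence)]    # MAY emit once confident

m = ThresholdModule(3)
first = m.process(["cm1"])          # not confident yet: no emission
second = m.process(["cm2", "cm3"])  # threshold reached: emits a vote
```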

Contributor

@nielsleadholm nielsleadholm Aug 21, 2025


Yeah so partly I guess it comes down to current vs. potential future (more async) protocol.

Currently, the LM processing in Monty is indeed fixed time (or at least constrained so that all LMs finish processing before the logic continues). Whether the LM outputs a message on a given "cycle" of the CMP is optional. I think that fits with your suggested update.

Re. biology, while conduction delays along axons are largely fixed, the time it takes for a neuron to integrate incoming signals and spike depends on a complex combination of when incoming spikes arrive, as well as other factors like where on the neuron the signals arrive, and what inhibition if any is present. In general these dynamics are highly non-linear (for example, two spikes arriving nearby in time could summate much more strongly if they arrive just slightly closer together in time). I think this would ultimately fit better with the async description I wrote about separately. In that case, an LM can output a signal basically at arbitrary times, but whether someone else is listening and will integrate it will depend on their own state in the cycle.


Let me know if you want to jump on a huddle to discuss at any point, might be easier.

Contributor Author


Thank you for the confirmation and the biological explanation.

As you mentioned, the overall sync vs. async discussion came up immediately after I proposed the draft CMP spec 😄. So, I added "v1" to navigate around this problem specifically. To produce a specification that describes how Monty works today, so that other people can build their own implementations or have an easier time integrating with Monty, I think we want v1 to be the synchronous protocol description, with the reference implementation as implemented today via virtual-time step()s.

I propose we defer the asynchronous implementation and specification to a follow-on version.

Contributor


I'm still trying to pin down what is bothering me about the "constant time" phrasing and come up with a good alternative.

For me, the thought that comes up when constant time is mentioned in a software context is security. Preventing timing attacks by having all branches of an algorithm take the same amount of time. For clarity, what I mean by this is, for example, having a password checking algorithm take the exact same amount of time whether the user exists in the system or not, to prevent attackers from using the run time difference to find out whether a user has an account or not. There are other examples in encryption that are a bit more involved, but the same idea, preventing an outside observer from learning information based on the run time of an operation.

I don't think what is meant here by "constant time" is that they literally take the exact same number of seconds to process each step, but rather that the execution is bounded or moves in lockstep. Tying into the conversation about whether an LM must send a CMP message, I think it would have to for the system to know all LMs have finished processing work, even if that message is a no-op "Yeah, I've got nothing". That's not required if the controlling system just introspects the state of each LM, but I think in the long run, a message would make more sense.

I think "constant time" gives a misleading idea of what we're talking about here.

Contributor Author


@jeremyshoemaker, I see how "constant time" can be problematic.

A no-op message is too much. There are synchronization mechanisms other than message-based ones. For example, in Monty, we can synchronize by knowing when all LM processing invocations return; no introspection is required.

It seems like I'll need to anchor things in a "hard real-time system" framing instead. I'll have to think some more about how to phrase this. There's the nuance that "real-time" could be "virtual-time" for some use cases.
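To illustrate the call-return synchronization point (a sketch, not Monty's actual control flow; `run_step`, `Silent`, and `Talker` are made-up names):

```python
from concurrent.futures import ThreadPoolExecutor

class Silent:
    name = "S"
    def process(self, messages):
        return []  # nothing to say; no no-op message required

class Talker:
    name = "T"
    def process(self, messages):
        return [("vote", 1)]

def run_step(modules, inboxes):
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(m.process, inboxes.get(m.name, []))
                   for m in modules]
        # Gathering every result is the synchronization point: the step is
        # known to be complete once all invocations have returned, with no
        # introspection of module state and no no-op messages.
        return [msg for f in futures for msg in f.result()]

result = run_step([Silent(), Talker()], {})
```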

Contributor


I think that would be a better angle to come at it from. I was going to argue that maybe we shouldn't try to aim for hard real-time, since I would imagine a lot of applications wouldn't need it, but since we probably will have applications that need it, hard real-time is the stricter requirement, and soft real-time falls out of that by relaxing constraints.


CMP proceeds in non-overlapping steps made up of atomic phases. In each phase, Cortical Messages can be created, in-transit, or delivered. Creation and delivery happen transactionally, all-at-once, in a single, indivisible operation.

Each CMP step begins when the Thousand Brains System receives observations and proprioceptive state from the environment. Each CMP step ends when the Thousand Brains System outputs actions.
Contributor

@nielsleadholm nielsleadholm Aug 15, 2025


thought: This implies a global stepping of the CMP signals. As per the other discussion going on, I think in the long-term, we would expect this to be relaxed and be more asynchronous. While I think it makes sense to describe the CMP as it currently exists, maybe it's worth having some of this kind of discussion now at the end of the RFC.

It's not entirely clear to me how this asynchrony would work, but some thoughts:

While it's true a learning module cannot Receive (2) before 1. Sensor Modules sends, I think in a physical system like a robot, a Sensor Module might have moved and be sending a new signal before 7. is reached (e.g. other part of the body moves, or the external world changes). One way we might handle this without everything breaking is that different modules have different steps they need to carry out internally before they are receptive to more information. For example, when an LM enters (2) + (3), the CMs that were received at that point are processed until (5) is reached. Until (5), no more incoming signals can impact that LM. Only after 5 can the LM receive a new SM's CM or votes (2/3). However, it does not need to wait until (7) before it can do this.

This raises the question of whether CM signals should persist until they are consumed, or have some temporary persistence determined by the sender (in the brain, likely the latter).

The above could also fit with multiple SM-LM-MM loops operating in parallel asynchronously. For example, the SM-LM-MM loop associated with you walking around a park with someone might be working very independently from one controlling the conversation you are having with the person. Thus there isn't a global (1) or (7) that happens everywhere all at once. However, there would still be some control-flow like whether an LM or Motor Module is receptive to new inputs.

edit: I don't know if this means that we end up aiming for a v1.0 CMP (synchronous, strongly time-locked), while acknowledging there may be a breaking future version of the CMP (semi-asynchronous). Such breaking might be offset by being able to re-introduce synchrony even in the future system?
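A rough sketch of the receptive-window idea above (purely illustrative; `AsyncLM`, `deliver`, `poll`, and the `ttl` argument are hypothetical names, with `ttl` standing in for sender-determined message persistence):

```python
class AsyncLM:
    def __init__(self):
        self.receptive = True  # False while the LM is mid-processing
        self.pending = []      # list of (expires_at, message)

    def deliver(self, message, now, ttl=2):
        # The sender decides how long the message persists if unconsumed.
        self.pending.append((now + ttl, message))

    def poll(self, now):
        # Drop messages whose persistence has lapsed.
        self.pending = [(t, m) for t, m in self.pending if t > now]
        if not self.receptive:
            return []  # mid-processing: incoming signals wait for now
        taken = [m for _, m in self.pending]
        self.pending = []
        return taken

lm = AsyncLM()
lm.deliver("cm", now=0, ttl=2)  # persists until t=2
lm.receptive = False
held = lm.poll(now=1)           # LM busy: message waits
lm.receptive = True
taken = lm.poll(now=1)          # LM receptive again: message integrated
stale = lm.poll(now=3)          # nothing pending anymore
```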

Contributor Author

@tristanls tristanls Aug 21, 2025


The other thing I would add is that asynchrony-of-what will also be important (which, I think, you allude to as well). We can have a single sync-step loop (draft CMP v1). We can have multiple globally synchronized sync-step loops (two independent Montys responsible for different stuff, but acting on a single robot). We could have multiple globally asynchronous sync-step loops, where the coordination between independent sync-step loops is async. We could have a single async loop. We could have multiple sync async-loops. We could have multiple async async-loops. To me, "async" means almost anything, and the solution space explodes rapidly. I don't view it as sync vs async; I view it as sync vs all the other possible things. This is why I propose we defer the asynchronous implementation and specification to a follow-on version.

Contributor


Yeah, definitely happy to defer this to later. I think it's interesting to start thinking about now, but as you say the solution space is enormous, so not something we need to figure out today.

@sjthomas-chillyhill

Using a very simplified neuron analogy, signals arrive asynchronously across dendrites, but their “input” relative to the triggering of an action potential happens based on the combined influence on the state of the soma (again, a simplistic view). The dendritic field acts as a spatiotemporal signal pooler. With dt ~ 0 → continuous; as we increase dt, we don't want to lose the information content of the signals. Digitization of an analog signal can be indistinguishable from the original as long as the salient information is retained (up to the time scale of the receiving system).

If the signal step and LM step are synchronized, then wouldn't we have: input state vector + LM state → output state? But the input state vector would need to encode any important temporal characteristics of the input space. If the LM step is longer than the signal step, then wouldn't we need at least temporal pooling on each input channel?

Would each LM step need to be synched to a global time or could they just have independent step timing? If independent I would expect the system to be more dynamic, harder to explain at any given point in time, take slightly more cycles to cohere, but would this possibly be better at encoding more stable associations?

If downstream LMs had temporal input pooling (as a general input process) would that provide more stability?

Contributor


Thanks for joining the conversation, @sjthomas-chillyhill. I'm not sure I entirely followed your discussion, but one thing to highlight re. e.g. "input state vector would need to encode any important temporal characteristics of the input space" is that we currently don't have any kind of temporal signaling in Monty, nor indeed a concept of dendritic branches with synapses. In other words, we are not currently modelling neurons with the kind of fidelity that spiking neural networks would exhibit, or even HTM networks. As such, I'm not sure how much we need to worry about these questions at the moment, although they would be interesting to consider if we brought in these more neural elements.

@sjthomas-chillyhill

The current protocol would suggest that LM->LM processing happens vertically at each step, such that with LM.a -> LM.b -> LM.c, the vote of LM.a would be tallied by LM.b at T+1, and the vote of LM.b would be tallied by LM.c at T+2.
If there is also LM.c -(back to)-> LM.a, then the vote of LM.c would be tallied by LM.a at T+3.
Would this be the same for horizontally connected AND vertically connected LMs? Some of the images sort of suggest that horizontally connected LMs vote and tally as a group, or is it that LMs that are connected AND have sensor input vote and tally as a group?

Does this suggest that the depth/breadth of the connections would represent a "context window"?

In 2.1.6, "Each Goal State Selector then generates exactly one in-transit Cortical Message intended for the Motor Modules". Would it ever be the case that the GSS would also be an input to an LM? Perhaps to set up a hypothesis for the next SM input?

@tristanls
Contributor Author

@sjthomas-chillyhill the intent behind vote & tally is that first, every LM anywhere (optionally) generates a vote, and second, all votes get delivered to every LM that should receive a vote. I'm not quite sure what T represents in your description, but assuming that T is a step count, all votes are generated and tallied within a single T (2.1.3 -> 2.1.4).

There is some missing context in the draft in that it does not yet highlight what kind of Cortical Message is sent out.

There are three kinds: Goal, Vote, and Percept.

2.1.1

  • Sensor Module -> Learning Module: Percept
  • Sensor Module -> Goal State Selector: Goal

2.1.3

  • Learning Module -> Learning Module: Vote

2.1.5

  • Learning Module -> Learning Module (next step): Percept, Goal
  • Learning Module -> Goal State Selector: Goal

2.1.6

  • Goal State Selector -> Motor Module: Goal

Votes are generated and received within the same step (2.1.3 -> 2.1.4).

Percepts and Goals between Learning Modules cross the step boundary (2.1.5 -> 2.1.2).
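This routing summary could be captured as data, for example (a sketch: the phase numbers and message kinds come from above, while the Python names are illustrative):

```python
from enum import Enum

class Kind(Enum):
    PERCEPT = "percept"
    VOTE = "vote"
    GOAL = "goal"

# (sender, receiver, kind) triples per phase, as summarized above.
ROUTES = {
    "2.1.1": [("SensorModule", "LearningModule", Kind.PERCEPT),
              ("SensorModule", "GoalStateSelector", Kind.GOAL)],
    "2.1.3": [("LearningModule", "LearningModule", Kind.VOTE)],
    "2.1.5": [("LearningModule", "LearningModule", Kind.PERCEPT),
              ("LearningModule", "LearningModule", Kind.GOAL),
              ("LearningModule", "GoalStateSelector", Kind.GOAL)],
    "2.1.6": [("GoalStateSelector", "MotorModule", Kind.GOAL)],
}

def crosses_step_boundary(sender, receiver, kind):
    # Votes are delivered within the same step (2.1.3 -> 2.1.4);
    # LM-to-LM Percepts and Goals cross into the next step (2.1.5 -> 2.1.2).
    return (sender, receiver) == ("LearningModule", "LearningModule") \
        and kind in (Kind.PERCEPT, Kind.GOAL)
```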

@sjthomas-chillyhill

assuming that T is a step count, all votes are generated and tallied within a single T (2.1.3 -> 2.1.4)

Yes, T is the (time) step.

This does clarify it nicely. Thanks.

Now I'm seeing the structure as a bit different. The deployed configuration of LMs at one level is a single operational unit regardless of the number of LMs in the configuration. Would it be fair to say that the vote/tally part represents a separate information channel (e.g., from a messaging implementation perspective, a "topic")? And, more specifically, that it is a fully encapsulated process that, in the current architecture, could not be asynchronous with respect to percepts or goals?

@tristanls
Contributor Author

Would it be fair to say that the vote/tally part represents a separate information channel (e.g., from a messaging implementation perspective, a "topic").

Yes, it is fair to say that voting happens through a separate information channel from goals and percepts. In biology, the cortical column layer handling voting is different from the layer handling goals and percepts.

To recap a simple version of the model (i.e. simplified to exclude top-down feedback or motor outputs):
We assume the simple features that are often detected experimentally in areas like V1 correspond to the feature input (layer 4) in a cortical column. Each column then integrates movement in L6, and uses features-at-locations to build a more stable representation of a larger object in L3 (i.e. larger than the receptive field of neurons in L4). L3’s lateral connections then support "voting", enabling columns to inform each-other’s predictions.
-- https://thousandbrainsproject.readme.io/docs/faq-monty#do-cortical-columns-in-the-brain-really-model-whole-objects-like-a-coffee-mug-in-v1

a fully encapsulated process that, in the current architecture, could not be asynchronous with respect to percepts or goals.

Yes. In CMP v1, as proposed here so far, this is true. The CMP v1 proposal is a very synchronous architecture. The only async part accommodated by CMP v1 is cross-step goal & percept messages (2.1.5 -> 2.1.2), and even that imposes only one fixed delay equal to a single (time) step.

To be clear, a more asynchronous architecture is not forbidden by the theory. But, for now, I am deferring it to future iterations beyond CMP v1, at least for the purpose of this RFC proposal.
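Sketched as code, one synchronous CMP v1 step discussed in this thread might look like this (all names, including `cmp_step` and `StubLM`, are illustrative stand-ins, not Monty's API):

```python
class StubLM:
    """Minimal stand-in learning module for the sketch."""
    def vote(self, percepts):
        return ("vote", len(percepts))
    def tally(self, votes):
        return ("goal", len(votes))

def cmp_step(observations, sensor_modules, learning_modules,
             goal_state_selectors, motor_modules):
    # 2.1.1/2.1.2: Sensor Modules emit Percepts, delivered to LMs.
    percepts = [sm(observations) for sm in sensor_modules]
    # 2.1.3/2.1.4: votes are generated and tallied within the same step.
    votes = [lm.vote(percepts) for lm in learning_modules]
    # 2.1.5: LM outputs (cross-step LM-to-LM messages omitted here).
    goals = [lm.tally(votes) for lm in learning_modules]
    # 2.1.6: Goal State Selectors produce Goal messages.
    goal_states = [gss(goals) for gss in goal_state_selectors]
    # 2.1.7: Motor Modules output actions; the step ends here.
    return [mm(goal_states) for mm in motor_modules]

actions = cmp_step(
    observations={"depth": 0.3},
    sensor_modules=[lambda obs: ("percept", obs)],
    learning_modules=[StubLM(), StubLM()],
    goal_state_selectors=[lambda goals: ("goal_state", goals)],
    motor_modules=[lambda gs: ("action", len(gs))],
)
```

The step begins with observations and ends once every Motor Module has produced its actions, matching the step boundary described in the draft.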
