Query Search Path #482

Merged
mrferris merged 1 commit into ethereum:master from mrferris:full-tracing
Apr 19, 2023

@mrferris
Contributor

@mrferris mrferris commented Nov 24, 2022

What was wrong?

The portal_historyTraceRecursiveFindContent endpoint's output contains a list of nodes involved in a content query, but currently lacks a comprehensive trace of the query's search path.

How was it fixed?

Each content query is tracked by a QueryTracer struct that ultimately serializes to the following JSON format:

    {
        "found_content_at": "G",
        "origin": "A",
        "responses": {
            "A": {
                "timestamp_ms": 0,
                "responded_with": ["B", "C", "D"]
            },
            "B": {
                "timestamp_ms": 124,
                "responded_with": ["E", "F", "G"]
            },
            "C": {
                "timestamp_ms": 150,
                "responded_with": []
            },
            "G": {
                "timestamp_ms": 200,
                "responded_with": []
            }
        }
    }

where each letter is an ENR.
Each entry in responses maps a remote node's ENR to the list of ENRs that it responded with because it didn't have the content. origin is the local node, and the responses entry for the origin node shows the nodes that were closest to the content in its own routing table. The entry for the found_content_at node should have an empty responded_with field, as it is only used to mark the timestamp at which the content was received.

Each node's timestamp_ms field contains the number of milliseconds that elapsed between the query beginning and the response from the node being received.

Note that only the first node that responded with a given node A will have A in their responded_with list, so only the actual route taken is present in the data rather than other hypothetical routes. This means that many nodes will have empty responded_with arrays. These nodes did respond, but not with anything that hasn't already been seen in the query. This is the case for C in the example above.
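As a rough illustration of the first-seen rule described above, here is a minimal std-only sketch. It is not the PR's implementation: the SearchPath type, its fields, and the method signatures are assumptions made for this example (only the method names node_responded_with and node_responded_with_content echo names used in the PR), and node identifiers are plain strings instead of ENRs.

```rust
use std::collections::{BTreeMap, HashSet};

/// One entry in `responses`: when the node answered and which
/// previously unseen nodes it pointed to.
#[derive(Debug, Default, PartialEq)]
struct NodeResponse {
    timestamp_ms: u64,
    responded_with: Vec<String>,
}

/// Hypothetical stand-in for the tracer; identifiers are plain strings here.
#[derive(Debug, Default)]
struct SearchPath {
    origin: String,
    found_content_at: Option<String>,
    responses: BTreeMap<String, NodeResponse>,
    /// Every node that has already appeared in the query.
    seen: HashSet<String>,
}

impl SearchPath {
    fn new(origin: &str) -> Self {
        let mut seen = HashSet::new();
        seen.insert(origin.to_string());
        SearchPath { origin: origin.to_string(), seen, ..Default::default() }
    }

    /// Record that `node` answered with `peers`, keeping only peers
    /// that have not been seen earlier in the query.
    fn node_responded_with(&mut self, node: &str, timestamp_ms: u64, peers: &[&str]) {
        let mut unseen = Vec::new();
        for peer in peers {
            // `insert` returns true only the first time a value is added.
            if self.seen.insert(peer.to_string()) {
                unseen.push(peer.to_string());
            }
        }
        let entry = self.responses.entry(node.to_string()).or_default();
        entry.timestamp_ms = timestamp_ms;
        entry.responded_with.extend(unseen);
    }

    /// Record that `node` returned the content; its list stays empty.
    fn node_responded_with_content(&mut self, node: &str, timestamp_ms: u64) {
        self.found_content_at = Some(node.to_string());
        self.responses.entry(node.to_string()).or_default().timestamp_ms = timestamp_ms;
    }
}
```

Feeding in the example above, with C answering "D" and "E" (an invented response, both already seen), leaves C with an empty responded_with list even though it did respond.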

A full visualization of a query using this data format is done here: https://github.com/pipermerriam/glados/pull/28

Here's a screenshot:
Screen Shot 2023-01-04 at 12 21 11 PM

and here's the route highlighted on a successful query:

Screen Shot 2022-12-05 at 5 12 48 PM
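For readers who want to see how the highlighted route could be recovered from the trace, the following sketch walks backwards from found_content_at to origin. It is an assumption-laden illustration rather than the glados code: route_to_content is a hypothetical helper, and the responses map is simplified to node ID to responded_with list.

```rust
use std::collections::HashMap;

/// Walk from `found_content_at` back to `origin` using `responses`,
/// where each key maps to the nodes it was the first to return.
fn route_to_content(
    origin: &str,
    found_content_at: &str,
    responses: &HashMap<String, Vec<String>>,
) -> Option<Vec<String>> {
    // Invert the edges: child -> parent. In this format each child
    // is listed by exactly one parent, so the inverse is a function.
    let mut parent: HashMap<&str, &str> = HashMap::new();
    for (node, children) in responses {
        for child in children {
            parent.insert(child.as_str(), node.as_str());
        }
    }

    let mut route = vec![found_content_at.to_string()];
    let mut current = found_content_at;
    while current != origin {
        // A node with no parent means the trace is incomplete.
        current = *parent.get(current)?;
        route.push(current.to_string());
        // Guard against malformed input that contains a cycle.
        if route.len() > responses.len() + 1 {
            return None;
        }
    }
    route.reverse();
    Some(route)
}
```

With the example trace this yields A, B, G. When the content is found locally (origin equal to found_content_at), the route is just the origin node.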

To-Do

  • Add timestamps
  • Use ENRs instead of Node IDs
  • Include data to distinguish between no response from a node and no progress toward content from a node
  • Unit tests
  • Add entry to the release notes
  • Clean up commit history

Contributor

@jacobkaufmann jacobkaufmann left a comment

Neither approving nor requesting changes because I want to get your thoughts on the comments.

I'm also not sure how I feel about changing the query callback response type for FindContent. An alternative would be to create a struct for that type and include a field for the trace, which could be wrapped in an Option. I don't have a problem with doing the trace for all queries, because it could be useful for metrics, but I'm not sure it should be included in the response for non-trace find content queries.

I also find the structure of the response confusing, particularly the omission of previously returned ENRs from the arrays. If each response is timestamped, then you could infer the path if you assume that you only query a node the first time you see it. Alternatively, you could include a field in the response dedicated to representing the path.

@mrferris mrferris self-assigned this Feb 20, 2023
@mrferris mrferris force-pushed the full-tracing branch 3 times, most recently from 8b6155d to 649eedb on February 21, 2023 23:02
Contributor

@perama-v perama-v left a comment

Very nice. Looks good to me. Tested by making a query (portal_historyTraceRecursiveFindContent) and inspecting the results.

I notice that the resulting "responses" includes the node's own ID and includes "responded_with": ["node_x", "node_y", ...] for itself. This seems ok/useful, just wanted to note it.

Member

@ogenev ogenev left a comment

Blocking this until we agree on what to do with the Value type in TraceContentInfo.

@mrferris
Contributor Author

I'm also not sure how I feel about changing the query callback response type for FindContent. An alternative would be to create a struct for that type and include a field for the trace, which could be wrapped in an Option.

Agreed 👍

I also find the structure of the response confusing, particularly the omission of previously returned ENRs from the arrays. If each response is timestamped, then you could infer the path if you assume that you only query a node the first time you see it.

I originally had it that way but didn't see any useful, non-chaotic way of visualizing that info on the front-end so it was unnecessary data & front-end parsing complexity. I'm open to a second iteration down the line that adds all response data.

@mrferris
Contributor Author

Note that node_meta_data was added to the format, but will no longer be necessary once full ENR parsing is working on the front-end.

Also note that Node IDs are now being used as the primary ID for each node rather than ENRs. This was due to a bug caused by multiple ENRs for the same node ID making it into the QueryTrace.

Contributor

@jacobkaufmann jacobkaufmann left a comment

It looks like there are some unaddressed comments from other reviewers, and I have some comments of my own.

At a high level, the following aspects of the trace structure do not make much sense to me:

  • that the origin node has an entry in the responses field
  • that only the first node to respond with some node X will be the only entry with X in that list

I believe we can find a better representation for the trace that doesn't warp the information in these ways. I would be more in favor of a straightforward representation that requires the caller to do whatever necessary disentanglement of the data on their end.

@mrferris
Contributor Author

mrferris commented Mar 6, 2023

@jacobkaufmann

that the origin node has an entry in the responses field

Why would origin not have an entry? Where do we put the nodes that origin used to begin the query?

that only the first node to respond with some node X will be the only entry with X in that list

My previous response from above:

I originally had it that way (all responses) but didn't see any useful, non-chaotic way of visualizing that info on the front-end so it was unnecessary data & front-end parsing complexity. I'm open to a second iteration down the line that adds all response data.

Why add unused complexity to both the back-end and front-end? We can iterate this format as our needs progress.

@jacobkaufmann
Contributor

apologies for missing some of the original comments on rationale.

the origin would not have an entry in responses because there is no notion of "response" for the originating node. that information (the bootstrap query peers) could exist in a separate field. it seems like we are trying to stuff information into this shape that does not really accommodate the info in a clean way, so we can explore a different structure.

my claim is that right now we are introducing complexity on the back-end where things could be much simpler, and it would be better for any complexity to reside on the front-end. the simplest thing to do on the back-end is to provide all the response data as we received it. as long as the front-end can tease out the structure from that data, then it is preferable to keep our logic simple at the expense of the consumer.
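To make the trade-off concrete, here is a sketch of what the consumer-side work might look like if the back-end returned every response as received. The RawResponse type and first_seen_edges function are hypothetical names invented for this example, and the only assumption about the query is the one stated earlier in the thread: a node is queried the first time it is seen.

```rust
use std::collections::{HashMap, HashSet};

/// A raw response as received: who answered, when, and every peer they returned.
struct RawResponse {
    node: String,
    timestamp_ms: u64,
    peers: Vec<String>,
}

/// Derive first-seen edges (responder -> newly discovered peers)
/// from the complete, unfiltered list of responses.
fn first_seen_edges(origin: &str, mut raw: Vec<RawResponse>) -> HashMap<String, Vec<String>> {
    // Replay the responses in the order they arrived.
    raw.sort_by_key(|r| r.timestamp_ms);

    let mut seen: HashSet<String> = HashSet::from([origin.to_string()]);
    let mut edges = HashMap::new();
    for response in raw {
        // Keep only peers that no earlier response had mentioned.
        let unseen: Vec<String> = response
            .peers
            .into_iter()
            .filter(|peer| seen.insert(peer.clone()))
            .collect();
        edges.insert(response.node, unseen);
    }
    edges
}
```

The filtering is a single pass over the data either way, so the discussion is mostly about where that pass should live and which shape the RPC should commit to.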

@mrferris mrferris requested a review from ogenev March 9, 2023 07:12
@mrferris
Contributor Author

mrferris commented Mar 10, 2023

@ogenev

Blocking this until we agree on what to do with the Value type in TraceContentInfo.

I've created QueryTrace, NodeInfo, and QueryResponse types to use in TraceContentInfo in ethportal-api, but kept the QueryTrace implementation that creates the QueryTrace in trin-core. Perhaps the one in trin-core should be named QueryTracer (with an r) and the one in ethportal-api named QueryTrace to denote that one creates the other. Or perhaps there should just be one type and it should live in ethportal-api.

I kept the QueryTrace implementation in trin-core because it didn't feel right to move impl'd functionality (and the imports that come with it) to ethportal-api when it wouldn't be used outside of Trin. It also allowed for the separation between internal trace representation and RPC trace representation that @jacobkaufmann was looking to see. Let me know if you disagree with that reasoning, or any feedback on the usage of ethportal-api types in the tests.

@mrferris
Contributor Author

mrferris commented Mar 10, 2023

@jacobkaufmann

there is no notion of "response" for the originating node

The format is generalized: each node (including the origin node) responds to the query by either returning the requested content or by looking in its own routing table for closest ENRs to continue the query. I see no need to make the origin node's behavior an edge case and add complexity to the format instead of just describing its behavior the exact same way that the behavior of every other node is described. Seems cleaner than any non-generalized format to me.

This is the same reason why I felt it was best to change the output of a trace where the content is found locally from {} to {origin: origin_id, found_content_at: origin_id}. The format describes what happened in a generalized way, rather than piecewise.

I suppose the semantics of the word responses could use improvement. Feel free to propose alternatives, maybe closest_nodes_unseen.

my claim is that right now we are introducing complexity on the back-end where things could be much simpler, and it would be better for any complexity to reside on the front-end.

Could you elaborate on where complexity is being introduced to trin?

Are you referencing nodes only being included in the trace the first time they're seen? The single if statement that handles it is already there for other purposes, and when we make a small modification to trin to add every response it will still be the same format, just with more items in each responses entry. I'm not seeing how there's any warping or complexity there.

I agree with adding all responses in the near future. It will allow us to answer/visualize interesting questions like the degree of routing table overlap across the network. Not sure if you're saying we should block this and come up with a new format before deploying the existing functionality.

@carver
Contributor

carver commented Mar 14, 2023

@jacobkaufmann

there is no notion of "response" for the originating node

The format is generalized: each node (including the origin node) responds to the query by either returning the requested content or by looking in its own routing table for closest ENRs to continue the query. I see no need to make the origin node's behavior an edge case and add complexity to the format instead of just describing its behavior the exact same way that the behavior of every other node is described. Seems cleaner than any non-generalized format to me.

This is the same reason why I felt it was best to change the output of a trace where the content is found locally from {} to {origin: origin_id, found_content_at: origin_id}. The format describes what happened in a generalized way, rather than piecewise.

I suppose the semantics of the word responses could use improvement. Feel free to propose alternatives, maybe closest_nodes_unseen.

Yeah, naming this as a DAGSearchPath instead of a query trace might help? This data structure is the nodes and directed edges that were followed in the path. That explains why we drop the duplicate responses, and why the origin node lists outbound edges just like all the intermediate nodes do.

The reason to favor something like what Jacob is talking about would be if we want to use the query trace for something besides visualizing a search path. In which case, it makes sense to keep the data structure as clearly describing the series of events that occurred (nodes responded, etc), rather than overfitting to this particular usage.

I don't have a clear idea what else we would use this for right now. So maybe just "admitting" that we're building this data for a single purpose, by naming it as such, is the way to go. Until we have a specific idea of what else we would use a query trace for.

@mrferris mrferris changed the title from "Full Content Query Tracing" to "Query Search Path Tracing" on Mar 16, 2023
@mrferris mrferris changed the title from "Query Search Path Tracing" to "Query Search Path" on Mar 16, 2023
@mrferris
Contributor Author

Yeah, naming this as a DAGSearchPath instead of a query trace might help?

Sure.

The reason to favor something like what Jacob is talking about would be if we want to use the query trace for something besides visualizing a search path. In which case, it makes sense to keep the data structure as clearly describing the series of events that occurred (nodes responded, etc), rather than overfitting to this particular usage.

I would agree if we were locking ourselves into anything. Adding all of the responses is a ~1 line code change in trin, and is backward compatible on the front-end.

@jacobkaufmann
Contributor

here is a concrete example of the sort of thing I had in mind:

struct QueryPeer {
  requested_at: Instant,
  responded_at: Instant,
  response: QueryResponse,
}

enum QueryResponse {
  Peers(HashMap<Enr, QueryPeer>),
  Content,
}

struct QueryTrace {
  // originator of query (i.e. local node)
  origin: Enr,
  // target content ID.
  target: ContentId,
  // UTC timestamp for query start.
  started_at: Instant,
  // UTC timestamp for query end (termination).
  ended_at: Instant,
  // first level of peers in the map are the bootstrap peers, or those initially queried.
  trace: HashMap<Enr, QueryPeer>,
}

the structure could be modified to include peers who do not respond or use nested arrays instead of maps, but I just want to get the general idea across.

having said that, I'm okay moving forward with the existing structure. like @carver said, we can move forward with this special-purpose design until something more general is required. there are still some outstanding comments that I would like to see addressed though.
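A compilable variant of the nested structure proposed above might look like the following. This is only a sketch under simplifying assumptions: ENRs are replaced by String, timestamps by a millisecond offset (responded_at_ms), and path_to_content is a hypothetical helper added to show how a consumer could pull the successful route out of the nesting.

```rust
use std::collections::HashMap;

/// What a queried peer returned: more peers to try, or the content itself.
enum QueryResponse {
    Peers(HashMap<String, QueryPeer>),
    Content,
}

/// A peer that was queried, with the (relative) time of its response.
struct QueryPeer {
    responded_at_ms: u64,
    response: QueryResponse,
}

/// Depth-first search for the chain of peers that ends in `Content`.
/// Returns each hop on the route together with its response time.
fn path_to_content(peers: &HashMap<String, QueryPeer>) -> Option<Vec<(String, u64)>> {
    for (enr, peer) in peers {
        let hop = (enr.clone(), peer.responded_at_ms);
        match &peer.response {
            QueryResponse::Content => return Some(vec![hop]),
            QueryResponse::Peers(next) => {
                if let Some(mut tail) = path_to_content(next) {
                    // Prepend this peer to the route found below it.
                    tail.insert(0, hop);
                    return Some(tail);
                }
            }
        }
    }
    None
}
```

Because the map nests by value, a peer returned by several nodes would appear once under each of them, which is the extra information the flat first-seen format leaves out.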

@mrferris mrferris force-pushed the full-tracing branch 2 times, most recently from 80d6b81 to c53b856 on March 25, 2023 01:51
Member

@ogenev ogenev left a comment

IMO, we don't need two QueryTrace types - one in ethportal-api and one in query_trace.rs.

We can leave only the QueryTrace type in query_trace.rs and use some serde primitives not to serialize the started_at and target_id fields. We can also use the NodeId type from ethportal-api which will make implementing Serialize/Deserialize for this QueryTrace type trivial. Then we should re-export this type via ethportal-api for external users to access it.

In summary, this would look something like this:

  1. Remove the QueryTrace type from ethportal-api.
  2. Move QueryTrace from trin-core to trin-types.
  3. Use ethportal-api::discv5::NodeId type instead of discv5::enr::NodeId in QueryTrace.
  4. Implement Serialize/Deserialize for QueryTrace and add #[serde(skip_serializing)] attributes for started_at and target_id fields.
  5. Import QueryTrace in ethportal-api from trin-types and re-export it.

@mrferris mrferris force-pushed the full-tracing branch 4 times, most recently from 812b062 to b4b9992 on April 12, 2023 21:43
@mrferris mrferris requested a review from ogenev April 13, 2023 04:26
@perama-v
Contributor

Tested, looks good. Three minor notes:

  1. One note is that the content is returned twice in the response. Perhaps the second received_content_from_node field could be skipped (#[serde(skip_serializing)])?

      {
        "jsonrpc": "2.0",
        "result": {
          "content": "0x", // here
          "trace": {
            "received_content_from_node": null, // here
            "origin":

  2. Another note is that the timestamp that the request was initialised is returned back to the caller. Perhaps this could also be skipped as the caller will always know when they started this request.

            "started_at": { // ?omit
              "secs_since_epoch": <unix secs>,
              "nanos_since_epoch": <nanos>
            },

  3. The response differs from the (now outdated) response defined in book/src/developers/protocols/json_rpc.md.

@mrferris
Contributor Author

@perama-v

One note is that the content is returned twice in the response

The first is the content itself, whereas the second is the optional ID of the node that returned the content.

Perhaps this could also be skipped as the caller will always know when they started this request.

Good point, this was previously skipped, but with the way that QueryTrace is now defined, a default value is required to skip that field. The SystemTime type we're using doesn't implement Default, so I'm going to make a note to possibly remove this in the future when we decide whether to keep using relative timestamps or use absolute timestamps.
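For context on the relative versus absolute timestamp question, here is a small std-only sketch of both forms. The helper names relative_ms and since_epoch are hypothetical; the pair returned by since_epoch mirrors the secs_since_epoch and nanos_since_epoch fields that appear in the started_at output quoted earlier.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Relative form: milliseconds between the query start and a response.
fn relative_ms(started_at: SystemTime, responded_at: SystemTime) -> u64 {
    responded_at
        .duration_since(started_at)
        .map(|elapsed| elapsed.as_millis() as u64)
        // The system clock can step backwards; treat that as zero elapsed.
        .unwrap_or(0)
}

/// Absolute form: (seconds, nanoseconds) since the Unix epoch.
fn since_epoch(time: SystemTime) -> (u64, u32) {
    let elapsed = time.duration_since(UNIX_EPOCH).unwrap_or(Duration::ZERO);
    (elapsed.as_secs(), elapsed.subsec_nanos())
}
```

With relative timestamps the caller needs started_at (or its own record of when the request began) to place events on a wall clock; with absolute timestamps each entry stands alone.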

Contributor

@carver carver left a comment

Just did a quick scan. Seems like we're well into the territory of this being an improvement from where we were, with no major setbacks. I think Ognyan's ❌ is outdated and can be disregarded now. I say we address any last things, and I saw you put issues in for follow-up work, so it's good to go. I'm excited to see the results in glados!

// Re-exports jsonrpsee crate
pub use jsonrpsee;

pub use trin_types::discv5::*;
Contributor

I'm personally not a fan of the flattening of this trin_types module structure, but that seems to already be the standard in this module, so it seems like the right choice, incrementally. Maybe it's a standard we can talk about in person at the upcoming summit.

Comment on lines +31 to +55
let uniq_content_key =
"\"0x0015b11b918355b1ef9c5db810302ebad0bf2544255b530cdce90674d5887bb286\"";
let history_content_key: HistoryContentKey = serde_json::from_str(uniq_content_key).unwrap();
Contributor

This feels like it has extra steps that I don't understand. Why go into a json-encoded string and back out?

I see some other places that do this same concept like this, like:

    let content_key: HistoryContentKey =
        serde_json::from_value(json!(HISTORY_CONTENT_KEY)).unwrap();

which is a little better. But still, it seems like going in and out of json doesn't really do anything for us.

Something equivalent to this seems like the direct path that doesn't involve unnecessary json:

let key_bytes = "0015b11b918355b1ef9c5db810302ebad0bf2544255b530cdce90674d5887bb286";
let history_content_key = HistoryContentKey::from_bytes(hex::decode(key_bytes));
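To illustrate the "direct path" idea without depending on any crate, the sketch below decodes the hex string by hand. decode_hex and nibble are hypothetical helpers written for this example; in the real test one would more likely use the hex crate and whatever constructor HistoryContentKey actually provides, neither of which is assumed here.

```rust
/// Convert one ASCII hex digit into its 4-bit value.
fn nibble(byte: u8) -> Result<u8, String> {
    (byte as char)
        .to_digit(16)
        .map(|digit| digit as u8)
        .ok_or_else(|| format!("invalid hex digit: {:?}", byte as char))
}

/// Decode a hex string (optionally "0x"-prefixed) into raw bytes.
fn decode_hex(input: &str) -> Result<Vec<u8>, String> {
    let hex = input.strip_prefix("0x").unwrap_or(input);
    if hex.len() % 2 != 0 {
        return Err(format!("odd number of hex digits: {}", hex.len()));
    }
    let mut bytes = Vec::with_capacity(hex.len() / 2);
    // Length is even, so every chunk holds exactly two digits.
    for pair in hex.as_bytes().chunks(2) {
        bytes.push(nibble(pair[0])? << 4 | nibble(pair[1])?);
    }
    Ok(bytes)
}
```

Applied to the key used in the test, this produces 33 bytes: a leading 0x00 byte followed by 32 more. Note that a decoder like this returns a Result, so the caller has to handle or unwrap the error before passing the bytes on.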

callback: Option<oneshot::Sender<FindContentResult>>,
is_trace: bool,
) -> Option<QueryId> {
info!("Starting query for content key: {}", target);
Contributor

We might want to reduce this to debug or drop it altogether, if we notice it being a little too noisy, but I'm pretty happy to let that evolve over time, and be just a little generous with logs, especially with new features.

Comment on lines +198 to +215
let mut trace = QueryTrace::new(
&self.network.overlay.local_enr(),
NodeId::new(&content_key.content_id()).into(),
);
trace.node_responded_with_content(&local_enr);
(Some(val), if is_trace { Some(trace) } else { None })
Contributor

Seems like a bummer to do all the trace work every time, even when it wasn't requested. Maybe something like:

Suggested change
- let mut trace = QueryTrace::new(
-     &self.network.overlay.local_enr(),
-     NodeId::new(&content_key.content_id()).into(),
- );
- trace.node_responded_with_content(&local_enr);
- (Some(val), if is_trace { Some(trace) } else { None })
+ let trace_option = if is_trace {
+     let mut trace = QueryTrace::new(
+         &self.network.overlay.local_enr(),
+         NodeId::new(&content_key.content_id()).into(),
+     );
+     trace.node_responded_with_content(&local_enr);
+     Some(trace)
+ } else { None };
+ (Some(val), trace_option)
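The same "only build the trace when asked" idea can also be written with bool::then from the standard library. The sketch below is self-contained and deliberately simplified: the QueryTrace struct here is a two-field stand-in (the field names origin and received_content_from_node are borrowed from the JSON output discussed in this thread), and local_lookup is a hypothetical function, not the one in trin.

```rust
/// Simplified stand-in for the real trace type.
#[derive(Debug, PartialEq)]
struct QueryTrace {
    origin: String,
    received_content_from_node: Option<String>,
}

/// Return locally stored content, plus a trace only if one was requested.
fn local_lookup(
    local_enr: &str,
    content: Vec<u8>,
    is_trace: bool,
) -> (Option<Vec<u8>>, Option<QueryTrace>) {
    // `bool::then` runs the closure only when `is_trace` is true,
    // so no trace is allocated for ordinary queries.
    let trace = is_trace.then(|| QueryTrace {
        origin: local_enr.to_string(),
        received_content_from_node: Some(local_enr.to_string()),
    });
    (Some(content), trace)
}
```

Whether the closure form reads better than the explicit if/else is a matter of taste; the behavior is the same.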

Adds response timestamps to tracing output

Adds comments

Adds timestamp for content found event

Adds ENRs and distinction between no response and no progress

Passes ENRs by reference

Adds unit test

Update peertest to parse trace output

Add release notes

Small cleanup of query_tracer.rs

Add node metadata to trace

De-duplicate ENRs and rename to QueryTrace

Refactors node_responded_with to take a vec of all ENRs

Adds test of node_metadata values

Update ms->millis

Update jsonrpc types & test

Do not create or manage a QueryTrace for queries which don't require one

Define and test ethportal-api QueryTrace type

Use NodeId within trin_core::QueryTrace instead of String
5 participants