allow for complex schema to refer to previously named schema #53
stew wants to merge 12 commits into flavray:master
Conversation
Instead of trying to resolve the references at parse time, we store references in the AST, and then use the SchemaParseContext not only for parsing but also for decoding. This allows us to support recursive type definitions. I believe this patch should also include some improvements in how we handle namespaces and aliases. Tests are still needed to prove aliases are working correctly. Tests are still needed for testing encoding.
I believe this allows for recursive schemata, which admit things such as linked lists and trees.
Force-pushed from 0ba3ab5 to f957411
flavray left a comment
First of all, thank you for taking a stab at this, this is massive work!
I was not able to go through all the changes yet, but I left a few comments already, on the overall design of the change that I feel are worth discussing.
The fact that avro allows recursive types personally feels like madness to me, but heh, it's in the spec so we have to consider adding support for it, so I am truly grateful that you are doing this. 🙂
```diff
 Value::Union(ref inner) if inner.as_ref() == &Value::Null => visitor.visit_none(),
 Value::Union(ref inner) => visitor.visit_some(&mut Deserializer::new(inner)),
-_ => Err(Error::custom("not a union")),
+ref inner => visitor.visit_some(&mut Deserializer::new(inner)),
```
```diff
 /// Decode a `Value` from avro format given its `Schema`.
-pub fn decode<R: Read>(schema: &Schema, reader: &mut R) -> Result<Value, Error> {
+pub fn decode<R: Read>(schema: &Arc<Schema>, reader: &mut R, context: &mut SchemaParseContext) -> Result<Value, Error> {
     match *schema {
```
In order to keep backwards compatibility, would it make sense to have a new function (e.g. `decode_with_context`) with this signature?
That way, we can keep a `decode` function with the previous signature that would just call `decode_with_context(schema, reader, <empty context>)`.
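A minimal sketch of the wrapper being suggested; all the types here are illustrative stand-ins for the avro-rs ones under discussion, not the real API:

```rust
use std::io::Read;
use std::sync::Arc;

#[derive(Clone)]
struct Schema;             // stand-in for avro_rs::Schema
#[derive(Debug)]
struct Value;              // stand-in for avro_rs::types::Value
#[derive(Default)]
struct SchemaParseContext; // stand-in for the PR's parse/decode context

// New context-aware entry point; the real implementation would walk the
// schema here, resolving named-type references through the context.
fn decode_with_context<R: Read>(
    _schema: &Arc<Schema>,
    _reader: &mut R,
    _context: &mut SchemaParseContext,
) -> Result<Value, String> {
    Ok(Value)
}

// Keeps the old `&Schema` signature, so existing callers are untouched:
// it simply delegates with a freshly built, empty context.
fn decode<R: Read>(schema: &Schema, reader: &mut R) -> Result<Value, String> {
    decode_with_context(&Arc::new(schema.clone()), reader, &mut SchemaParseContext::default())
}
```

This keeps the public surface stable while letting the internals thread the context through every call.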
```diff
     V: Visitor<'de>,
 {
     match *self.input {
+        Value::Null => visitor.visit_none(),
```
Not sure I understand why this is needed now. Would you mind giving some details? 🙂
The deserializer code is responsible for decoding data that exactly matches the Rust data type (all sorts of transformations can then be done via `Value::resolve`).
(Same for the `ref inner` 3 lines below.)
I'm not sure either; I tossed it in because it fixed a problem I was having deserializing an `Option`. If I can reproduce it, it deserves a unit test and should go into its own PR.
```diff
 /// Decode a `Value` from avro format given its `Schema`.
-pub fn decode<R: Read>(schema: &Schema, reader: &mut R) -> Result<Value, Error> {
+pub fn decode<R: Read>(schema: &Arc<Schema>, reader: &mut R, context: &mut SchemaParseContext) -> Result<Value, Error> {
     match *schema {
```
It seems unclear to me why the `&Schema` had to be changed into an `Arc<Schema>`; would you mind elaborating on this as well? 🙂
What I did was extend the previous PR attempting this, which made the transition from `&Schema` to `Rc` once we started temporarily storing the schemata in the `SchemaParseContext`, and then I changed `Rc` to `Arc` because of the new requirement for `Send` to be implemented here:
https://gh/stew/avro-rs/blob/feature%2Frefernce-types/src/schema.rs#L1234
This is an area where my lack of experience with Rust is going to be obvious, so I welcome any advice. It does seem to me that we should be able to go back to using references if we track the lifetimes all the way through, proving that the additional references we keep only last for the lifetime of the decode operation. That was a battle I wasn't ready to wage with the typechecker while getting all this proved out, but I can take another stab at it.
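For anyone following along, the `Send` constraint mentioned above can be shown in miniature (using `String` as a stand-in for `Schema`, unrelated to the avro-rs types): `Rc<T>` is never `Send`, so a value holding one cannot move to another thread, while `Arc<T>` is `Send` when `T: Send + Sync`.

```rust
use std::sync::Arc;
use std::thread;

// Computes the length on another thread. This compiles only because
// Arc<String> implements Send; swapping Arc for std::rc::Rc here is a
// compile error, which is the kind of bound that forced the Rc -> Arc change.
fn len_on_another_thread(schema: Arc<String>) -> usize {
    thread::spawn(move || schema.len()).join().unwrap()
}
```

Whether plain references with explicit lifetimes could replace the `Arc` depends on whether anything holding the schema must outlive the decode call or cross threads.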
```diff
         }
     },
+    Schema::TypeReference(ref name) => context.lookup_type(name, &context)
+        .map_or_else(|| Err(DecodeError::new("enum symbol not found").into()),
```
nit: maybe change `enum` to something else in this error message, since this arm handles type references rather than enum symbols?
```diff
 /// encoding for complex type values.
 pub fn encode(value: &Value, schema: &Schema, buffer: &mut Vec<u8>) {
-    encode_ref(&value, schema, buffer)
+    encode_ref_inner(&value, &Arc::new(schema.clone()), buffer, &mut SchemaParseContext::new())
```
Wouldn't cloning the schema every time a value is encoded end up being expensive?
We would have to run the benchmarks to see whether there is a noticeable difference, but it seems like extra work is being done in this function.
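One possible mitigation, sketched with illustrative stand-in types (not the real avro-rs signatures): when callers encode many values against one schema, the `Arc` and the clone it wraps can be built once per batch rather than once per value.

```rust
use std::sync::Arc;

#[derive(Clone)]
struct Schema; // stand-in for avro_rs::Schema
struct Value;  // stand-in for avro_rs::types::Value

// Inner encoder; real code would write the avro encoding of `value`.
fn encode_ref_inner(_value: &Value, _schema: &Arc<Schema>, buffer: &mut Vec<u8>) {
    buffer.push(0); // placeholder byte per value
}

// The schema is cloned and wrapped exactly once for the whole batch,
// instead of `Arc::new(schema.clone())` inside the per-value hot path.
fn encode_batch(values: &[Value], schema: &Schema, buffer: &mut Vec<u8>) {
    let schema = Arc::new(schema.clone());
    for value in values {
        encode_ref_inner(value, &schema, buffer);
    }
}
```

Benchmarks would still be needed to see whether the per-value clone actually shows up in profiles.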
```diff
 /// be valid with regards to the schema. Schema are needed only to guide the
 /// encoding for complex type values.
-pub fn encode_ref(value: &Value, schema: &Schema, buffer: &mut Vec<u8>) {
+pub(crate) fn encode_ref_inner(value: &Value, schema: &Arc<Schema>, buffer: &mut Vec<u8>, context: &mut SchemaParseContext) {
```
As for `decode`, would it make sense to keep `encode_ref` available somewhere, to avoid backwards-incompatible changes?
```diff
     return Ok(())
 },
-Err(e) => if let ErrorKind::UnexpectedEof = e.downcast::<::std::io::Error>()?.kind() {
+Err(e) => {
```
```diff
+/// A reference to a type defined in this schema
+TypeReference(NameRef),
 /// A `array` Avro schema.
 /// `Array` holds a counted reference (`Rc`) to the `Schema` of its items.
 /// `Map` holds a pointer to the `Schema` of its values, which must all be the same schema.
 /// `Map` keys are assumed to be `string`.
-Map(Box<Schema>),
+Map(Arc<Schema>),
```
I am also curious as to why `Array` and `Map` now need an `Arc` instead of a simple `Box`. Would you mind elaborating a bit? 🙂
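For context on this question, a toy illustration (not the real avro-rs `Schema`): once named types and a by-name lookup table exist, the same schema node can be reachable both from the schema tree and from the table. `Box<T>` expresses unique ownership, so the node would have to be duplicated; a counted pointer like `Arc<T>` lets both owners share one allocation.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Toy recursive schema type; the real enum has many more variants.
enum Schema {
    Long,
    Array(Arc<Schema>),
}

// The item schema is registered in the named-type table AND embedded in the
// Array node: two owners of one allocation, which Box cannot express.
fn build() -> (Schema, HashMap<String, Arc<Schema>>) {
    let items = Arc::new(Schema::Long);
    let mut named = HashMap::new();
    named.insert("item".to_string(), Arc::clone(&items));
    (Schema::Array(items), named)
}
```

Whether that sharing needs the atomic `Arc` rather than `Rc` comes back to the `Send` bound discussed above.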
Wow, thanks for all your feedback 😄 I'm going to try to address all the items you brought up; it might take me a day or two. I'm pretty new to Rust, so I'm likely making some simple mistakes; thanks for bearing with me. I also expect that I've negatively impacted performance, and I haven't yet looked at how badly. I wanted to get feedback on the overall approach before getting too hung up on details, and I've probably taken some shortcuts with negative performance impacts (like using `Arc`), which I'm happy to do my best to fix. Of course, the very first set of types I need to encode includes a tree, which is naturally a recursive data structure; that's where my motivation comes from. In trying to further my own project while waiting for feedback, I've tossed some bug fixes into this branch that don't necessarily belong in this PR. I can make an effort to separate those out and rebase them against master to simplify this PR, if that's wanted.
I'd be very interested in this as well. While the schema I'm using does not define recursive types, it does define some custom types (like a type for UUID values) that are defined at first use and afterwards only referenced by name in the schemata (which were generated from an IDL spec using avro-tools).
It's been a while. How's this pull request going? Do you need some help? How far is it from completion?
Closing in favor of #99; I'm thrilled that someone else is making an effort here (thanks @GregBowyer :) ). I apologize that the conditions that motivated me to get this working didn't last long :(