Skip to content

Schema resolution lacks named type checking and allows invalid coercions #383

@KitFieldhouse

Description

@KitFieldhouse

Hi all!

While working on issue #353, I have also bumped into some possible issues with how we currently perform writer to reader schema resolution.

There are two issues here:

  • Resolution coercions in this implementation that are not in the specification.
  • Not using type names in resolution.

Currently, in my reading of the code, writer to reader schema resolution is performed by first deserializing some buffer into a Value using the "writer" schema and then by calling resolve (or resolve_schemata) on this value with the reader schema. I think the fact that we are performing schema resolution on the value directly without reference to the original writer schema is central to both of these issues.

Resolution coercions not in the specification

Currently, we are allowed to resolve Value::Map against a Schema::Record and to resolve Value::Long against a Schema::Int. This last resolution rule causes buggy behavior, as seen by this test.

#[test]
fn test_long_to_int() -> TestResult {
    init();
    let writer_schema = r#"{
        "type": "record",
        "name": "ASchema",
        "fields": [
        {
            "name": "aNumber",
            "type": "long"
        }
        ]
    }"#;

    let reader_schema = r#"{
        "type": "record",
        "name": "ASchema",
        "fields": [
        {
            "name": "aNumber",
            "type": "int"
        }
        ]
    }"#;

    // value that was written with writer_schema
    let value = Value::Record(vec![(
            "aNumber".into(),
            Value::Long(300000000000)
    )
    ]);

    let buffer = TestBuffer(Rc::new(RefCell::new(Vec::new())), 0);

    let writer_schema = Schema::parse_str(writer_schema)?;

    let mut writer = Writer::new(&writer_schema, buffer.clone())?;

    let _ = writer.append(value);
    let _ = writer.flush();

    let reader_schema = Schema::parse_str(reader_schema)?;
    let mut reader = Reader::with_schema(&reader_schema,buffer)?;

    let read_value = reader.next();

    match read_value {
        Some(v) => {
            panic!("Reader and writer schema are incompatible, should not be able to deserialize. Deser value: {:?}", v)
        }
        None => {}
    }
    Ok(())
}

Currently, this test fails and the value of the "aNumber" field in the reader schema is a silently overflowed integer.

Not using type names in resolution

The specification's rules for schema resolution indicate that we first compare the names of named types, and if that succeeds, perform resolution recursively on that value. Since Value does not include names for named types (fixed, record, and enum), this check is not performed.

What should we do

The fix for the long->int conversion is just to remove this arm from the resolution logic we currently have. However, I think this issue is actually a little deeper.

Schema resolution as defined by the specification requires three things, a writer schema, a reader schema and a value written with the writer schema (the value is needed for resolving which union branch of the reader schema to take, if possible).

As of now, we are performing schema resolution on just the value itself. Since the value doesn't actually have schema level information like the name of named types, there are schema resolution checks (matching via unqualified name) that are impossible with the current structure of Value. Therefore, our resolution feels completely structural. However, my reading of the resolution rules from the specification indicate that the names of named types are important and are used, hence resolution is at least quasi-nominal.

Proposed solution

Reorient Value as being a flexible user-facing way of creating Avro values. Value should include the minimum amount of information to be verified against a provided schema but nothing more. That means:

  • Get rid of Value::Union. In my opinion, a union is a schema-level construct describing possible types; a concrete value inhabits a specific branch, not the union itself
  • Rename Value::Enum as Value::EnumVariant. Enum is the type, variant is the value.
  • Values of named types (Record, Fixed, Enum) should allow an optional name. If no name is provided, resolution is essentially the same structural resolution we have now. If a name is provided, we first perform resolution by name, and if that succeeds, perform structural resolution.

To fulfill the semantic role of a value of a given schema, we should create a new type: TypedValue that wraps both a Value and a schema and verifies that the provided Value is indeed a valid value that could have been written with the schema, performing coercion steps if needed (these coercion steps are free to be a superset of the resolution steps listed in the specification). In doing so, Value should also be transformed into a new type CompleteValue that includes information from the schema its wrapped with, like named type names and union branch numbers.

Schema resolution is then a transformation from TypedValue of one schema to a TypedValue of another by performing exactly the resolution steps listed in the specification.

To discuss

Since this has overlap with the work from #353, I was thinking of rolling these two up as one eventual PR. Let me know if the above proposal makes sense.

Go Avro!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions