Skip to content
Max Afonov edited this page Jan 5, 2011 · 2 revisions

Performance challenges

Most serialization frameworks treat object graphs as black boxes that are best explored from without. A common activity consists of walking the tree of objects and serializing anything that doesn't fit natively into the target format. The obvious drawback of this approach is that a ton of CPU time is wasted during each serialization run. The code usually has to:

  1. Obtain the run time type of each node in the graph. This is a crucial step that helps decide which strategy needs to be used in transforming values between native and serialized formats.

  2. If serialization is performed, data needs to be retrieved from the object, field by field, or injected back into the object, likewise field by field. This step can be the source of a particularly heavy performance hit, especially during deserialization. Well-designed classes usually encapsulate and protect their fields. The deserializer must either break encapsulation and set fields reflectively, or involve mutator methods. Either way, reflection is involved in fetching the list of fields and/or their accessors and mutators, and in retrieving or setting back the data.

  3. Recursively (de)serialize any other objects in the graph that may be accessible from the current node, depending on requirements. More run time type information needs to be collected, and the performance issues are compounded accordingly.

Fighting generic type erasure

The Java virtual machine is infamous for its type erasure. Even though Java and Scala (to a larger extent) both support generic types, the particular type arguments used in code are erased during compilation. Thus the following two expressions are of exactly the same type (List) at run time:

val ice_cream: List[Pleasure] = ...
val you_scream: List[Agony] = ...

Type erasure is a particular problem for serialization frameworks that operate on object graphs that involve generic types. Most real-world data models are bound to contain at least some generically typed objects, i.e. collections and other containers. Without these modern conveniences, IA work would not be the same. Still, they are a curse, and careful run time type introspection needs to be performed in order to figure out how to (de)serialize such values.

Shortcuts can be made, of course. For example, oftentimes collections contain only objects belonging to a single concrete class. If this information is somehow known at run time, the serializer's job becomes slightly easier. Alas, many collections are polymorphic; in other words, the type argument is a trait, interface, or abstract class. One obvious consequence is that this polymorphic type can't be used during deserialization to provide a concrete implementation of itself suitable for object instantiation.

Way out

There really isn't a fundamental logical flaw in the process described above. Salat doesn't reinvent the wheel in any way; the code is however optimized in ways that hopefully can help avoid common serialization pitfalls.

Ahead of time type inspection

Salat performs very little reflection on types at run time. It achieves this by instead parsing the [[pickled Scala signatures|http://lamp.epfl.ch/~dubochet/new_pickle.pdf]], which are present in all .class files produced by scalac version 2.8.0 and above. ScalaSigParser (part of the scalap tool included with the SDK) facilitates programmatic access to these signatures, which in turn contain all type information that was available to the compiler during its original run.

This is exactly what every serialization framework needs. High-fidelity type information, accessible using clean, pattern-matchable graphs of case classes. By collecting and memoizing this information before any (de)serialization is performed, Salat is able to avoid most runtime performance hits that plague other frameworks of this kind. Because the vast majority of type inspection is done ahead of time, the code is able to later make certain (admittedly safe) assumptions about object graphs that it encounters over time, and thus execution time is cut dramatically.

Memoized transformer chains

Since all type information is available ahead of time, Salat is able to assemble flexible chains of transformers: objects that know how to either serialize or deserialize values of only the types that are relevant to the particular kind of transformer in question. Chains of transformers are necessary because objects are often deeply nested. For example, a case class A may have a field of type Map[String, B], where the values are of type case class B, which in turn has a List[C], and so on. At the same time, other fields may be typed using other combinations of types. Most of this code can be reused and then chained to produce flexible chains or pipes that feed data to each other.

Again, all assembly of transformer chains happens ahead of time. Salat simply executes these chains on inputs that become available at run time, and thus saves precious time that would otherwise be wasted on type introspection.

Efficient data extraction

One important design limitation that had to be accepted (with due consideration) is lack of support for any classes except case classes. While [[case classes are many things at once|http://stackoverflow.com/search?q=%5Bscala%5D+what+is+case+class]], most relevant to Salat is the fact that case classes mix in [[the scala.Product trait|http://www.scala-lang.org/api/current/scala/Product.html]]. The most important facility provided by Product is productIterator. This method returns an iterator that allows the caller to iteratively access each of the elements, which together make up the product. For example, given the following case class and an instance thereof:

case class Person(name: String, age: Int, education: Option[String])
val joe = Person(name = "Joe S. Ixpack", age = 37, education = None)

The call to productIterator would return something like this:

scala> val elements = joe.productIterator.toList
elements: List[Any] = List(Joe S. Ixpack, 37, None)

It is clear that each of the fields can be extracted very easily. What's more important is that productIterator is implemented by the compiler in a very efficient manner. This method takes advantage of another scala.Product facility, namely productElement, which takes an integer positional value, uniquely identifying which of the fields needs to be returned. Of course, case classes also guarantee a stable order of fields. But the main advantage here is that neither productIterator nor productElement are implemented using any kind of reflection. This means that all of these methods lend themselves admirably to JIT compilation by the JVM.

Efficient object instantiation & injection of data

Each Scala case class comes bundled with a so-called companion object. This singleton object provides a convenience apply method, which takes arguments whose order corresponds to the class' field order, and instantiates an object containing these exact values in its fields. The obvious advantage provided by this code is the fact that a java.lang.reflect.Method handle on the companion's apply can be memoized ahead of time and invoked at run time, with arguments sourced from serialized data.

The astute reader has probably already guessed that these argument lists can't simply be fed blindly to companion's apply, because any such list may be incomplete. Which is why Salat makes use of...

Default case class field values

Case classes in Scala can be defined with default field values. For example, re-using the Person class from just above:

case class Person(name: String, age: Int, education: Option[String] = Some("negligible"))

Then, each invocation of Person$.apply need not provide a value for education. When missing, the default value of Some("negligible") will be used in place automatically.

It must be obvious by now that for each case class field numbered N, where N is between 0 and productArity, a corresponding apply$default$N zero-argument method is provided on the companion object. When invoked, this method will return whatever default value was specified in the case class definition.

While not necessarily helpful in increasing performance, default field values are a convenient Scala feature that saves programmer time. Proper utilization of this feature is thus merited here.

Type hints in DBObject-s

Salat employs a cheat of sorts in order to make deserialization of polymorphic types safer. Every DBObject instance produced by Salat includes a field named _typeHint. The value of this field is a fully-qualified class name, from whose fields the contents of this particular DBObject were taken. This allows Salat to save a great deal of guesswork when the time comes to deserialize members of a polymorphic collection that may include objects belonging to many different concrete types.

Clone this wiki locally