Thursday, March 29, 2012

Jackson 2.0: CSV-compatible as well

(note: for general information on Jackson 2.0.0, see the previous article, "Jackson 2.0.0 released"; or, for XML support, see "Not just for JSON any more -- also in XML")

Now that I have talked about XML, it is good to follow up with another commonly used, if somewhat humble, data format: Comma-Separated Values ("CSV" for friends and foes).

As you may have guessed... Jackson 2.0 supports CSV as well, via the jackson-dataformat-csv project, hosted at GitHub.

For attention-span-challenged individuals, check out the Project Page: it contains a tutorial that can get you started right away.
For others, let's have a slight detour talking through design, so that additional components involved make some sense.

1. In the beginning there was a prototype

After completing Jackson 1.8, I got to one of my wishlist projects: that of being able to process CSV using Jackson. The reason for this is simple: while simplistic and under-specified, CSV is very commonly used for exchanging tabular datasets.
In fact, it (in variant forms, "pipe-delimited", "tab-delimited" etc) may well be the most widely used data format for things like Map/Reduce (Hadoop) jobs, analytics processing pipelines, and all kinds of scripting systems running on Unix.

2. Problem: not "self-describing"

One immediate challenge is the lack of information on the meaning of data, beyond the basic division into rows and columns. Compared to JSON, for example, one necessarily knows neither which "property" a value is for, nor the actual expected type of the value. All you might know is that row 6 has 12 values, expressed as Strings that look vaguely like numbers or booleans.

But then again, sometimes you do have name mapping as the first row of the document: if so, it represents column names. You still don't have datatype declarations but at least it is a start.

Ideally any library that supports CSV reading and writing should support different commonly used variations; from optional header line (mentioned above) to different separators (while name implies just comma, other characters are commonly used, such as tabs and pipe symbol) and possibly quoting/escaping mechanisms (some variants allow backslash escaping).
And finally, it would be nice to expose both "raw" sequence and high-level data-binding to/from POJOs, similar to how Jackson works with JSON.

3. So expose basic "Schema" abstraction

To unify different ways of defining mapping between property names and columns, Jackson now supports the general concept of a Schema. While the interface itself is little more than a tag interface (to make it possible to pass an opaque type-specific Schema instance through factories), data-format specific subtypes can and do extend functionality as appropriate.

In case of CSV, Schema (use of which is optional -- more on "raw" access later on) defines:

  1. Names of columns, in order -- this is mandatory
  2. Scalar datatypes columns have: these are coarse types, and this information is optional

Note that the reason that type information is strictly optional is that when it is missing, all data is exposed as Strings; and Jackson databinding has extensive set of standard coercions, meaning that things like numbers are conveniently converted as necessary. Specifying type information, then, can help in validating contents and possibly improving performance.
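To make the role of the schema a bit more concrete, here is a minimal JDK-only sketch of the idea. It is not how Jackson implements things: the class `SimpleSchema`, its `ColumnType` enum and the `bindRow` method are all hypothetical names made up for illustration, and the line splitting is deliberately naive (no quoting or escaping). It only shows what column names plus optional coarse types buy you: a row of raw Strings becomes a name-to-value mapping, and only typed columns get coerced.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: NOT Jackson's CsvSchema.
class SimpleSchema {
    enum ColumnType { STRING, NUMBER, BOOLEAN }

    private final List<String> names = new ArrayList<>();
    private final List<ColumnType> types = new ArrayList<>();

    // Column without type information: values are exposed as Strings
    SimpleSchema addColumn(String name) {
        return addColumn(name, ColumnType.STRING);
    }

    SimpleSchema addColumn(String name, ColumnType type) {
        names.add(name);
        types.add(type);
        return this;
    }

    // Binds one naive comma-separated line (no quoting/escaping) to column names
    Map<String, Object> bindRow(String line) {
        String[] cells = line.split(",", -1);
        Map<String, Object> row = new LinkedHashMap<>();
        // fewer values than columns is fine; surplus values are ignored in this sketch
        for (int i = 0; i < cells.length && i < names.size(); i++) {
            row.put(names.get(i), coerce(cells[i], types.get(i)));
        }
        return row;
    }

    private static Object coerce(String raw, ColumnType type) {
        switch (type) {
            case NUMBER:  return Long.valueOf(raw.trim());
            case BOOLEAN: return Boolean.valueOf(raw.trim());
            default:      return raw;
        }
    }
}
```

With columns "name" (untyped), "age" (NUMBER) and "verified" (BOOLEAN), binding the line `Bob,37,true` would give a String, a Long and a Boolean, keyed by column name.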

4. Constructing "CSV Schema" objects

How does one get access to these Schema objects? Two ways: build manually, or construct from a type (Class).

Let's start with the latter, using the same POJO type as with the earlier XML example:


  public enum Gender { MALE, FEMALE };
  // Note: MUST ensure a stable ordering; either alphabetic, or explicit
  // (JDK does not guarantee order of properties)
  @JsonPropertyOrder({ "name", "gender", "verified", "image" })
  public class User {
    public Gender gender;
    public String name;
    public boolean verified;
    public byte[] image;
  }

  // note: we could use std ObjectMapper; but CsvMapper has convenience methods
  CsvMapper mapper = new CsvMapper();
  CsvSchema schema = mapper.schemaFor(User.class);

or, if we wanted to do this manually, we would do (omitting types, for now):


  CsvSchema schema = CsvSchema.builder()
    .addColumn("name")
    .addColumn("gender")
    .addColumn("verified")
    .addColumn("image")
    .build();

And there is, in fact, a third source: reading it from the header line. I will leave that as an exercise for readers (check the project home page).

Usage is identical, regardless of the source. Schemas can be used for both reading and writing; for writing they are only mandatory if output of the header line is requested.

5. And databinding we go!

Let's consider the case of reading CSV data from a file called "Users.csv", entry by entry. Further, we assume there is no header row to use or skip (if there is, the first entry would be bound from that -- there is no way for the parser to auto-detect a header row, since its structure is no different from the rest of the data).

One way to do this would be:


  MappingIterator<User> it = mapper
    .reader(User.class)
    .with(schema)
    .readValues(new File("Users.csv"));
  List<User> users = new ArrayList<User>();
  while (it.hasNextValue()) {
    User user = it.nextValue();
    // do something?
    users.add(user);
  }
  // done! (file gets closed when we hit the end etc)

Assuming we wanted instead to write CSV, we would use something like this. Note that here we DO want to add the explicit header line for fun:


  // let's force use of Unix linefeeds, and request the header line:
  ObjectWriter writer = mapper
    .writer(schema.withUseHeader(true).withLineSeparator("\n"));
  writer.writeValue(new File("ModifiedUsers.csv"), users);

One feature that we took advantage of here is that the CSV generator basically ignores any and all array markers, meaning that there is no difference whether we try writing an array, a List or just a basic sequence of objects.

6. Data-binding (POJOs) vs "Raw" access

Although full data binding is convenient, sometimes we might just want to deal with a sequence of arrays with String values. You can think of this as an alternative to "JSON Tree Model"; an untyped primitive but very flexible data structure.

All you really have to do is to omit definition of the schema (which will then change the observed token sequence), and make sure not to enable handling of the header line.
For this, code to use (for reading) looks something like:


  CsvMapper mapper = new CsvMapper();
  MappingIterator<Object[]> it = mapper
    .reader(Object[].class)
    .readValues("1,null\nfoobar\n7,true\n");
  Object[] data = it.nextValue();
  assertEquals(2, data.length);
  // since we have no schema, everything exposed as Strings, really
  assertEquals("1", data[0]);
  assertEquals("null", data[1]);

Finally, note that use of raw entries is the only way to deal with data that has an arbitrary number of columns (unless you just want to add a maximum number of bogus columns -- it is ok to have fewer values than columns).

7. Sequences vs Arrays

One potential inconvenience with access is that by default CSV is exposed as a sequence of "JSON" Objects. This works if you want to read entries one by one.

But you can also configure parser to expose data as an Array of Objects, to make it convenient to read all the data as a Java array or Collection (as mentioned earlier, this is NOT required when writing data, as array markers have no effect on generation).

I will not go into details, beyond pointing out that the configuration to enable the additional "virtual array wrapper" is:


  mapper.enable(CsvParser.Feature.WRAP_AS_ARRAY);

and after this you can bind entries as if they came in as an array: both "raw" ones (Object[][]) and typed (List<User> and so on).

8. Limitations

Compared to JSON, CSV is a more limited data format. So does this limit usage of the Jackson CSV reader?

Yes. The main limitation is that column values need to essentially be scalar values (strings, numbers, booleans). If you do need more structured types, you will need to work around this, usually by adding custom serializers and deserializers: these can then convert structured types into scalar values and back. However, if you end up doing lots of this kind of work, you may consider whether CSV is the right format for you.
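As a rough idea of what such a work-around boils down to, here is a small JDK-only sketch. The names (`ScalarCodecSketch`, `Point`, `toCell`, `fromCell`) and the "x:y" encoding are assumptions made up for this example, not anything Jackson defines; in a real setup the two methods would live inside a custom serializer and deserializer.

```java
// Hypothetical illustration: squeezing a structured value into one scalar CSV cell and back.
class ScalarCodecSketch {
    static class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // "Serializer" side: structured value -> single String cell
    static String toCell(Point p) {
        return p.x + ":" + p.y;
    }

    // "Deserializer" side: single String cell -> structured value
    static Point fromCell(String cell) {
        String[] parts = cell.split(":");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected 'x:y', got: " + cell);
        }
        return new Point(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]));
    }
}
```

The encoding has to be chosen so that it cannot collide with the column separator, which is one more reason to reconsider the format if many types need this treatment.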

9. Test Drive!

As with all the other JSON alternatives, CSV extension is really looking forward to more users! Let us know how things work.

Tuesday, March 27, 2012

Jackson 2.0: now with XML, too!

(note: for general information on Jackson 2.0.0, see the previous article, "Jackson 2.0.0 released")

While Jackson is most well-known as a JSON processor, its data-binding functionality is not tied to JSON format.
Because of this, there have been developments to extend support for XML and related things with Jackson; and in fact support for using JAXB (Java Architecture for XML Binding) annotations has been included as an optional add-on since earliest official Jackson versions.

But Jackson 2.0.0 significantly increases the scope of XML-related functionality.

1. Improvements to JAXB annotation support

Optional support for using JAXB annotations (package 'javax.xml.bind' in JDK) became its own Github project with 2.0.

Functionality is provided by com.fasterxml.jackson.databind.AnnotationIntrospector implementation 'com.fasterxml.jackson.module.jaxb.JaxbAnnotationIntrospector', which can be used in addition to (or instead of) the standard 'com.fasterxml.jackson.databind.introspect.JacksonAnnotationIntrospector'.

But beyond becoming main-level project of its own, 2.0 adds to already extensive support for JAXB annotations by:

  • Making @XmlJavaTypeAdapter work for Lists and Maps
  • Adding support for @XmlID and @XmlIDREF -- this was possible due to addition of Object Identity feature in core Jackson databind -- which basically means that Object Graphs (even cyclic ones) can be supported even if only using JAXB annotations.

The second feature (@XmlID, @XmlIDREF) has been the number one request for JAXB annotation support, and we are happy that it now works.
Canonical example of using this feature would be:


  @XmlAccessorType(XmlAccessType.FIELD)
  public class Employee
  {
    @XmlAttribute @XmlID
    protected String id;

    @XmlAttribute
    protected String name;

    @XmlIDREF
    protected Employee manager;

    @XmlElement(name="report") @XmlIDREF
    protected List<Employee> reports;

    public Employee() {
      reports = new ArrayList<Employee>();
    }
  }

where entries would be serialized such that the first reference to an Employee is serialized fully, and later references use value of 'id' field; conversely, when reading XML back, references get re-created using id values.

2. XML databinding

Support for JAXB annotations may be useful when there is need to provide both JSON and XML representations of data. But to actually produce XML, you need to use something like JAXB or XStream.

Or do you?

One of the experimental new projects that the Jackson project started a while ago was something called "jackson-xml-databind".
After being developed for a while along with Jackson 1.8 and 1.9, it eventually morphed into project "jackson-dataformat-xml", hosted at Github.

With 2.0.0 we have further improved functionality, added tests; and also worked with developers who have actually used this for production systems.
This means that the module is now considered fully supported and no longer an experimental add-on.

So let's have a look at how to use XML databinding.

The very first thing is to create the mapper object. Here we must use a specific sub-class, XmlMapper:

  XmlMapper xmlMapper = new XmlMapper();
// internally will use an XmlFactory for parsers, generators

(note: this step differs from some other data formats, like Smile, which only require use of a custom JsonFactory sub-class, and can work with the default ObjectMapper -- XML is a bit trickier to support and thus we need to override some aspects of ObjectMapper)

With a mapper at hand, we can do serialization like so:


  public enum Gender { MALE, FEMALE };
  public class User {
    public Gender gender;
    public String name;
    public boolean verified;
    public byte[] image;
  }

  User user = new User(); // and configure
  String xml = xmlMapper.writeValueAsString(user);

and get XML like:

  <User>
    <gender>MALE</gender>
    <name>Bob</name>
    <verified>true</verified>
    <image>BARWJRRWRIWRKF01FK=</image>
  </User>

which we could read back as a POJO:

  User userResult = xmlMapper.readValue(xml, User.class);

But beyond basics, we can obviously use annotations for customizing some aspects, like element/attribute distinction, use of namespaces:


  @JacksonXmlRootElement(localName="custUser")
  public class CustomUser {
    @JacksonXmlProperty(namespace="http://test") public Gender gender;
    @JacksonXmlProperty(localName="myName") public String name;
    @JacksonXmlProperty(isAttribute=true) public boolean verified;
    public byte[] image;
  }

  // gives XML like:
  <custUser verified="true">
    <ns:gender xmlns:ns="http://test">MALE</ns:gender>
    <myName>Bob</myName>
    <image>BARWJRRWRIWRKF01FK=</image>
  </custUser>

Apart from this, all standard Jackson databinding features should work: polymorphic type handling, object identity for full object graphs (new with 2.0); even value conversions and base64 encoding!

3. Jackson-based XML serialization for JAX-RS ("move over JAXB!")

So far so good: we can produce and consume XML using powerful Jackson databinding. But the latest platform-level improvement in Java land is the use of JAX-RS implementations like Jersey. Wouldn't it be nice to make Jersey use Jackson for both JSON and XML? That would remove one previously necessary add-on library (like JAXB).

We think so too, which is why we created "jackson-jaxrs-xml-provider" project, which is the sibling of existing "jackson-jaxrs-json-provider" project.
As with the older JSON provider, by registering this provider you will get automatic data-binding to and from XML, using Jackson XML data handler explained in the previous section.

It is of course worth noting that Jersey (and RESTeasy, CXF) already provide XML databinding using other libraries (usually JAXB), so use of this provider is optional.
So why advocate use of the Jackson-based variant? One benefit is good performance -- a bit better than JAXB, and much faster than XStream, as per jvm-serializer benchmark (performance is limited by the underlying XML Stax processor -- but Aalto is wicked fast, not much slower than Jackson).
But more important is simplification of configuration and code: it is all Jackson, so annotations can be shared, and all data-binding power can be used for both representations.

It is most likely that you find this provider useful if the focus has been on producing/consuming JSON, and XML is being added as a secondary addition. If so, this extension is a natural fit.

4. Caveat Emptor

4.1 Asymmetric: "POJO first"

It is worth noting that the main supported use case is that of starting with Java Objects, serializing them as XML, and reading such serialization back as Objects.
And the explicit goal is that ideally all POJOs that can be serialized as JSON should also be serializable (and deserializable back into same Objects) as XML.

But there is no guarantee that any given XML can be mapped to a Java Object: some can be, but not all.

This is mostly due to complexity of XML, and its inherent incompatibility with Object models ("Object/XML impedance mismatch"): for example, there is no counterpart to XML mixed content in Object world. Arbitrary sequences of XML elements are not necessarily supported; and in some cases explicit nesting must be used (as is the case with Lists, arrays).

This means that if you do start with XML, you need to be prepared for possibility that some changes are needed to format, or you need additional steps for deserialization to clean up or transform structures.

4.2 No XML Schema support, mixed content

Jackson XML functionality specifically has zero support for XML Schema. Although we may work in this area, and perhaps help in using XML Schemas for some tasks, your best bet currently is to use tools like XJC from JAXB project: it can generate POJOs from XML Schema.

Mixed content is also out of scope, explicitly. There is no natural representation for it; and it seems pointless to try to fall back to XML-specific representations (like DOM trees). If you need support for "XMLisms", you need to look for XML-centric tools.

4.3 Some root values problematic: Map, List

Although we try to support all Java Object types, there are some unresolved issues with "root values", values that are not referenced via POJO properties but are the starting point of serialization/deserialization. Maps are especially tricky, and we recommend that when using Maps and Lists, you use a wrapper root object, which then references Map(s) and/or List(s).

(it is worth noting that JAXB, too, has issues with Map handling in general: XML and Maps do not mesh particularly well, unlike JSON and Maps).

4.4 JsonNode not as useful as with JSON

Finally, Jackson Tree Model, as expressed by JsonNodes, does not necessarily work well with XML either. Problem here is partially general challenges of dealing with Maps (see above); but there is the additional problem that whereas POJO-based data binder can hide some of work-arounds, this is not the case with JsonNode.

So: you can deserialize all kinds of XML as JsonNodes; and you can serialize all kinds of JsonNodes as XML, but round-tripping might not work. If tree model is your thing, you may be better off using XML-specific tree models such as XOM, DOM4J, JDOM or plain old DOM.

5. Come and help us make it Even Better!

At this point we believe that Jackson provides a nice alternative for existing XML producing/consuming toolkits. But what will really make it the first-class package is Your Help -- with increased usage we can improve quality and further extend usability, ergonomics and design.

So if you are at all interested in dealing with XML, consider trying out Jackson XML functionality!

Monday, March 26, 2012

Jackson 2.0.0 released: going GitHub, handling cyclic graphs, builder style...

After furious weeks of coding and testing, the first major version upgrade of Jackson is here: 2.0 was just released, and is available (from the Download page, for example).

1. Major Upgrade?

So how does this upgrade differ from "minor" upgrades (like, from 1.8 to 1.9)? Difference is not based on amount of new functionality introduced -- most Jackson 'minor' releases have contained as much new stuff as major releases of other projects -- although 2.0 does indeed pack up lots of goodies.

Rather, major version bump indicates that code that uses Jackson 1.x is neither backwards nor forwards compatible with Jackson 2.0.
That is, you can not just replace 1.9 jars with 2.0 and hope that things work. They will not.

Why not? 2.0 code differs from 1.x with respect to packaging, such that:

  1. Java package used is "com.fasterxml.jackson" (instead of "org.codehaus.jackson")
  2. Maven group ids begin with "com.fasterxml.jackson" (instead of "org.codehaus.jackson")
  3. Maven artifact ids have changed a bit too (core has been split into "core" and "annotations", for example)

These are actually not big changes in and of themselves: you just need to change Maven dependencies, and for Java package, change import statements. While some amount of work, these are mechanical changes. But it does mean that the upgrade is not a basic plug-n-play operation.

In addition, some classes have moved within package hierarchy, to better align functional areas. Some have been refactored or carved (most notably, SerializationConfig.Feature is now simply SerializationFeature, and DeserializationConfig.Feature is now DeserializationFeature). Most cases of types moving should be easy to solve with IDEs, but we will also try to collect some sort of upgrade guide.

For more details on packaging changes, check out "Jackson 2.0 release notes" page.

1.1 Why changes to package names?

The reason for choosing to move to new packages is to allow both Jackson 1.x and Jackson 2.x versions to be used concurrently. While smaller projects will find it easier to just convert wholesale, many bigger systems and (especially) frameworks will find ability to do incremental upgrades useful. Without repackaging one would have to upgrade in "all-or-nothing" way. But with repackaging this can be avoided, and existing functionality converted gradually (within some limits; transitive dependencies may still be problematic).

2. But wait! It is totally worth it!

I started with the "bad news" first, to get that out of the way, since there is lots to like about the new version.
I will write more detailed articles on specific features later on, but let's start with a brief overview.

2.1 Community improvements: Better collaboration with GitHub

First big change is that the Jackson project as a whole has moved to GitHub. While many extension projects (modules) had already started there, now all core components have moved as well, as have the standard extension components, and many, many more (total project count is 17!)

This should help make it much easier to contribute to projects; as well as make it easier for packages to evolve at appropriate pace: there is less need to synchronize "big" releases outside of 3 core packages, and it is much easier to give scoped access to new contributors.

2.2 Feature: Handle Any Object Graphs, even Cyclic ones!

One of the biggest use cases unsupported so far has been the ability to handle serialization and deserialization of cyclic graphs, and elimination of duplicates due to shared references. Although the existing @JsonManagedReference annotation works for some cases (esp. many ORM-induced parent/child cases), there has been no general solution.

But now there is. Jackson 2.0 adds support for a concept called "Object Identity": the ability to serialize an Object Id for values and use this id for secondary references, and the ability to resolve these references when deserializing. This feature has many similarities to "Polymorphic Type information" handling, which was introduced in Jackson 1.5.

Although full explanation of how things work deserves its own article, the basic idea is simple: you will need to annotate classes with new annotation @JsonIdentityInfo (or, use it for properties that reference type for which to add support), similar to how @JsonTypeInfo is used for including type id:


  @JsonIdentityInfo(generator=ObjectIdGenerators.IntSequenceGenerator.class, property="@id")
  public class Identifiable {
    public int value;

    public Identifiable next;
  }

and with such definition, you could serialize following cyclic two-node graph:


  Identifiable ob1 = new Identifiable();
  ob1.value = 13;
  Identifiable ob2 = new Identifiable();
  ob2.value = 42;
  // link as a cycle:
  ob1.next = ob2;
  ob2.next = ob1;
  // and serialize!
  String json = objectMapper.writeValueAsString(ob1);

to get JSON like:

  {
   "@id" : 1,
   "value" : 13,
   "next" : {
    "@id" : 2,
    "value" : 42,
    "next" : 1
   }
  }

and obviously deserialize it back with:

  Identifiable result = objectMapper.readValue(json, Identifiable.class);

  assertSame(result.next.next, result);

Most details (such as the id generation algorithm used, the property used for inclusion etc) are configurable; more on this in a later article.
Until then, Javadocs should help.
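For those who want to see the basic idea without any library involved, here is a minimal JDK-only sketch. It is not how Jackson implements Object Identity: the class `IdentitySketch`, its `Node` type and the `write` method are hypothetical, and the JSON is built by hand. It only demonstrates the principle described above: the first visit to an object writes it out fully with an id from a simple integer sequence, and any later reference writes just that id, which is what stops the recursion on cycles.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical illustration of the Object Identity idea; NOT how Jackson implements it.
class IdentitySketch {
    static class Node {
        int value;
        Node next;
        Node(int value) { this.value = value; }
    }

    // First visit emits full content with an "@id"; later visits emit only the id
    static String write(Node root) {
        StringBuilder sb = new StringBuilder();
        write(root, new IdentityHashMap<Node, Integer>(), sb);
        return sb.toString();
    }

    private static void write(Node node, Map<Node, Integer> seen, StringBuilder sb) {
        if (node == null) {
            sb.append("null");
            return;
        }
        Integer id = seen.get(node);
        if (id != null) { // already serialized: write the reference instead of recursing
            sb.append(id);
            return;
        }
        id = seen.size() + 1; // simple int sequence: 1, 2, 3, ...
        seen.put(node, id);
        sb.append("{\"@id\":").append(id)
          .append(",\"value\":").append(node.value)
          .append(",\"next\":");
        write(node.next, seen, sb);
        sb.append('}');
    }
}
```

Note the use of IdentityHashMap: references are tracked by object identity (==), not by equals(), so two distinct but equal objects would still get separate ids.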

2.3 Feature: Support "Builder" style of POJO construction

Another highly-requested feature has been ability to support POJOs created using "Builder" style. This means that POJOs are created using a separate Builder object which has methods for changing property values; and a "build" method that will create actual immutable POJO instance. For example, considering following hypothetical Builder class:


 public class ValueBuilder {
  private int x, y;

  // can use @JsonCreator to use non-default ctor, inject values etc
  public ValueBuilder() { }

  // if name is "withXxx", works as is: otherwise use @JsonProperty("x") or @JsonSetter("x")!
  public ValueBuilder withX(int x) {
    this.x = x;
    return this; // or, construct new instance, return that
  }
  public ValueBuilder withY(int y) {
    this.y = y;
    return this;
  }
  public Value build() {
    return new Value(x, y);
  }
}

and value class it creates:

@JsonDeserialize(builder=ValueBuilder.class) // important!
public class Value {
  private final int x, y;
  protected Value(int x, int y) {
    this.x = x;
    this.y = y;
  }
}

we would just use it as expected, as long as annotations have been used as shown above:

  Value v = objectMapper.readValue(json, Value.class);

and it "just works".
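To give a feel for what has to happen behind the scenes, here is a rough JDK-only sketch using plain reflection. It is not Jackson's actual deserializer: `BuilderSketch`, `Point`, `PointBuilder` and `construct` are hypothetical names, and the sketch assumes int-valued properties only. It shows the general shape of builder-driven construction: create the builder, call the matching "withXxx" method for each incoming property, then call "build".

```java
import java.lang.reflect.Method;
import java.util.Map;

// Hypothetical illustration of builder-driven construction; NOT Jackson's actual deserializer.
class BuilderSketch {
    static class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static class PointBuilder {
        private int x, y;
        public PointBuilder withX(int x) { this.x = x; return this; }
        public PointBuilder withY(int y) { this.y = y; return this; }
        public Point build() { return new Point(x, y); }
    }

    // For each property "foo", call builder.withFoo(value); finally call build()
    static Object construct(Class<?> builderType, Map<String, Integer> props) {
        try {
            Object builder = builderType.getDeclaredConstructor().newInstance();
            for (Map.Entry<String, Integer> entry : props.entrySet()) {
                String name = entry.getKey();
                String methodName = "with"
                    + Character.toUpperCase(name.charAt(0)) + name.substring(1);
                Method m = builderType.getMethod(methodName, int.class);
                // builder methods may return 'this' or a new instance, so keep the result
                builder = m.invoke(builder, entry.getValue());
            }
            return builderType.getMethod("build").invoke(builder);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not drive builder " + builderType.getName(), e);
        }
    }
}
```

Keeping the return value of each "with" call is what makes both styles of builder work: ones that mutate and return 'this', and ones that return a fresh instance each time.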

2.4 Ergonomics: Simpler, more powerful configuration

Although ObjectMapper's immutable friends -- ObjectReader and ObjectWriter -- were introduced much earlier, 2.0 will give more firepower for both, making them in many ways superior to use of ObjectMapper. In fact, while you can still pass ObjectMappers and create ObjectReaders, ObjectWriters on the fly, it is recommended that you use the latter if possible.

So what was the problem solved? Basically, ObjectMapper is thread-safe if (and only if!) it is fully configured before its first use. This means that you can not (or, at least, are not supposed to) change its configuration once you have used it. To further complicate things, some configuration options would work even if used after first read or write, whereas others would not, or would only work in seemingly arbitrary cases (depending on what was cached).

On the other hand, ObjectReader and ObjectWriter are fully immutable and thus thread-safe, but would also allow creation of newly configured instances. But while this allowed handling of some cases -- such as that of using different JSON View for deserialization -- number of methods available for reconfiguration was limited.

Jackson 2.0 adds significant number of new fluent methods for ObjectReader and ObjectWriter to reconfigure things; and most notably, it is now possible to change serialization and deserialization features (SerializationFeature, DeserializationFeature, as noted earlier). So, to, say, serialize a value using "pretty printer" you could use:

  ObjectWriter writer = objectMapper.writer(); // there are also many other convenience versions...
  writer.withDefaultPrettyPrinter().writeValue(resultFile, value);

or to enable "root element" wrapping AND specifying alternative wrapper property name:

  String json = writer
    .with(SerializationFeature.WRAP_ROOT_VALUE)
    .withRootName("wrapper")
    .writeValueAsString(value);

Basically, anything that can work on per-call basis will now work through either ObjectReader (for deserialization) or ObjectWriter (for serialization).
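The reason this style is thread-safe is easiest to see in isolation. Below is a minimal JDK-only sketch of an immutable, fluently reconfigurable object. It is not Jackson's ObjectWriter: `WriterConfigSketch`, its `Feature` enum and accessor methods are hypothetical stand-ins. The point is just that every "with" method leaves the original untouched and returns a new instance.

```java
import java.util.EnumSet;

// Hypothetical illustration of the immutable "with" style; NOT Jackson's ObjectWriter.
final class WriterConfigSketch {
    enum Feature { INDENT_OUTPUT, WRAP_ROOT_VALUE }

    private final EnumSet<Feature> features; // never modified after construction
    private final String rootName;

    WriterConfigSketch() {
        this(EnumSet.noneOf(Feature.class), null);
    }

    private WriterConfigSketch(EnumSet<Feature> features, String rootName) {
        this.features = features;
        this.rootName = rootName;
    }

    // Never mutates 'this': returns a new, differently configured instance
    WriterConfigSketch with(Feature f) {
        EnumSet<Feature> copy = EnumSet.copyOf(features);
        copy.add(f);
        return new WriterConfigSketch(copy, rootName);
    }

    WriterConfigSketch withRootName(String name) {
        // safe to share 'features' since no instance ever modifies it
        return new WriterConfigSketch(features, name);
    }

    boolean isEnabled(Feature f) { return features.contains(f); }

    String rootName() { return rootName; }
}
```

A shared base instance can thus be handed to any number of threads, each deriving its own per-call variant without affecting the others.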

2.5 Feature parity: JSON Views for deserialization

One of the frustrations with Jackson 1.x has been that all filtering functionality has been limited to the serialization side. Not any more: it is now possible to use JSON Views for deserialization as well:

  Value v = mapper
   .reader(Value.class)
   .withView(MyView.class) 
   .readValue(json);

and if input happened to contain properties not included in the view, values would be ignored without setting matching POJO properties.

2.6 Custom annotations using Annotation Bundles

Another ergonomic feature is so-called "annotation bundles". Basically, by addition of meta-annotation @JacksonAnnotationsInside, it is now possible to specify that annotations from a given (custom) annotation should be introspected and used the same way as if they were directly included. So, for example, you could define the following annotation:

  @Retention(RetentionPolicy.RUNTIME)
  @JacksonAnnotationsInside
  @JsonInclude(Include.NON_NULL) // only include non-null properties
  @JsonPropertyOrder({ "id", "name" }) // ensure that 'id' and 'name' are always serialized before other properties
  private @interface StdAnnotations { }

and use it for POJO types as a short-hand:

  @StdAnnotations
  public class Pojo { ... }

instead of separately adding multiple annotations.
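The mechanism is not magic, and can be sketched with nothing but JDK reflection. The following is not Jackson's introspection code: `BundleSketch` and the annotations `@Bundle` (playing the role of the marker meta-annotation), `@Ordered` and `@NonNullOnly` are hypothetical, and the expansion only goes one level deep. It shows the basic trick: when an annotation is itself marked as a bundle, look at the annotations on the annotation type instead.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of annotation bundles; NOT Jackson's introspection code.
class BundleSketch {
    // Marker playing the role of the "annotations inside" meta-annotation
    @Retention(RetentionPolicy.RUNTIME)
    @interface Bundle { }

    @Retention(RetentionPolicy.RUNTIME)
    @interface Ordered { String[] value(); }

    @Retention(RetentionPolicy.RUNTIME)
    @interface NonNullOnly { }

    // Custom bundle: short-hand for both @Ordered and @NonNullOnly
    @Retention(RetentionPolicy.RUNTIME)
    @Bundle
    @Ordered({ "id", "name" })
    @NonNullOnly
    @interface StdAnnotations { }

    @StdAnnotations
    static class Pojo { }

    // Collects simple names of effective annotations, expanding bundles one level deep
    static List<String> effectiveAnnotations(Class<?> type) {
        List<String> result = new ArrayList<>();
        for (Annotation ann : type.getAnnotations()) {
            Class<? extends Annotation> annType = ann.annotationType();
            if (annType.isAnnotationPresent(Bundle.class)) {
                for (Annotation inner : annType.getAnnotations()) {
                    Class<? extends Annotation> innerType = inner.annotationType();
                    // skip the marker itself and JDK meta-annotations such as @Retention
                    if (innerType != Bundle.class
                            && !innerType.getName().startsWith("java.lang.annotation.")) {
                        result.add(innerType.getSimpleName());
                    }
                }
            } else {
                result.add(annType.getSimpleName());
            }
        }
        return result;
    }
}
```

For `Pojo` above, the effective annotations come out as `Ordered` and `NonNullOnly`, even though neither is written on the class itself.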

2.7 @JsonUnwrapped.prefix / suffix

One more cool new addition is for the @JsonUnwrapped annotation (introduced in 1.9). It is now possible to define a prefix and/or suffix to use for "unwrapped" properties, like so:

  public class Box {
    @JsonUnwrapped(prefix="topLeft.") Point tl;
    @JsonUnwrapped(prefix="bottomRight.") Point br;
  }
  public class Point {
    int x, y;
  }

which would result in JSON like:

  {
   "topLeft.x" : 0,
   "topLeft.y" : 0,
   "bottomRight.x" : 100,
   "bottomRight.y" : 80  
  }

This feature basically allows for scoping things to avoid naming collisions. It can also be used for fancier stuff, such as binding of 'flat' properties into hierarchic POJOs... but more on this in a follow-up article.
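As a quick illustration of the scoping effect, here is a small JDK-only sketch that works on plain Maps. It is not Jackson's implementation: `UnwrapSketch` and its methods are hypothetical, and it assumes that the prefix is simply prepended to each nested property name (which is why the "." separator is written as part of the prefix here).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of prefix-based unwrapping; NOT Jackson's implementation.
class UnwrapSketch {
    // Copies nested properties into 'target', prepending 'prefix' to each name
    static void unwrap(String prefix, Map<String, Object> nested, Map<String, Object> target) {
        for (Map.Entry<String, Object> e : nested.entrySet()) {
            target.put(prefix + e.getKey(), e.getValue());
        }
    }

    static Map<String, Object> point(int x, int y) {
        Map<String, Object> p = new LinkedHashMap<>();
        p.put("x", x);
        p.put("y", y);
        return p;
    }

    // Flattens two nested points into one level; prefixes keep "x" and "y" from colliding
    static Map<String, Object> flattenBox(Map<String, Object> topLeft,
            Map<String, Object> bottomRight) {
        Map<String, Object> flat = new LinkedHashMap<>();
        unwrap("topLeft.", topLeft, flat);
        unwrap("bottomRight.", bottomRight, flat);
        return flat;
    }
}
```

Without the prefixes, both points would try to write "x" and "y" into the same flat Map and the second would silently overwrite the first, which is exactly the collision this feature avoids.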

3.0 And that's most of it, Folks!

At least for now. Stay tuned!

EDIT:

Links to the continuing "Jackson 2.0 saga":


