On proper performance testing of Java JSON processing
Lately I have spent little time writing or worrying about the performance of JSON processing on the Java platform. As has been repeatedly pointed out, JSON processing is typically NOT amongst the biggest bottlenecks, compared to aspects like database access or HTTP request handling overhead.
But lately there seems to have been a bit of a renaissance in writing simple Java JSON parsers, and typically these new projects also provide performance tests that seek to prove the superior performance of the offering. Alas, while writing performance tests is not exactly rocket surgery, there are many pitfalls that can trip up new performance engineers and testers, rendering many of the initial reports misleading at best. And while the results are usually corrected over time, based on feedback, first impressions tend to stick ("but wasn't XYZ the fastest thing ever?").
So I figured that since I often end up pointing out issues with these tests, I might as well summarize a list of typical problems that plague new performance benchmarks. Maybe this will help make the whole process more efficient. At the very least I can just send a URL pointing to this entry, as a sort of starting point or FAQ.
With that, here is a collection of common problems with Java JSON processing test suites. Many of these problems also apply to related test suites, from those testing other data formats (XML, binary data formats) to large-scale data processing (map/reduce and other "big data").
1. Missing JVM warmup
For developers using non-managed languages (C/C++, for example), testing is relatively simple: after binary code is loaded (and perhaps test data read from disk), the system is ready to be measured; there is little variance between test runs, and little need to repeat tests for a large number of iterations.
This is very different for Java, since the JVM is very much an adaptive system: program code is loaded as relatively abstract bytecode, which is then converted to native code (necessary for anything close to optimal speed), based on measurements the JVM itself makes, on the fly, to figure out which parts need to be optimized. In addition, garbage collection will have an impact on performance. To complicate things further, the standard class library (JDK classes) is large, so its initialization takes much more time than that of native libraries like libc for C. Trying to test Java libraries the same way as C libraries is a recipe for disaster.
What this means is that unless care is taken, measurements that do not account for initial startup overhead may well just be measuring the efficiency of the JVM at initializing itself, and to some degree the complexity of the library being tested (since the one-time overhead of a more complex, or just bigger, library is higher than that of the simplest libraries). But what is not being tested is the eventual steady-state performance of the library. And since this steady state is what actually matters most for server-side processes, the results will be irrelevant.
So unless you really just want to measure the startup time of an application or library, make sure to run test code for a non-trivial amount of time (multiple seconds, at minimum) before starting actual measurements. You should also run tests long enough to get stable measurements. Ideally this steady state would be statistically validated, which is one reason why performance test frameworks are very useful for writing performance tests: typically their authors have already solved many of the obvious issues.
Another slightly subtler issue is that the order in which code is loaded (and, over time, dynamically optimized) matters as well: often the test that is run first is best optimized by the JVM. This means that, optimally, different tests (tests for different libraries) should be run on separate JVMs, each "warmed up" by running its specific test case. This is something that most performance benchmarking frameworks can also help with (I am most familiar with Japex, which already runs separate tests on separate JVMs).
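To make the warmup idea concrete, here is a minimal sketch of a hand-rolled harness. The class and method names (`WarmupHarness`, `parseOnce`) are made up for illustration; the workload is a trivial byte-summing stand-in for whatever parser call you are actually benchmarking, and a real benchmark framework would do this (and the statistical validation) for you.

```java
import java.nio.charset.StandardCharsets;

public class WarmupHarness {
    // Stand-in workload; in a real benchmark this would invoke
    // the JSON parser under test.
    public static long parseOnce(byte[] doc) {
        long sum = 0;
        for (byte b : doc) sum += b;
        return sum;
    }

    public static void main(String[] args) {
        byte[] doc = "{\"value\":42}".getBytes(StandardCharsets.UTF_8);
        long sink = 0; // consume results so the JIT cannot eliminate the work

        // Warmup: run long enough for the JIT to compile the hot path
        // (shortened here; use multiple seconds, at minimum, in real runs)
        long warmupEnd = System.nanoTime() + 1_000_000_000L;
        while (System.nanoTime() < warmupEnd) {
            sink += parseOnce(doc);
        }

        // Measurement: repeat in timed batches; results should stabilize
        // across rounds before you trust them
        for (int round = 0; round < 5; round++) {
            long start = System.nanoTime();
            int reps = 100_000;
            for (int i = 0; i < reps; i++) {
                sink += parseOnce(doc);
            }
            double perCall = (System.nanoTime() - start) / (double) reps;
            System.out.printf("round %d: %.1f ns/call%n", round, perCall);
        }
        if (sink == 42) System.out.println(); // keep sink observable
    }
}
```

Note that even this sketch leaves the separate-JVM-per-library concern unsolved; that is exactly the kind of thing frameworks like Japex handle by forking a fresh JVM per test.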
2. Trivial payloads
Another common mistake is using trivial test data: something so small that it:
- Is unlikely to resemble data used by real production systems, and
- Is so light-weight to process that the test mostly measures how much per-invocation overhead the library has
In the case of JSON, for example, some tests use the tiniest of data snippets (a single String; an array with one integer element). Unless your actual use case revolves around such tiny data, you probably should not be testing such cases; or at least use a wide set of likely input data, to emphasize more common cases.
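As a sketch of the difference, compare a trivial payload with a generator for something closer to a realistic record set. The class name, field names, and values below are made up for illustration; the point is only that representative test data has enough structure and bulk that per-invocation overhead stops dominating the measurement.

```java
public class PayloadSize {
    // Trivial payload, like those seen in naive benchmarks:
    public static final String TRIVIAL = "[1]";

    // More representative payload: an array of records with several
    // fields each (names and values are invented for illustration)
    public static String realistic(int records) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < records; i++) {
            if (i > 0) sb.append(',');
            sb.append("{\"id\":").append(i)
              .append(",\"name\":\"user-").append(i)
              .append("\",\"active\":").append(i % 2 == 0)
              .append('}');
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        System.out.println(TRIVIAL.length());         // a few bytes
        System.out.println(realistic(1000).length()); // tens of kilobytes
    }
}
```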
3. Incorrect input modeling
When testing processing of data formats, data usually comes from outside the JVM -- it may be read from a storage device, or received over the network from external services. If so, data will arrive as a byte stream of some kind, so the most natural representation is usually java.io.InputStream. Or, if the data is length-prefixed, it may be processed by reading it all into a byte array (or ByteBuffer), and offered to the library using that abstraction.
But (too) many performance tests assume that input comes as a java.lang.String. This is peculiar, given that Strings really only live within the JVM, and must always be constructed from a byte stream or buffer, decoded from an external encoding such as UTF-8 or ISO-8859-1. About the only case where input actually arrives as a String is unit tests; or sometimes when another processing component has handled reading and decoding of content.
Now: given that reading and character decoding are generally mandatory steps, how are they actually handled? Many libraries punt on the issue by simply declaring that what they accept is a String (or, sometimes, a Reader). This is functionally acceptable, as the JDK provides simple ways to handle decoding (for example by using java.io.InputStreamReader, or by constructing String instances with an explicit encoding). But this also happens to be one area where more advanced parsers can optimize processing significantly, by making good use of specific properties of the encoding (for example, the fact that JSON MUST be encoded using one of only three allowed Unicode-based encodings).
So the specific problem of using "too refined" input is two-fold:
- It underestimates the real overhead of (JSON) processing, by omitting a mandatory decoding step, and
- It eliminates performance comparison of one important part of the process, essentially punishing libraries that can do the decoding step more efficiently than others.
The effects of decoding overhead are non-trivial: for JSON, it is common for UTF-8 decoding to take nearly as much time as the actual tokenization (~= "parsing") of the decoded input; which is also why quite a bit of effort has been spent on making parsers more efficient at decoding than general-purpose UTF-8 decoders (such as the one the JDK comes equipped with).
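The input-modeling difference can be sketched in a few lines. The class name (`InputModeling`) is invented, and counting colons is just a trivial stand-in for tokenization; the point is where the UTF-8 decoding step sits relative to the measured operation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class InputModeling {
    // Problematic input modeling: the benchmark starts from a String,
    // so the mandatory byte-to-char decoding of real input is never measured.
    public static int countColonsFromString(String json) {
        int n = 0;
        for (int i = 0; i < json.length(); i++) {
            if (json.charAt(i) == ':') n++;
        }
        return n;
    }

    // Better: start from bytes, as data actually arrives off the wire,
    // so decoding cost is included in the measured operation.
    public static int countColonsFromBytes(byte[] json) {
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(json), StandardCharsets.UTF_8)) {
            int n = 0, c;
            while ((c = r.read()) != -1) {
                if (c == ':') n++;
            }
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot happen for in-memory input
        }
    }

    public static void main(String[] args) {
        byte[] wireData = "{\"a\":1,\"b\":2}".getBytes(StandardCharsets.UTF_8);
        System.out.println(countColonsFromBytes(wireData));
    }
}
```

A parser that accepts byte-based input directly can fuse decoding with tokenization; a String-only benchmark simply hides that advantage.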
4. What am I testing again?
Another common mistake is defining vaguely (or not at all) what exactly is being measured. This actually starts even earlier, with incorrect terminology for the library: most JSON "parsers" are much more than parsers. In fact, I can not think of a single Java JSON library that is just a parser (or, perhaps most accurately, a tokenizer): most new JSON processing libraries implement or embed a low-level parser but also bundle a higher-level abstraction (either a tree model or data binding -- see "three ways to process JSON" for a longer discussion of available processing modes). This is not limited to JSON, by the way; with some other data formats like XML, things are even worse: many things called parsers (such as DOM or JDOM) do not even include a parser themselves! Instead, they use an actual low-level (SAX or Stax) XML parser and just implement a tree model on top of it.
But why does it matter? The basic issue is that the comparison is sometimes apples to oranges: for example, comparing a simple streaming parser to a data-binding processor (or one that provides a tree model) is not a fair comparison, given that the functionality provided is very different from the user's perspective.
Going back to the "JSON parser" misnomer: some tests choose to measure performance of a specific processing model -- often the tree model, probably because the original "org.json parser" only offers this abstraction -- yet claim the result as proof that "parser XXX is the fastest Java JSON parser!". This is incorrect, since it bundles together both low-level parsing (which is most efficiently done with a minimal incremental streaming parser) and the building (and possibly manipulation) of a tree model on top. To give some idea of the relative costs: building a tree model can take more time than parsing (tokenizing) the JSON content -- this is similar to XML processing, where building a DOM tree typically takes more time (often 2x) than low-level parsing, although JSON tree models are usually much simpler than XML tree models.
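The distinction between the two levels can be made explicit in the benchmark itself. In this sketch (class and method names invented; both "parsers" are toy stand-ins, not real JSON processing), tokenization only scans the input, while the tree stand-in additionally materializes objects on top of the scan, so the two measurements are labeled and reported separately rather than lumped together:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class MeasureWhatYouClaim {
    public static final byte[] DOC =
        "{\"ids\":[1,2,3],\"name\":\"test\"}".getBytes(StandardCharsets.UTF_8);

    // Stand-in for low-level tokenization: scan bytes, count structural tokens
    public static int tokenizeOnly(byte[] doc) {
        int tokens = 0;
        for (byte b : doc) {
            if (b == '{' || b == '}' || b == '[' || b == ']'
                    || b == ',' || b == ':') {
                tokens++;
            }
        }
        return tokens;
    }

    // Stand-in for a tree model: tokenize AND materialize objects on top
    public static List<String> buildTree(byte[] doc) {
        String text = new String(doc, StandardCharsets.UTF_8);
        List<String> nodes = new ArrayList<>();
        for (String piece : text.split("[\\{\\}\\[\\],:]")) {
            if (!piece.isEmpty()) nodes.add(piece);
        }
        return nodes;
    }

    public static void main(String[] args) {
        int reps = 200_000;
        long sink = 0;

        long t1 = System.nanoTime();
        for (int i = 0; i < reps; i++) sink += tokenizeOnly(DOC);
        long t2 = System.nanoTime();
        for (int i = 0; i < reps; i++) sink += buildTree(DOC).size();
        long t3 = System.nanoTime();

        // Report each level separately, labeled with what was measured
        System.out.printf("tokenize only:    %.1f ns/op%n", (t2 - t1) / (double) reps);
        System.out.printf("tokenize + tree:  %.1f ns/op%n", (t3 - t2) / (double) reps);
        if (sink == 0) System.out.println(); // keep sink observable
    }
}
```

A result reported this way cannot be mistaken for a claim about raw parsing speed when it actually includes tree building.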
The important thing here is that a test should clearly explain what is being measured; and, in cases where differing approaches are compared, what the trade-offs are.