Tuesday, July 27, 2010

Fun with bytecode generation, minus problems with generic signatures

One thing that has been on my TODO list for multiple years moved up to my "current tasks" list: that of using Java bytecode generation for something useful: for "materializing interfaces" (typically simple Bean interfaces), that is, constructing concrete implementations of interfaces that only consist of simple getter and/or setter methods. It will be one of major new features for Jackson 1.6, codename "Mr Bean". In fact there is just one other such mental todo task from that era (which would be "use Groovy for something useful")...

Implementation was started by my colleague Sunny G at Ning (which is big reason why it got started at all!) and uses minimalistic, lean and mean ASM bytecode generation package for actual low-level work. This is good in keeping dependencies to minimal (in fact, Jackson mrbean module will embed repackage version of ASM 3.3), and will probably also result in simple bytecode. But the downside is that code to write is quite low-level.

Working with low-level code generation actually brings back memories from last millennium -- I wrote 68000 assembly in early 90s, and 6502 machine code before that in late 80s. Name ASM is actually quite apt for the package as using ASM is quote similar to using full-feature Assembler package, which is still a notch above bare machine code or using machine code monitor (for those who remember what that meant on, say, Commodore-64). This is to say, it's actually bit of fun for me; although amount of time spent on actual debugging and troubleshooting is considerably higher than with "normal" Java coding. But at least you don't have to manually calculate various offsets (local variable and stack sizes for example), or allocate constant Strings in per-class constant tables. Writing actual code is still rather involved however.

At this point basic functionality works acceptably, leading to the first snapshot release of Jackson-1.6.0. But there is one serious problem that I have been unable to resolve. Although ASM actually does provide some support for dealing with generics -- that is, ability to declare generic signatures of fields, methods, and supertypes -- and although I am using functionality exactly as it should be (including verifying exact form signatures need, compared to type-erased 'description' parameters), end result is only half-working. Half meaning that if disassembling resulting class bytecode using ASM itself, generic signatures are found; but when using JDK reflection (java.lang.reflect.Method/Field, getGenericType()), only type-erased information is found. I still hope to resolve this; I hope someone on ASM users list can point out what I am doing wrong. There shouldn't be anything fundamentally hard about making it work.

Anyhow, I thought this might be somewhat interesting; I'll get back to working on finalizing 1.6.0 so it gets released before summer ends and it is time to (yet again!) contemplate brewing of dreaded Blackberry Beer! :-)

Saturday, July 10, 2010

Every day JSON work: comparing JSON documents for equality ("are these JSONS same?")

One JSON task that may seem trivial but is not is that of checking whether two given JSON documents are equal: that is, have same structure and values, and ordering of array values is same. It is not sufficient to use String equality comparison because while equality of Strings does mean contained JSON is equal, differing Strings does not mean they are not equal, since:

  • It is possible to quote characters differently using \uXXXX mechanism
  • Order of properties of JSON objects is undefined (at conceptual level); meaning that physical ordering may differ while conceptually JSON is still identical

So how should one compare equality? With Jackson, simplest way is to construct JSON tree (JsonNode) for documents to compare and just use JsonNode.equals()! Like so:

  ObjectMapper mapper = new ObjectMapper();
  JsonNode tree1 = mapper.readTree(input1);
  JsonNode tree2 = mapper.readTree(input2);
  boolean areEqual = tree1.equals(tree2);

I realized that this (fact that JsonNode.equals() works "as expected", doing deep value comparison) is not documented outside of Javadocs (it will be now, I will add something at FasterXML Jackson Wiki). But hopefully it was already known by experienced Jackson users. Either way, here's how JSON content comparison can be done easily and reliably.

Friday, July 02, 2010

Comments, yes please!

Hurrah! Thanks to Bryce's excellent recommendation, commenting is again possible here using Discus (which I had seen being used by a few blogs).

To celebrate this new-found capability, I will be taking next week off and will neither be writing entries nor reading any comments added. :-)

... but when I get back, there's bumper crop of Jackson 1.6 features to write about, from kick-ass "binary JSON" format (aka "Smile"), to automatically generated interface implementations and tons more smaller improvements to usability. Latter mostly due to lately having to eat my own dog food (and having to serve it for co-workers as well!) on regular basis.

Wednesday, June 30, 2010

No Comments

Ok, another minor but important change that just occured is that at present there is unfortunately no way to comment on my writing style, content, or anything else, on this blog.

Believe it or not this not due my sensitive skin or low self-esteem. It is actually due to surprising sticker shock I got when looking into actual cost of continuing with my formerly-free comment-extension provider. As I mentioned earlier, I am not against paying for things I use; and I can also get over immediate knee-jerk reaction to what may appear as bait-n-switch. So it is not about unwillingness to pay. But I am not willing to pay more for simple comment extension than I pay for hosting the whole thing (including shell access, plenty of storage and transfer space) -- this just does not seem like a reasonable value proposition. As a rough estimate of what I would pay, suitable price should be something similar to lowest fee one has to pay for a custom social network at Ning (20$ / year). That I would be ok with.

So, until I figure out suitable replacement, you will just have to bite your tongue, or send comments directly to me (both of which you are welcome to do! :-) ).

ps. If your comment would be about a free commenting system I could use, please do send it to me -- contact info is available via link next to "About me".

Tuesday, June 29, 2010

Experiments in advertising, here goes nothing (aka Welcome, AdBrite!)

Ok let's talk about something that is quite visible to you dear readers, but something that you have probably managed to ignore automatically. Yes, I am taking about those commercial decorations on margin of these pages. But please, don't change the channel quite yet. :-)

1. Advertising Changes... yay!

So what's up there? After being a very small AdSense publisher for few years, I figured that I might well retire before ever seeing another check for ads displayed on this blog; so it might be time to explore options: if not to get higher yields then at least maybe get more interesting ads. I also generally root for underdogs, and at this point Google is the ultimate uber-dog if there ever was one. So why not partner up with some other advertising puppies.

Given these loose goals about the only criteria for finding a replacement would be that it is not Google. And, well, ideally it should not be Apple, and preferably not Microsoft. But latter two are negotiable constraints (in fact, I am tempted to check out M$'s PubCenter; if for nothing else due to its catchy name!).

2. So... ?

But enough background discussion: in the end, I decided to change my ad provider from the big G to an unknown-before-about-a-week-ago company called AdBrite. Mostly because they topped this Handy List of Google Adsense Alternatives. And finally, as of today, I bothered to change blog templates for the change to take effect.

3. Can hardly contain my excitement <yawn>

At this point I am curious to see to what kind of ads they might be pushing to my blog. I sort of wish it was something that lots of people found totally repugnant yet completely fascinating... but chances for that are probably low. We'll see -- maybe I need to cycle through variety of ad sales networks before choosing my poison.

4. Commercial Proposal by Author

By the way, if anyone actually wants to actually advertise here -- buy a section for month-by-month advertising, selling something that actually relates to something I have written about -- let me know. I am open to bids and can show google analytics statistics for pricing, so you have a fair idea of what you'd get.
The only limit I will put is that monthly ad space rental fee has to be non-zero positive number in full US dollars. :-)
(you can consider it as the auction starting price)

Monday, June 28, 2010

Async HTTP Client 1.0 released

Ok, this announcement went out last week, but it is important to re-iterate: version 1.0 of Ning async HTTP Client is now out!

This is good news for multiple reasons: obviously ability to do asynchronous (read: non-blocking, i.e. no need to use one thread for each and every open connection) HTTP communication is valuable in itself. But I also hope that this spurs some friendly competition in improving all HTTP-based communication on Java platform -- there are still features that are not implemented, and due to complexity of efficient connection handling, there should be still room for improvement with performance as well. And finally this should lead to wider adoption of this relatively new library, so that it gets properly battle-tested and proven.

I am actually planning on using this client for cases where regular blocking client could work, to see how well it performs and exactly how easy it is to use non-blocking API, compared to existing blocking alternatives.

Tuesday, June 08, 2010

Introducing My Newest Open Source project: Tr13 (that is so 1337!)

I have briefly mentioned that I have been working on a new little low-level data structure package, called (Ning) Tr13, started at GitHub. Now that code is getting ready and tested, it is time to "formally" introduce it.

Tr13 is an interesting project form multiple angles: for one, it is my first "corporate-sponsored" Open Source project (i.e. related to may day-to-day work -- although not focus of it, just a useful thing for me and hopefully my colleagues at Ning). This alone is worth celebrating; the whole process has been remarkably smooth from 15-minute end-to-end approval process (take note Amazon developer community...), tosteady if somewhat slow development itself (there is nothing groundbreaking from CS perspective -- more on this below).
This is also a rare case of my already having reasonably clear idea as to where I would be using this package. It may sound surprising, but truthfully most open source libraries I have written, I have not actually absolutely needed them before writing. :-)
(there are exceptions: Java UUID Generator is something I wrote since I needed it -- but not so for Woodstox, or even Jackson, to some degree, both of which I have extensively used after I wrote them).
And finally, well, it is a change from my more common data format parsing / data binding work. Although I was thinking of "climbing up the stack" to write something higher level, I ended up going the other way, a slight detour.

Anyway, I digress. So...

What exactly is tr13?

It is a space-efficient ("compact") read-only implementation of Trie data structure, with slight modification to use Patricia/Radix structure for leaves (but not for branches, since that did not seem useful for my use cases).

Current implementation supports two kinds of Tries:

  1. Byte[] key, variable integer (~= long) value
  2. Byte[] key, byte[] value

In both cases, byte[] is typically a String, but can be any other serializable thing; for example, ints and longs could be trivially converted to 4/8 byte keys. Separate variant for variable-length integers is further optimization for my initial use case, where values are generally small ints, which can fit in a single byte, but where it is convenient not to be bound by arbitrary limits (like 256).

Internally structure is actually just a regular byte array (or, alternatively, ByteBuffer, to allow for "GC-free" tries!). Data format needs to be documented, but is a rather simple implementation of basic trie ideas, nothing to write a dissertation about.

Why does it Rock?!

Given how good plain old JDK HashMap can be -- you could just use HashMap<String,Integer> for the first case -- what is so special about tr13?

Simply put: memory usage, or lack thereof (ditto for GC activity). This because:

  1. Overall memory usage is pretty tight -- for test data I have (30 million entries of typically 13-byte key, small integer number == 600 meg input file), trie structure actually uses less memory than its file representation (by about 35% less). And,
  2. Since it is all in just TWO objects (raw byte array, wrapper Trie object), it should keep GC rather happy.

This is not coincidental, since the use case we have is that of sizable routing tabls that should ideally be kept fully in-memory. Hence compactness of data structure, with reasonable lookup times (i.e. no linear search), was design criteria. Even other somewhat frugal data structures (like b-trees) are not anywhere as space efficient.

Downsides?

First obvious limitation is that it is a read-only data structure: you must have all the data at hand when building it. Data also has to be ordered, not necessarily alphabetically (although that is the usual way), but in a stable lexicographic ordering, such that build process has invariants it needs. This is not as big a limitation as it might seem, at least when merging functionality (more on this later on) is added. For now it may be necessary to re-build data structure regularly, and keep an additional "short-term" lookup data structure at hand for additions that are needed between re-builds.

As to performancem resulting data structure is not the fastest way to find things: from complexity standpoint it is a linear algorithm, compared to theoretically constant time for hash-based alternatives. This may or may not matter; it should not matter a lot with respect to key length (HashMap has to calculate hash for key and compare key bytes, unless keys can be canonicalized). Another potential performance concern is that trie data structure does not have much locality; but this too is unlikely to be much worse than for typical alternatives. In fact, due to high memory efficiency, and use of plain byte array, it is not guaranteed that HashMap has any better locality, even if a single lookup results in a match (this due Java GC's habit of "shuffling" object data around).

All in all, Tr13 does much more work when finding values, since it traverses data structure once per each key byte (except for Patricia-optimized tail, which may be multi-byte linear match). So I would not expect it to be as fast as direct hash lookups.

But in the end, these are speculations, and the real test would be use.

... or, how about a performance test?

Performance Testing

Ok: the question of efficiency is something worth looking deeper into. And so I wrote a benchmark that compares VIntTrieLookup (byte[]-to-vint trie) against HashMap<String,Integer> (for curious, it's under 'src/java/test/manual/' in source tree). Test loads all the entries in memory, and then performs randomized lookups (one lookup for every key, but order shuffled so as to be different from 'natural' ordering of trie).

When given a subset of the whole data file (turns out HashMap hogs so much memory that I just can't test the whole data set!) -- a 32 meg, 2 million entry chunk -- here are the results:

Memory usage:

  1. Tr13: 20.6 megs (reduction of about one third)
  2. HashMap<String,Integer>: 314 megs

So in this case, HashMap<String,Integer> uses about 15 times more memory than Tr13 (note: Integer values are canonicalized using Integer.valueOf -- without this, usage would be 360 megs).

And what about performance? What is the cost of compactness?

For 3 million lookups, time taken was:

  1. Tr13: about 5000 milliseconds (about 400,000 lookups per second)
  2. HashMap<String,Integer>: about 1200 milliseconds (about 1,670,000 lookups per second)

so HashMap is about 4 times faster than Tr13 for lookups, on my system (2.33 ghz mini-mac, JDK 1.6), for given data set.

Since my first need is to be able to do lookups in memory, performance difference is not of particular importance (there might be one or two lookups per request, so anything in tens of thousands per second would work), but it is good to know rough order of magnitude difference.

What next?

At this point Tr13 should be ready for use, and I plan to release version 0.5 soon. The only thing I am waiting for is to get a Maven repository set up at Sonatype OSS repository (kudos to Sonatype for this free hosting!), so that Maven-based (and I assume, Ivy-based) projects can use it. This is actually what I need for my daytime use.

But beyond Maven deployment, there are some additional features I want to implement in near future, such as:

  • Ability to traverse through entries (Map.Entry<> style)
  • Ability to merge two tries efficiently -- could be built quite easily given iterator functionality
  • ... document the data structure (ok ok, not a development task, but "hard" (read: tedious) yet necessary work) -- it is very simple, but not so simple that reading code would make it obvious.
  • Current implementations are limited to 2 gigs, due to Java byte[] and ByteBuffer limitations. But fundamentally there is nothing particularly hard about writing segmented versions where only segments would have this limitation (well, probably values too, but I don't see any need for 2gb+ values in my tries...)

Merging functionality specifically would be useful for "semi-mutable" lookup data structures. For data that evolves slowly (which is the case for routing table use case I have), it is possible to keep track of needed modifications (in a, say, HashMap), and do primary lookup against that data structure, and only if no match is found, against immutable trie. And when this short-term data structure goes enough, or enough time passes, trie could be re-built with existing and new data. Repeat and rinse.

And perhaps it might even be possible to work some more on fundamental data structure itself, and find ways to make it mutable. Insertions (and deletion) operations would be rather costly, due to global changes needed (another downside of compactness is that changes tend to spill far). But even slow mutability would be better than no mutability; and perhaps there could be ways to trade some space (intentional unused regions within trie, to be used for insertions?) for lesser likelihood of having move all the data around.

Monday, May 31, 2010

On prioritizing Open Source projects, tasks: First In, First Out

I have been bit less productive with my "extra-curricular" activities (open source coding, writing blogs) lately. This is mostly due to intense focus on my paying daytime job, and is part of the natural cycle of things for me.
But altohugh I have had bit less time and energy to spend on these tasks, I have had relatively more time to think about things. It turns out that amount of time I have to think about things does not quite correlate with amount of time to do things (which is interesting thing in its own right, but maybe worth a separate blog entry). More thinking often leads to having more ideas for writing blog entries; even if writing activity itself is still constrained by time crunch.

1. FIFO as a priorization mechanism

Anyway: I have realized that a fundamental operating principle that I have with respect to managing my open source projects -- tasks within projects, relative focus between projects -- is that of trying to tackle oldest issues first. Good old First-in-First-out (FIFO) queuing of things.
Except for high-priority urgent bug fixes, which I generally fast track (but which fortunately are not very common), I do try to increase priority of issues that have not been fixed. This is actually quite different from many task priorization methods (even if it's well-known algorithm for operating system process priorization), but which I feel is an important part of maintaining "culture of excellence", ensuring that qualtity of project output remains high. And to reduce risk of letting most of your open source projects stagnate to death.

Thing is: as with fish (and guests, as per Mark Twaine), bugs also smell more longer they stay. Smell of code rot comes from long-standing unresolved issues. It also tends to be the case that it is easier to keep momentum than to get things rolling -- this makes it hard to revive stagnant projects; but possible to keep recently(-enough) worked-on projects chugging along.

Now: it is bit more well understood that teams should at least occasionally go through list of long-standing issues and ideally not only re-visit them but also fix. In for-profit development targets and priorities tend to fluctuate more than with labour-of-love projects that most open source projects still are. This tends to make it less likely that older issues get resolved, as priorities are more driven by who cries loudest, and most recently. Still, most good developers I know are uncomfortable leaving issues unresolved before starting to extend functionality, to build new things.

But I think FIFO principle as driving force for priorization goes beyond increasing priority of old tasks. I think it also should (and does, in my case) drive relative priorities of different projects (or services, systems, libs, frameworks). And this is something I also try to do more.

2. Example case: my own project priorization ("product backlog")

Specific case in point is that of my focus on my current "flagship" project, Jackson JSON processor. Jackson has had my main focus for well over a year now -- mostly since it is by far the most popular of things I have built; and will likely remain so for a while now. So if I chose to, I could spend all my time and then some just working on issues related to Jackson.
Doing so would, however, essentially kill other projects I am heavily involved with -- Woodstox, StaxMate, Aalto, Java Uuid Generator -- as well as prevent me from expanding to new areas (like upcoming "compact trie" package, google for "ning tr13"; more on this once it is ready to be announced). And that is not something I find compelling as an idea. Especially since code that is not worked on will start to rot at fast rate, and becomes useless surprisingly fast.

So: as frustrating it is to watch issues stack for important projects like Jackson, the way I try to do things is to cycle through projects that I consider worth keeping alive. Sometimes it just means flipping between two projects; and some projects may even become complete in their own right, meaning there just isn't that much to work on, and thus the cost of switching focus for some minor work just isn't worth it. But to give further example of what I mean, here is my current thinking of roughly how I hope to complete next tasks, with respect to projects I work on (note: priorities are not absolute or carved in stone; they are akin to Scrum sprint plan, if even that solid):

  1. Complete Java Uuid Generator rewrite to version 3.0
  2. Work on Woodstox 4.1, focusing on XML Schema handling improvements (working with couple of other OS developers who have stake in this area -- including work on Sun Multi-Schema Validator)
  3. Finish version 1.0 for the "compact trie" project
  4. Implement minor extensions for StaxMate, to get to version 2.1
  5. Jackson version 1.6: must have better Enum type handling; should have "materialized interfaces"; can not be delayed too long since there's always version 1.7 to write...
  6. Compact binary format alternative for JSON (with Jackson as reference implementation)
  7. ... maybe consider implementing "DataMate" (if you haven't heard my brainstorming on what it is, consider yourself lucky :) )
  8. Cycle through projects again!

So why consider work on JUG as the first priority? Well, for one, time is right to both upgrade it to be useful and relevant -- there are things like JDK 1.4 - introduced java.util.Uuid class; JDK-1.6 introduced access to Etherner Mac address; as well as "new" use cases (Cassandra, for example, uses Time-based UUIDs heavily!). And as importantly, work has reasonably limited scope: it will take about week more of focus to get it all done; as I have already spend some time over past month or so to make this reality.

And Woodstox? While XML does not have same momentum in J2EE world as it once had, Woodstox is very widely used, complete and useful library. But its XML Schema support has only recently been more heavily exercised, and multiple implementation flaws have been incovered. To make Woodstox relevant as a first-class Java XML Schema supporting tool, some work is needed. Further, there are some useful improvements in trunk that can only be released with version 4.1 (retro-fitting to 4.0 would be too risky).

I think you get the idea -- maybe not exactly why I feel this order is sensible, but at least see that there are multiple conflicting factors. I guess I know my own working habits well enough to know that "out of sight, out of mind", meaning that longer a project is not being worked on, harder it will be to get back to work on it. And so the best way to keep all the balls in the air is to juggle through them; and do this as a semi-formal process and not rely on user request to trigger such changes (esp. since there are more requests for work than time for it).

Wednesday, May 19, 2010

Un-hibernating projects: Java Uuid Generator, getting ready for 3.0!

As cycle of seasons has rolled to late spring, it is time for hibernating things -- bears, and stagnant open source projects -- to wake up and start moving. It just so happens that this is the case with venerable Java UUID Generator (JUG): my first true Open Source project.
Although it is not exactly the first thing that I ever released as open source (that would be something called "NetReaper", or perhaps "DLR", both from late 90s -- few have ever heard of them -- heck, even "Fractalizer" is older!), and much less the first piece of software I have released (shareware lib/app that compressed Amiga Soundtracker files using delta compression is probably the first one, from late 80s!), I count it as basically starting point of my open source "career".

So what is happening? Well: there is the new JUG project page (at the OS darling GitHub); matching skeletal JUG product page at FasterXML; and of course the brand new JUG users discussion group (java-uuid-generator-users) at Google Groups, waiting for users to talk about it. And probably most interestingly, actual development effort to produce third major version, 3.0.

Given that the project has spent past 5 or so years changing very little, why is there new development effort? Mostly because JDK finally caught up with JUG, so to speak -- JDK 1.6 finally has a pure Java method of accessing Ethernet interface MAC addresses -- but partly also because of other niceties that can now be added (java.util.UUID was added in JDK 1.4; which was not the stable version at the time of writing JUG 2.0). And finally, there's quite a bit of clean up that would be nice to do if I was to work on the code.

Given above, here are the modest goals for version 3.0

  • Add convenient support for using local Ethernet address, without using JNI library (requires JDK 1.6); and remove legacy code that was needed for JNI
  • Change UUID type to use from JUG-specific to java.util.UUID (also allows removing quite a bit of code)
  • Build/deployment changes: change build to Maven (including releasing builds to Maven repos); jars built as OSGi bundles as well
  • SCM changes: move from Safehaus/svn to GitHub/git
  • Improve API to avoid relying heavily on singletons; streamline for simpler (and perhaps more elegant) access
  • Add support for one "new" UUID generation method (using SHA-1 instead of MD5 for name/hash based generation)
  • Maybe even write a simple tutorial for using the lib!

Which is just to say, renovate the package so it does not feel quite so 2002 any more (which is when it was written originally). :-)

Monday, May 17, 2010

Finally: Java JSON Schema validator; based on Jackson

Ok here are some good news for Java developers who use (or might like to use) JSON, but are bothered by lack of data format validation options: Nicolas Vahlas has written the first Java JSON Schema validator, and it is available from Gitorious as project json-schema-validator.

I have not yet have time to dig deep into it, but there all signs for it being so-called Good Stuff. Not just because it is based on Jackson -- although proper reuse of existing solid components is a general good sign -- but because description gives an idea of author being someone actually knows what he is doing.

So please check it out if said functionality seems at all interesting: the best way to ensure it becomes a first-class tool (and maybe even help JSON Schema standard improve along the way) is to use it, give feedback, and get the whole flywheel-of-virtue (aka virtuous cycle) thing going on. That's how things like Jackson and Woodstox became good: feedback is the amplifier of open source productivity.

Ok, enough raving: I'm off to get the sources for bit of closer look. :-)


Last posts


More Ads? Yes Sir!


Related Blogs

(by Author (topics))

Powered By

Powered by Thingamablog,
Blogger Templates and Discus comments.

About me

  • I am known as Cowtowncoder
  • Contact me at@yahoo.com
Check my profile to learn more.