Each comment must start with the words “sorry, …”

I mean, if you really need to write one. When you put a comment in, you implicitly admit your inability to communicate your ideas through your code; you are basically saying “sorry, I do not know enough of this language to express myself decently, so I will put some content in another language to make things clear”.

How dare you? I will have to use your code at some point in time! I will need to go and change it, I will need to wade through your rotten series of comments, I will have to git blame here and there, use grep aggressively, and then spend precious hours of my life that I will never get back in order to understand what you so poorly communicated in your code. How disrespectful of you. At least apologise.

So, new rule for my teams starting today.

Java: no timeout on DNS resolution

Breaking news! I just discovered (well, actually yesterday) that it’s not possible to set a timeout on DNS resolution in Java, which relies on the underlying OS. This is NOT good when you have any shape or form of SLA or QoS: you are basically throwing it out of the window!

I suggest you do something about it. This is the code I pushed on msnos (see here code and test): basically it uses a thread pool and a Future to make the magic happen:

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.*;
import org.apache.http.conn.DnsResolver;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DnsResolverWithTimeout implements DnsResolver {

    private static final Logger log = LoggerFactory.getLogger(DnsResolverWithTimeout.class);
    private final ExecutorService executor = Executors.newCachedThreadPool();
    private final DnsResolver systemResolver;
    private final long timeoutInMillis;

    public DnsResolverWithTimeout(DnsResolver systemResolver, long timeoutInMillis) {
        this.systemResolver = systemResolver;
        this.timeoutInMillis = timeoutInMillis;
    }

    public InetAddress[] resolve(final String host) throws UnknownHostException {
        // run the actual (blocking) resolution in a worker thread, so we can give up after a timeout
        Future<InetAddress[]> result = executor.submit(new Callable<InetAddress[]>() {
            public InetAddress[] call() throws Exception {
                return systemResolver.resolve(host);
            }
        });
        try {
            return result.get(timeoutInMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            log.warn("Unexpected interruption while resolving host " + host, e);
        } catch (ExecutionException e) {
            log.warn("Unexpected execution exception", e.getCause());
        } catch (TimeoutException e) {
            log.warn("Timeout of {} millis elapsed resolving host {}", timeoutInMillis, host);
        }
        throw new UnknownHostException(host + ": DNS timeout");
    }
}
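If you are on Apache HttpClient 4.3 or later you can then plug it in when building the client; here is a quick sketch of the wiring (the 2 seconds timeout and the use of SystemDefaultDnsResolver as the delegate are just my assumptions here):

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.conn.SystemDefaultDnsResolver;

// fragment: resolve DNS through the OS resolver, but give up after 2 seconds
CloseableHttpClient client = HttpClientBuilder.create()
        .setDnsResolver(new DnsResolverWithTimeout(new SystemDefaultDnsResolver(), 2000))
        .build();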

Of course you can make sure your OS is behaving, but you may not have such a luxury :)

Microservices Antipatterns [2]

Looking forward to the upcoming Geecon conference in Prague, I am trying to identify new antipatterns, and I have to say that’s not that difficult, having been working with microservices here at Workshare for almost three years. The more we use them, the more dead ends we find :) and it’s a natural Darwinian process I guess: regardless of all you can find on the internet nowadays, sometimes it’s only through your own experience that you learn. And, as Uncle Bob says, “you learn more from your mistakes than from your successes” :). So let’s have a look at those new antipatterns!

Monolith Frontend

You have a single page application or a detached website that speaks to your beautiful microservices, you have to deploy it in full for an upgrade, and you are using vertical “feature” teams. Congrats, you are living in the era of the monolithic frontend!

In fact you carefully structured your teams around different small sets of microservices, each one deployable individually, empowering your teams to deploy quickly and independently; but when you reach the frontend, you go back to the old model of having one big fat application, with merges all over the place and a single deployment. This basically wipes out all the benefits, team wise, of a microservice architecture.

The solution is of course to split the app into small independent pieces, each deployable on its own, or to move the entire responsibility for the frontend to a separate team. Please note that none of these is as difficult as it seems.

Early mornings

You have a scheduled rhythm for deployments (i.e. every week) and you deploy your services in the early morning, typically around 6am, in order to avoid any disturbance to production across the world. You have a page that lists all the new services to be deployed (created manually) and who is responsible for them.

The issue here is the lack of a critical component called continuous deployment, which would allow you, with the push of a button, to deploy all your software to production. As you do not have a fully automated procedure to deploy to production, you are probably lacking other critical parts such as redundancy or high availability of your services.

The solution is to take your time and get things in order :) Well crafted microservices allow you to deploy very fast and without particular hassle: you have to make sure first that your code supports that, and then that your devops are fully on board in providing the critical parts required, usually in the form of automation scripts with any decent framework.

Microservices Antipatterns

Last Friday I was a speaker at the Geecon Microservices conference in Sopot. I was planning to talk about the whole thing, mentioning of course our work at Workshare on msnos, and I had a long presentation but only a 30-minute slot. Also, I went on stage after a shitload of good speeches on the topic, so I decided to talk about a small niche not fully exploited at the moment, microservices antipatterns: this is a short recap.

Ah, why do I know about this stuff? Because at Workshare we have a black monolith, and when I joined in January 2013 I immediately started pushing for microservices :)

Disneyworld
When we are in Disneyworld everything smells nice, cosy and lovely. Also, there are no healthchecks, no monitoring and no metrics, so you assume everything is well in your systems, all the time.

It may seem obvious to have monitoring on any system in production, but some people think that because the services are small and simple nothing will go wrong. Of course this is silly, because we know that if something can go wrong, it will! Also, small does not imply simple, as the complexity of any distributed architecture is orders of magnitude higher compared to a standard monolithic system.

For that reason you need each service to be able to expose a healthcheck and metrics, possibly through an HTTP endpoint, and to surround it with some external monitoring. It won’t hurt to also add some smoke testing for each environment, possibly through tests created by your QA people. Implementation wise, Dropwizard is a valid option for Java services, and we have our own small open source implementation for Ruby.
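To give an idea (names are made up, and I am assuming Dropwizard’s metrics-healthchecks module), a healthcheck is just a small class extending HealthCheck that reports healthy or unhealthy:

import com.codahale.metrics.health.HealthCheck;

// a minimal sketch: "Database" is a stand-in for whatever dependency your service relies on
public class DatabaseHealthCheck extends HealthCheck {

    public interface Database { boolean ping(); }

    private final Database database;

    public DatabaseHealthCheck(Database database) {
        this.database = database;
    }

    @Override
    protected Result check() throws Exception {
        // healthy only if we can actually reach the dependency
        return database.ping() ? Result.healthy() : Result.unhealthy("cannot ping the database");
    }
}

Once registered, Dropwizard exposes all healthchecks on the admin connector (the /healthcheck endpoint), which is exactly what your external monitoring can poll.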

Monolith database

In this scenario many of your microservices share the same database: they expose nice, distinct REST endpoints, but all the data ends up in the same bin.

The issues with this are multiple. First, all your services are coupled to the same schema, and so are your models (if you have any), so a change in the database may require you to propagate a change in your models. Furthermore, your services won’t be able to be released independently, as a database migration required by service A may (or will) require a change in services X/Y/Z: one of the big advantages of this architecture goes out of the window.

You need each service to use its own database, and when necessary they will talk to each other using your APIs. You should design your external API with this constraint in mind, moving on from the big SQL joins and welcoming asynchronous completion of tasks, a user interface that progressively updates itself, and APIs that return links to other resources rather than embedding them.
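To make the “links rather than embedding” point concrete, here is a sketch of a payload class (names invented) that an orders service could return: the user, owned by another service, is referenced by a link instead of being joined in:

import java.net.URI;

// hypothetical response of an orders service: the user is a link to a resource
// owned by the users service, not an embedded copy of its data
public class OrderResource {
    public String id;
    public long amountInCents;
    public URI user; // e.g. the URI of the user on the users service, resolved by the client only when needed
}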

Unknown caller

Your microservices call each other using their endpoints, but there’s no correlation between such calls, so each call is a new one, completely unrelated to the one that originated it.

When this happens there’s no easy way to track what caused the failure of a call executed on a service that fails somewhere in the chain. As there’s no (common) identification shared between such calls, you usually end up in an endless marathon of log lurking in order to understand the reason for the failure.

The solution lies in peer collaboration: each server must inject a call id if missing, each server must propagate the call id if present, and each server should log each call including the id. This allows the calls to be correlated together, thus providing a clear chain of invocations between the services.
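A minimal sketch of the “inject if missing, propagate if present” rule, as a plain servlet filter (the X-Call-Id header name is just my pick, any agreed name will do):

import java.io.IOException;
import java.util.UUID;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;

public class CallIdFilter implements Filter {

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest http = (HttpServletRequest) request;

        // propagate the incoming call id, or inject a fresh one if we are the first hop
        String callId = http.getHeader("X-Call-Id");
        if (callId == null)
            callId = UUID.randomUUID().toString();

        // make it available for logging and for any outgoing call made while serving this request
        request.setAttribute("callId", callId);
        System.out.println("call " + callId + " -> " + http.getRequestURI());

        chain.doFilter(request, response);
    }

    public void init(FilterConfig filterConfig) {}
    public void destroy() {}
}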

Hardcoded hell

All the addresses (endpoints) of your services are hardcoded somewhere, sometimes directly in the code.

Your microservices directly know about each other, so every time you need to add a new microservice you have to crack open your code or, if you are in a slightly better situation, your configuration or deployment scripts. If you are also experiencing the “All your machines are belong to us” antipattern, you may be using some form of DNS or naming trickery at the machine level.

You can easily build your own simple discovery mechanism, with a replicated registry which hosts service information, or you can introduce a discovery mechanism like Eureka, ZooKeeper, Consul or msnos (yeah, shameless plug!).
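If you go the DIY route the registry contract does not need to be fancy; this is the kind of interface I have in mind (names invented, replication and failure detection deliberately left out of the sketch):

import java.net.URI;
import java.util.List;

// hypothetical minimal discovery contract: services register their endpoints at startup,
// clients look peers up by name instead of hardcoding addresses
public interface ServiceRegistry {
    void register(String serviceName, URI endpoint);
    void unregister(String serviceName, URI endpoint);
    List<URI> lookup(String serviceName);
}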

Synchronous world

Every call in your system is synchronous, and you wait for things to actually happen before returning to the caller: you wait for the database to store the data, for another service to perform an operation, and so on.

The issue with this setup is that your service will be very slow, every call can potentially hang forever, and with just one malfunctioning service your whole system will be compromised.

You need to use 201 Created with a Location header as much as you can during creation operations (or 202 Accepted if you fancy), putting queues on top of your receivers and implementing asynchronous protocols from the start. It’s more complicated of course, but in terms of performance it pays off big time.
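A sketch of the 201 + Location approach with JAX-RS (resource name and id generation are made up, and the actual work is assumed to be handed over to a queue):

import java.net.URI;
import java.util.UUID;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.core.Response;

@Path("/reports")
public class ReportsEndpoint {

    @POST
    public Response createReport(String request) {
        // hypothetical: enqueue the work keyed by this id instead of generating the report inline
        String id = UUID.randomUUID().toString();

        // 201 Created + Location header: the caller immediately knows where the resource
        // will appear and can poll it while the work completes asynchronously
        return Response.created(URI.create("/reports/" + id)).build();
    }
}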

Babel tower

You have a lot of microservices and they use all sorts of different lingo to talk to each other (i.e. REST, SOAP, IIOP, RMI…)

The integration of a new microservice requires a lot of work, the number of adapters increases exponentially, and you always consider the cost of integration before deciding to create a new microservice, thus ending up with a small number of big services.

You need of course to standardize your protocols, and introduce boundary objects only where strictly required. A very successful approach consists in using REST as the synchronous (point-to-point) protocol and a lightweight messaging protocol for fully asynchronous communications, either a message queue like RabbitMQ or a pub/sub like Redis.
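For the asynchronous side, this is roughly what publishing an event looks like with the official RabbitMQ Java client (queue name and payload are made up):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class EventPublisher {

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");

        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        // a durable queue: consumers pick the messages up at their own pace
        channel.queueDeclare("user-events", true, false, false, null);
        channel.basicPublish("", "user-events", null, "user 42 created".getBytes("UTF-8"));

        channel.close();
        connection.close();
    }
}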

All your machines are belong to us

(a note about this one: I consider it an antipattern, but I can be very wrong, so please take it with caution)

A new virtual machine in the cloud is spun up when a new service is provisioned, and your architecture relies on this for locating or monitoring the services.

When this happens you basically need to have a lot of money to spend, and to be allowed not to care too much about things like your next AWS bill. Your architecture is now strictly tied to the money you have: if for any reason you have to shrink down, you won’t be able to do so in a timely fashion.

Make sure that your architecture is scalable without relying on metal/virtual scaling only; consider solutions based on containers like Docker, CoreOS, Mesos, application servers, or plain old barebones deployments. If your services are self-contained you can always run them on any machine that has the correct platform installed.

JVM issue: concurrency is affected by changing the date of the system! [part 4]

I am frequently challenged about the seriousness of this bug and the impact it has. It’s not the first time I try to explain that, because this bug affects LockSupport.parkNanos(), it basically spreads like a virus across the whole platform, but let’s look at this more practically:

$ grep -r -l "parkNanos" .

Well, it does not look that bad, does it? But uhm… I guess we are missing something… who’s using these classes? And who’s using those? And who’s using those? Omg… I am getting a headache! Almost EVERYTHING is using this! So trust me, you will be affected. Maybe you still don’t believe it, but please remember that this also affects Object.wait(long) and so, transitively, synchronized as well. Wait… WOOT? Oh yeah :) So lots of fun! Especially when your system, deployed on a client’s premises, starts doing “strange” things and you are called by your (not very happy) support team.
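If you want to see it with your own eyes without pulling in any library, a snippet as dumb as this should be enough on an affected JDK7/Linux box: run it, then move the system clock back one hour and watch the ticks (which should come every second regardless) stop:

public class WaitDemo {
    public static void main(String[] args) throws InterruptedException {
        final Object monitor = new Object();
        while (true) {
            synchronized (monitor) {
                // a plain Object.wait(timeout): on affected JVMs the timeout is turned into
                // an absolute wall-clock deadline, so moving the clock back delays the wakeup
                monitor.wait(1000);
            }
            System.out.println("tick");
        }
    }
}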

Be aware that this bug is now fixed in JDK8 and I have no knowledge of any successful backports of it into JDK7.

See also
The full saga, all the articles I published on the matter:

JVM issue: concurrency is affected by changing the date of the system! [part 3]

I have been asked for further information about the matter, and for that reason I am posting a bit more code here. It’s C++, so be aware of it! For the record, we are looking at the sources of the HotSpot JVM; you can find the source here:

Let’s have a look at the park() function of PlatformEvent, which is used within all synchronization primitives of the JVM:

int os::PlatformEvent::park(jlong millis) {
   // ... bookkeeping elided ...
   struct timespec abst;
   compute_abstime(&abst, millis);   // <== here the end time to wait is computed

   // ... the mutex is acquired and 'status' is declared ...
   while (_Event < 0) {
     status = os::Linux::safe_cond_timedwait(_cond, _mutex, &abst);
     if (status != 0 && WorkAroundNPTLTimedWaitHang) {
       pthread_cond_destroy (_cond);
       pthread_cond_init (_cond, NULL) ;
     }
     assert_status(status == 0 || status == EINTR ||
                   status == ETIME || status == ETIMEDOUT,
                   status, "cond_timedwait");
     if (!FilterSpuriousWakeups) break ;                 // previous semantics
     if (status == ETIME || status == ETIMEDOUT) break ;
     // We consume and ignore EINTR and spurious wakeups.
   }
   // ... rest of the function elided ...
Please look at the marked line, where the end time to wait is computed: if you open that function (line 5480) you will notice that it’s calculating an absolute time, based on the wall clock:

   static struct timespec* compute_abstime(timespec* abstime, jlong millis) {

      if (millis < 0)  millis = 0;

      struct timeval now;
      int status = gettimeofday(&now, NULL);
      // ... abstime is then built from 'now' (the wall clock) plus millis ...

So what will happen is that the park function waits on an absolute time based on the wall clock, and hence will fail miserably if the wall clock is changed.

The simplest fix, without changing too much code, would be to use CLOCK_MONOTONIC (or CLOCK_MONOTONIC_RAW, even better) to compute the absolute time ( clock_gettime(CLOCK_MONOTONIC, &ts) ) and also to check it the same way in the main loop (you can associate any available clock with a pthread_cond_timedwait).

Then, if we really want to stay on the safe side, we should avoid using absolute delays and use relative delays instead, as the POSIX specs explicitly guarantee that threads waiting on a relative time are not affected by changes to the underlying clock, while when using absolute delays the situation is historically “fuzzy”.

Is that complex? It does not look so, at least looking at the code (I will try to patch it myself for sure), but I surely do not grasp the complexity of the whole HotSpot, so I may fail miserably. It also has to be noted that my C++ skills are kind of dated :)

See also
The full saga, all the articles I published on the matter:

JVM issue: concurrency is affected by changing the date of the system! [part 2]

Based on a lot of questions I received on various mailing lists related to the previous post, and in order to make the issue simpler and clearer, I decided to go back to a concrete deliverable (code) that shows the problem; hope this helps!

This is my PreciousPool class, that handles Precious resources:

import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class PreciousPool {

    public static class Precious {
        private final int id;

        private Precious() {
            this.id = 100 + (int) (Math.random() * 900.0);
        }

        public String toString() {
            return "Precious n." + id;
        }
    }

    private final Lock lock;
    private final Condition ready;
    private final long timeoutInMillis;

    private final List<Precious> preciousLended;
    private final List<Precious> preciousAvailable;

    public PreciousPool(int size, long timeoutInSeconds) {
        this.lock = new ReentrantLock();
        this.ready = lock.newCondition();

        this.timeoutInMillis = 1000L * timeoutInSeconds;
        this.preciousLended = new ArrayList<Precious>();
        this.preciousAvailable = new ArrayList<Precious>();

        for (int i = 0; i < size; i++) {
            preciousAvailable.add(new Precious());
        }
    }

    public Precious obtain() {
        lock.lock();
        try {
            // if no precious are available we wait for the specified timeout (releasing the lock so that others can try)
            if (preciousAvailable.size() == 0) {
                try {
                    ready.await(timeoutInMillis, TimeUnit.MILLISECONDS);
                } catch (InterruptedException e) {
                    throw new RuntimeException("Somebody interrupted me!", e);
                }
            }

            // if a precious is available we unload it and return it to the caller, otherwise null
            if (preciousAvailable.size() > 0) {
                Precious value = preciousAvailable.remove(0);
                preciousLended.add(value);
                return value;
            } else {
                return null;
            }
        } finally {
            lock.unlock();
        }
    }

    public void release(Precious value) {
        lock.lock();
        try {
            if (!preciousLended.remove(value))
                throw new RuntimeException("Element " + value + " was not lended!");

            // if a precious is returned we put it back and signal to anybody waiting
            preciousAvailable.add(value);
            ready.signalAll();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String args[]) {
        final int size = 3;
        final PreciousPool pool = new PreciousPool(size, 5);

        // let's exhaust the pool
        for (int i = 0; i < size; i++)
            dump(pool.obtain());

        // and as we are stubborn we continuously ask for a new one
        while (true) {
            dump(pool.obtain());
        }
    }

    private static void dump(Precious precious) {
        if (precious == null)
            log("I did not get my precious :(");
        else
            log("I did get my precious! " + precious);
    }

    private static void log(String message) {
        final String now = new SimpleDateFormat("HH:mm:ss:SSSS ").format(new Date());
        System.out.println(now + message);
    }
}
So, the main is a single thread (no need for multithreading here, let’s keep it simple) that first exhausts the whole pool and then keeps asking, without success, for a resource. Stubborn guy, I say, but it happens. If you run this program everything works as expected: you are greeted by three successful Precious and then an endless list of failures that continuously grows. All good :)

02:34:40:0061 I did get my precious! Precious n.156
02:34:40:0062 I did get my precious! Precious n.991
02:34:40:0062 I did get my precious! Precious n.953
02:34:45:0064 I did not get my precious!
02:34:50:0065 I did not get my precious!
02:34:55:0066 I did not get my precious!
02:35:00:0067 I did not get my precious!
02:35:05:0068 I did not get my precious!

But guess what happens when, while the program is running, I change the date of my system back by one hour? Everything stops, it’s as simple as that. No prints, nothing, zero, nada. Now, if it wasn’t so late, I would probably wait one hour in order to see my program restored to its normal behaviour, but as a customer I wouldn’t be terribly happy :)

See also
The full saga, all the articles I published on the matter: