Microservices Antipatterns [2]

Looking forward to the upcoming Geecon conference in Prague I am trying to identify new antipatterns, and I have to say that's not that difficult, having been working with microservices here at Workshare for almost three years. The more we use them, the more dead ends we find :) and it's a natural Darwinian process I guess: regardless of all that you can find on the internet these days, sometimes it's only through your own experience that you learn. And, as Uncle Bob says, "you learn more from your mistakes than from your successes" :) So let's have a look at those new antipatterns!

Monolith Frontend

You have a single page application or a detached website that speaks to your beautiful microservices, you have to deploy it in full for every upgrade, and you are using vertical "feature" teams. Congrats, you are living the era of the monolithic frontend!

In fact you carefully structured your teams around different small sets of microservices, each one deployable individually, empowering your teams to deploy quickly and independently, but when you reach the frontend you go back to the old model of having a big fat application, with merges all over the place and a single deployment. This basically throws away all the benefits, team wise, of a microservice architecture.

The solution is of course splitting the app into small independent pieces, each deployable on its own, or moving the entire responsibility for the frontend to a separate team. Please note that neither of these is as difficult as it seems.

Early mornings

You have a scheduled rhythm for deployments (e.g. every week) and you deploy your services in the early morning, typically around 6am, in order to avoid any disturbance to production across the world. You have a manually created page that lists all the new services to be deployed and who is responsible for them.

The issue here is the lack of a critical component called continuous deployment, which would allow you, with the push of a button, to deploy all your software to production. As you do not have a fully automated procedure to deploy to production, you are probably also lacking other critical parts such as redundancy or high availability of your services.

The solution is to take your time and put things in order :) Well crafted microservices allow you to deploy very fast and without particular hassle: you first have to make sure that your code supports that, and then that your devops are fully on board in providing the critical parts required, usually in the form of automation scripts with any decent framework.

Microservices Antipatterns

Last Friday I was a speaker at the Geecon Microservices conference in Sopot. I was planning to talk about the whole thing, mentioning of course our work at Workshare on msnos, and I had a long presentation but only a 30 minute slot. Also, I went on stage after a shitload of good speeches on the topic, so I decided to talk about a small niche not fully exploited at the moment, microservices antipatterns: this is a short recap.

Ah, why do I know about this stuff? Because at Workshare we have a black monolith, and when I joined in January 2013 I immediately started pushing for microservices :)

Disneyworld

When we are in Disneyworld everything smells nice, cosy and lovely. There are no healthchecks, no monitoring and no metrics, so you assume everything is well in your systems, all the time.

It may seem obvious to have monitoring on any system in production, but some people think that because the services are small and simple nothing will go wrong. Of course this is silly, because we know that if something can go wrong, it will! Also, small does not imply simple, as the complexity of any distributed architecture is orders of magnitude higher compared to a standard monolithic system.

For that reason you need each service to be able to produce a self healthcheck and metrics, possibly through an HTTP endpoint, and to surround it with some external monitoring. It also won't hurt to add some smoke testing for each environment, possibly through tests created by your QA people. Implementation wise, Dropwizard is a valid option for Java services, and we have our own small open source implementation for Ruby.
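
As a concrete illustration, here is a minimal sketch of a self-healthcheck endpoint written against the JDK's built-in HTTP server (the port, path and dependency check are made up for the example; a real service would also expose proper metrics, which Dropwizard gives you out of the box):

import java.io.OutputStream;
import java.net.InetSocketAddress;
import com.sun.net.httpserver.HttpServer;

public class HealthEndpoint {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
        server.createContext("/healthcheck", exchange -> {
            boolean healthy = dependenciesReachable();          // hypothetical check of db, queues, peers
            byte[] body = (healthy ? "{\"status\":\"UP\"}" : "{\"status\":\"DOWN\"}").getBytes();
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(healthy ? 200 : 503, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();                                          // external monitoring polls /healthcheck
    }

    private static boolean dependenciesReachable() {
        return true;                                             // placeholder: ping the real dependencies here
    }
}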

Monolith database

In this scenario many of your microservices are sharing the same database: they expose nice, different REST endpoints, but all the data ends up in the same bin.

The issues with this are multiple. First, all your services are coupled to the same schema, and so are your models (if you have any): a change in the database may require you to propagate a change across your models. Furthermore, your services won't be able to be released independently, as a database migration required by service A may (and will) require a change in services X/Y/Z: one of the big advantages of this architecture is out of the window.

You need each service to use its own database, and when necessary they will talk to each other through your APIs. You should design your external API with this constraint in mind, moving on from the big SQL joins and welcoming asynchronous completion of tasks, a user interface that progressively updates itself, and APIs that return links to other resources rather than embedding them.
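
For example, a sketch (all names purely hypothetical) of an order representation that links to data owned by other services instead of joining it in from a shared database:

// Hypothetical response model: the order service owns only its own data and hands out
// links; the customer and invoice live behind other services' APIs and databases.
public class OrderRepresentation {
    public String id;
    public String status;
    public String customerUrl;   // e.g. "https://customers.internal/customers/42"
    public String invoiceUrl;    // e.g. "https://billing.internal/invoices/42"
}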

Unknown caller

Your microservices are calling each other using their endpoints, but there is no correlation between such calls: each call is a new one, completely unrelated to the one that originated it.

When this happens there’s no easy way to track down what caused the failure of a call that fails somewhere in the chain of services. As there’s no common identification between such calls, you usually end up in an endless marathon of log lurking in order to understand the reason for the failure.

The solution lies in peer collaboration: each server must inject a call id if missing, propagate the call id if present, and log each call including the id. This allows the calls to be correlated, providing a clear chain of invocations between the services.
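
A minimal sketch of that rule as a servlet filter (the header name and the logging are assumptions, pick whatever convention fits your stack):

import java.io.IOException;
import java.util.UUID;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class CallIdFilter implements Filter {
    private static final String HEADER = "X-Call-Id";

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        String callId = request.getHeader(HEADER);
        if (callId == null || callId.isEmpty()) {
            callId = UUID.randomUUID().toString();               // inject a new id if missing
        }
        request.setAttribute(HEADER, callId);                    // make it available to outgoing HTTP clients
        ((HttpServletResponse) res).setHeader(HEADER, callId);   // echo it back so callers can correlate too
        System.out.println(HEADER + "=" + callId + " " + request.getRequestURI()); // log every call with the id
        chain.doFilter(req, res);
    }

    public void init(FilterConfig config) {}
    public void destroy() {}
}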

Hardcoded hell

All the addresses (endpoints) of your services are hardcoded somewhere, sometimes directly in the code.

Your microservices directly know about each other, so every time you need to add a new microservice you have to crack open your code or, if you are in a slightly better situation, your configuration or deployment scripts. If you are also experiencing the “All your machines are belong to us” antipattern, you may be using some form of DNS or naming trickery at the machine level.

You can easily build your own simple discovery mechanism, with a replicated registry that hosts service information, or you can introduce an existing discovery mechanism such as Eureka, ZooKeeper, Consul or msnos (yeah, shameless plug!).
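
Here is a (deliberately naive) sketch of such a registry; a real one would add replication, heartbeats and lease expiry, which is exactly what Eureka, ZooKeeper, Consul or msnos give you for free:

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

public class ServiceRegistry {
    private final Map<String, List<String>> endpoints = new ConcurrentHashMap<String, List<String>>();

    // a service announces itself under a logical name instead of being hardcoded anywhere
    public void register(String serviceName, String baseUrl) {
        endpoints.putIfAbsent(serviceName, new CopyOnWriteArrayList<String>());
        endpoints.get(serviceName).add(baseUrl);
    }

    public void deregister(String serviceName, String baseUrl) {
        List<String> urls = endpoints.get(serviceName);
        if (urls != null) urls.remove(baseUrl);
    }

    // callers look peers up by name at runtime
    public List<String> lookup(String serviceName) {
        List<String> urls = endpoints.get(serviceName);
        return urls == null ? Collections.<String>emptyList() : urls;
    }
}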

Synchronous world

Every call in your systems is synchronous, and you wait for things to actually happen before returning to the caller: you wait for the database to store the data, another service to perform an operation, and so on.

The issue with this setup is that your services will be very slow, every call can potentially hang forever, and just one malfunctioning service can compromise your whole system.

You need to use 201 Created with a Location header as much as you can for creation operations (or 202 Accepted if you fancy), put queues on top of your receivers, and implement asynchronous protocols from the start. It’s more complicated of course, but in terms of performance it pays off big time.
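
A runnable sketch of the idea, using the JDK's built-in HTTP server just to keep it self-contained (paths and the in-process queue are illustrative only): the endpoint enqueues the work, answers 202 Accepted immediately, and points the caller at a status resource via the Location header.

import java.net.InetSocketAddress;
import java.util.UUID;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import com.sun.net.httpserver.HttpServer;

public class AsyncCreation {
    private static final BlockingQueue<String> pending = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws Exception {
        // a worker drains the queue and does the slow part (database writes, calls to other services...)
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    String id = pending.take();
                    System.out.println("processing " + id);      // placeholder for the real work
                }
            } catch (InterruptedException ignored) { }
        });
        worker.start();

        HttpServer server = HttpServer.create(new InetSocketAddress(8082), 0);
        server.createContext("/documents", exchange -> {
            String id = UUID.randomUUID().toString();
            pending.offer(id);                                    // hand off, do not wait for completion
            exchange.getResponseHeaders().set("Location", "/documents/" + id);
            exchange.sendResponseHeaders(202, -1);                // 202 Accepted, empty body
            exchange.close();
        });
        server.start();
    }
}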

Babel tower

You have a lot of microservices and they are using all sorts of different lingo to talk to each other (e.g. REST, SOAP, IIOP, RMI…)

The integration of a new microservice requires a lot of work, the number of adapters increases exponentially, and you always weigh the cost of integration before deciding to create a new microservice, thus ending up with a small number of big services.

You of course need to standardize your protocols, and introduce boundary objects only where strictly required. A very successful approach consists in using REST as the synchronous (point-to-point) protocol and a lightweight messaging protocol for fully asynchronous communications, such as a message queue like RabbitMQ or a pub/sub like Redis.
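
For the asynchronous half, publishing an event to a RabbitMQ queue can be as small as this sketch (queue name and payload are invented; check the client version you use for the exact API):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class EventPublisher {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        // fire-and-forget: whoever cares about uploads consumes this queue, nobody waits
        channel.queueDeclare("document.uploaded", true, false, false, null);
        String event = "{\"documentId\":\"42\",\"uploadedBy\":\"alice\"}";
        channel.basicPublish("", "document.uploaded", null, event.getBytes());

        channel.close();
        connection.close();
    }
}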

All your machines are belong to us

(a note about this one: I consider it an antipattern, but I can be very wrong, so please take it with caution)

A new virtual machine in the cloud is spun up whenever a new service is provisioned, and your architecture relies on this for locating or monitoring the services.

When this happens you basically need a lot of money to spend, and you are allowed not to care much about things like your next AWS bill. Your architecture is now strictly tied to the money you have: if for any reason you have to shrink down, you won’t be able to do so in a timely manner.

Make sure that your architecture can scale without relying only on metal/virtual scaling: consider solutions based on containers like Docker, CoreOS or Mesos, application servers, or plain old bare-bones deployments. If your services are self-contained you can always run them on any machine that has the correct platform installed.

JVM issue: concurrency is affected by changing the date of the system! [part 4]

I am frequently challenged about the seriousness of this bug and the impact it has. It’s not the first time I have tried to explain that, because this bug affects LockSupport.parkNanos(), it basically spreads like a virus across the whole platform, but let’s look at this more practically:

$ grep -r -l "parkNanos" .
./java/util/concurrent/Future.java
./java/util/concurrent/FutureTask.java
./java/util/concurrent/ThreadPoolExecutor.java
./java/util/concurrent/ScheduledThreadPoolExecutor.java
./java/util/concurrent/RunnableFuture.java
./java/util/concurrent/package-info.java
./java/util/concurrent/ExecutorCompletionService.java
./java/util/concurrent/CancellationException.java
./java/util/concurrent/RunnableScheduledFuture.java
./java/util/concurrent/CompletableFuture.java
./java/util/concurrent/AbstractExecutorService.java
./javax/swing/text/JTextComponent.java
./javax/swing/SwingWorker.java
./sun/swing/text/TextComponentPrintable.java
./sun/swing/SwingUtilities2.java

Well, it does not look that bad, does it? But uhm… I guess we are missing something… who’s using these classes? And who’s using those classes? And who’s using those? Omg… I am getting a headache! Almost EVERYTHING is using this! So trust me, you will be affected. Maybe you still don’t believe it, but please remember that this also affects Object.wait(long) and so, transitively, synchronized as well. Wait… WOOT? Oh yeah :) So lots of fun! Especially when your system, deployed on client premises, starts doing “strange” things and you are called by your (not very happy) support team.
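
To make that transitive reach concrete, here is a tiny sketch: neither of these snippets mentions parkNanos() anywhere, yet both of them end up parked in it while waiting, so both are exposed to the bug.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TransitiveDemo {
    public static void main(String[] args) throws Exception {
        // a timed poll on an empty queue parks the thread via LockSupport.parkNanos()
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1);
        System.out.println("poll returned: " + queue.poll(2, TimeUnit.SECONDS));

        // a timed Future.get() does exactly the same underneath
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> future = pool.submit(() -> { Thread.sleep(5000); return "done"; });
        try {
            future.get(2, TimeUnit.SECONDS);
        } catch (TimeoutException expected) {
            System.out.println("future timed out, as expected");
        }
        pool.shutdownNow();
    }
}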

Be aware that this bug is now fixed in JDK8 and I have no knowledge of any successful backports of it into JDK7.


JVM issue: concurrency is affected by changing the date of the system! [part 3]

I have been asked for further information about the matter, and for that reason I am posting a bit more code here. It’s C++, so be aware of it! For the record, we are looking at the sources of the hotspot JVM; you can find them here:
http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/aed585cafc0d/src/os/linux/vm/os_linux.cpp

Let’s have a look at the park() function of PlatformEvent, which is used within all synchronization primitives of the JVM:

int os::PlatformEvent::park(jlong millis) {
   [...]
 
   struct timespec abst;
   compute_abstime(&abst, millis);
   [...]

   while (_Event < 0) {
     status = os::Linux::safe_cond_timedwait(_cond, _mutex, &abst);
     if (status != 0 && WorkAroundNPTLTimedWaitHang) {
       pthread_cond_destroy (_cond);
       pthread_cond_init (_cond, NULL) ;
     }
     assert_status(status == 0 || status == EINTR ||
                   status == ETIME || status == ETIMEDOUT,
                   status, "cond_timedwait");
     if (!FilterSpuriousWakeups) break ;                 // previous semantics
     if (status == ETIME || status == ETIMEDOUT) break ;
     // We consume and ignore EINTR and spurious wakeups.
   }

Please look at the line where compute_abstime(&abst, millis) is called, which is where the end time to wait is computed: if you open that function (line 5480) you will notice that it calculates an absolute time, based on the wall clock:

   [...]
   static struct timespec* compute_abstime(timespec* abstime, jlong millis) {

      if (millis < 0)  millis = 0;

      struct timeval now;
      int status = gettimeofday(&now, NULL);
      [...]

So what will happen is that the park function will be waiting on an absolute time based on the wall clock, hence it will fail miserably if the wall clock is changed.
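
To translate the mistake into plain Java terms (this is a hypothetical sketch, not the JVM's code): a timed wait built on an absolute wall-clock deadline stretches arbitrarily when the clock is moved back, while one built on a monotonic clock does not.

public class DeadlineSketch {

    // absolute deadline taken from the wall clock: moving the system clock back while
    // we wait makes the loop run far longer than the requested delay
    static void waitWallClock(long millis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + millis;
        while (System.currentTimeMillis() < deadline) {
            Thread.sleep(10);
        }
    }

    // deadline taken from the monotonic clock: immune to settimeofday()/NTP adjustments
    static void waitMonotonic(long millis) throws InterruptedException {
        long deadline = System.nanoTime() + millis * 1000000L;
        while (System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        waitMonotonic(1000);
        System.out.println("monotonic wait done");
        waitWallClock(1000);      // change the system date backwards during this call to see it hang
        System.out.println("wall-clock wait done");
    }
}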

The simplest fix, without changing too much code, would be to use CLOCK_MONOTONIC (or CLOCK_MONOTONIC_RAW, even better) to compute the absolute time ( clock_gettime(CLOCK_MONOTONIC, &ts) ) and also to check it the same way in the main loop (you can associate any available clock with a pthread_cond_timedwait).

Then, if we really want to stay on the safe side, we should avoid using absolute delays and use relative delays instead, as the POSIX specs explicitly guarantee that threads waiting on a relative time are not affected by changes to the underlying clock, while with absolute delays the situation is historically “fuzzy”.

Is that complex? It does not look so, at least from the code (I will try to patch it myself for sure), but I surely do not grasp the complexity of the whole hotspot, so I may fail miserably. It also has to be noted that my C++ skills are kind of dated :)


JVM issue: concurrency is affected by changing the date of the system! [part 2]

Based on a lot of questions I received on various mailing lists about the previous post, and in order to make the issue simpler and clearer, I decided to go back to a concrete deliverable (code) that shows the problem. Hope this helps!

This is my PreciousPool class, that handles Precious resources:

import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class PreciousPool {

    public static class Precious {
        private final int id;

        private Precious() {
            this.id = 100+(int)(Math.random()*900.0);
        }

        public String toString() {
            return "Precious n."+id;
        }
    }

    private final Lock lock;
    private final Condition ready;
    private final long timeoutInMillis;

    private final List<Precious> preciousLended;
    private final List<Precious> preciousAvailable;

    public PreciousPool(int size, long timeoutInSeconds) {
        this.lock = new ReentrantLock();
        this.ready = lock.newCondition();

        this.timeoutInMillis = 1000L*timeoutInSeconds;
        this.preciousLended = new ArrayList<Precious>();
        this.preciousAvailable = new ArrayList<Precious>();

        for (int i = 0; i < size; i++) {
            preciousAvailable.add(new Precious());
        }
    }

    public Precious obtain()  {
        lock.lock();
        try {
            // if no precious are available we wait for the specified timeout (releasing the lock so that others can try)
            if (preciousAvailable.size() == 0) {
                try {
                    ready.await(timeoutInMillis, TimeUnit.MILLISECONDS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException("Somebody interrupted me!", e);
                }
            }

            // if a precious is available we unload it and return to the caller, otherwise null
            if (preciousAvailable.size() > 0) {
                Precious value = preciousAvailable.remove(0);
                preciousLended.add(value);
                return value;
            } else {
                return null;
            }
        } finally {
            lock.unlock();
        }
    }

    public void release(Precious value) {
        lock.lock();
        try {
            if (!preciousLended.remove(value))
                throw new RuntimeException("Element "+value+" was not lended!");

            // if a precious is returned we put it back and signal to anybody waiting
            preciousAvailable.add(value);
            ready.signalAll();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String args[]) {
        final int size = 3;
        final PreciousPool pool = new PreciousPool(size, 5);

        // let's exhaust the pool
        for (int i=0; i<size; i++)
            dump(pool.obtain());

        // and as we are stubborn we continuously ask for a new one
        while(true) {
            dump(pool.obtain());
        }
    }

    private static void dump(Precious precious) {
        if (precious == null)
            log("I did not get my precious :(");
        else
            log("I did get my precious! "+precious);
    }

    private static void log(String message) {
        final String now = new SimpleDateFormat("HH:mm:ss:SSSS ").format(new Date());
        System.out.println(now + message);
    }
}

So, the main is a single thread (no need for multithreading here, let’s keep it simple) that first exhausts the whole pool and then keeps asking, without success, for a resource. Stubborn guy, I say, but it happens. If you run this program everything works as expected: you are greeted by three successful Precious grabs and then an endless list of failures that continuously grows. All good :)

02:34:40:0061 I did get my precious! Precious n.156
02:34:40:0062 I did get my precious! Precious n.991
02:34:40:0062 I did get my precious! Precious n.953
02:34:45:0064 I did not get my precious!
02:34:50:0065 I did not get my precious!
02:34:55:0066 I did not get my precious!
02:35:00:0067 I did not get my precious!
02:35:05:0068 I did not get my precious!
[...]

But guess what happens when, while the program is running, I change the date of my system back by one hour? Everything stops, it’s as simple as that. No prints, nothing, zero, nada. If it wasn’t so late I would probably wait one hour in order to see my program restored to its normal behaviour, but as a customer I wouldn’t be terribly happy :)


JVM issue: concurrency is affected by changing the date of the system!

Executive summary
The implementation of the concurrency primitive LockSupport.parkNanos(), the function that underpins *every* concurrency primitive on the JVM, is flawed: any NTP sync, or any system time change backwards, can potentially break it with unexpected results across the board when running a 64-bit JVM on 64-bit Linux.

What do we need to do?
This is an old issue, and the bug was declared private. I somehow managed to have the bug reopened to the public, but it’s still a P4, which means it probably won’t be fixed. I think we need to push for a resolution ASAP: make sure the fix is in for JDK9, and make every possible effort to get it into JDK8 or, at least, into a later patch release. In an ideal world it would be nice to have a patch for JDK7 as well.

Why all this urgency?
If a system time change happens then all the parked threads will hang, with unpredictable/corrupted/useless results for the end user. The same applies to Future, Queue, Executor, and any other construct that is somehow related to concurrency. This is a big issue for us and for any near real-time application: think about trading and betting, where the JVM is largely used. And please do not restrict yourself to the Java language: add Scala and any other JVM-based language to the picture.

All the details (spoiler: tech stuff!)
To be clearer about the issue, its extent, and how it touches the concurrency library, let me introduce this very simple program:

import java.util.concurrent.locks.LockSupport;

public class Main {
  public static void main(String[] args) {
    for (int i=100; i>0; i--) {
      System.out.println(i);
      LockSupport.parkNanos(1000L*1000L*1000L);
    }
    System.out.println("Done!");
  }
}

Run it with a 64-bit 1.6+ JVM on 64-bit Linux, turn the clock back one hour and wait until the counter stops… magic! I tested this on JDK6, JDK7 and the latest JDK8 beta running on various Ubuntu distros. It’s not just a matter of the (old?) sleep() and wait() primitives: this issue affects the whole concurrency library.

To prove that this is fixable, I reimplemented the program above, substituting LockSupport.parkNanos() with a JNI call to clock_nanosleep(CLOCK_MONOTONIC…): it works like a charm.
This is due to the fact that the C++ code is calling pthread_cond_timedwait() using its default clock (CLOCK_REALTIME) which, unfortunately, is affected by settime()/settimeofday() calls (on Linux): for that reason it cannot be used to measure nanosecond delays, which is what the specification requires. CLOCK_REALTIME is not guaranteed to count monotonically, as this is the actual “system time”: each time my system syncs time using an NTP server on the net, the time might jump forward or backward. The correct call (again on Linux) would require using CLOCK_MONOTONIC (or better, CLOCK_MONOTONIC_RAW) as the clock id, which has been defined by the POSIX specs since 2002.

The POSIX spec is in fact clear, as it states “…setting the value of the CLOCK_REALTIME clock via clock_settime() shall have no effect on threads that are blocked waiting for a relative time service based upon this clock…”: it explicitly says “relative”. Having a look at the hotspot code, it appears that park() uses compute_abstime() (which uses gettimeofday) and then waits on an absolute deadline: for that reason it is influenced by system clock changes. Very wrong.

Next steps?
I am trying to raise awareness of this issue, basically involving as many people as I can. I will continue to do that and escalate as soon as I get a solid response from Oracle. I am also working on a patch myself.


Gitinabox, get your private git!

The long wait is over! GerritForge LLP today announced the availability of the first private beta of Git-in-a-box, download it today!

Git-in-a-box is the revolutionary HTML5 user experience to get started using Git on your premises with a “plug & play” server installation. Take a tour of the available features and check out the howto section. No install, no hassle: just double click on the executable Java jar and you will have a fully featured Git backend, fully managed through a nice web interface, while enjoying HTTP and SSH access to your repositories from your network!

What can you do with it?

  • create and manage your repositories with simple clicks
  • create and manage users and groups with no dependencies
  • control security and access to your code
  • access statistics and browsing
  • enjoy a pleasant user interface on your phone, tablet or computer

Git-in-a-box is currently in private beta until the end of March 2013. Once all the feedback has been incorporated and the stabilisation fixes are in, the final release of Git-in-a-box will be officially available for download from April 2013.

Check it out at gitinabox.com!