How to offload work from Mendix to Azure?
As someone who has been working with Mendix for several years, I am acutely aware of the limitations of the Mendix cloud offering. It is rigid, expensive and offers very little room for customization. So when I joined the smart digital factory team more than two years ago, we decided to offload computational load to another cloud provider where we would have more control and flexibility. Since then, we have successfully migrated a few key components of the factory to the Azure cloud.
In this blog post, I want to share how we achieved this and offer a "guide" for other developers seeking to migrate their workloads to the cloud. To start off, let me introduce the basic concepts.
Cloud and Azure basics
The basic idea of the cloud is that instead of buying servers, which comes with a huge upfront cost, companies and individuals lease them from cloud providers such as Amazon, Microsoft or Google. Over the years that basic idea has evolved, and providers now offer many products and services in addition to leasing hardware. Some popular and important services from Azure are:
- virtual machines - as basic as it gets, full freedom to install and run anything
- webapp - managed web app environment with support for many languages: Java, Node.js etc.
- functions - serverless offering that scales automatically and requires only minimal code
- blob storage - for storing large files (>1 MB) like audio, images and video
- queue storage - send short messages for producer-consumer scenarios
- service bus - similar to queue, but faster and has better guarantees
Extra (not used in the post, but good to know):
- table storage - scalable document based storage (think MongoDB), not suitable as main DB
- kubernetes - managed cluster, great in combination with VMs
Most services come with a pay-as-you-go plan, meaning that customers are only charged for the time that the service is running. Generally any service can be started and stopped at any time.
Some important reasons to move computation to the cloud are:
- scalability and elasticity - fancy words to say that the app is able to react and adapt to load changes, e.g. more user activity. Imagine that a component in your app is near (or is projected to reach) its limits and more load would cause a drop in performance or worse. In the cloud we can take advantage of their services and vast infrastructure to scale that component as needed and also independently of the rest of our app.
- control over environment - the Mendix platform gives very limited control over the server where the app is running. It is not possible to, for example, install a binary executable, an OS library, a specific version of Java, or another language runtime such as Node.js. It is not even possible to configure the database. With other cloud services, notably VMs but also functions and to some extent web apps, there is more freedom to configure things and install additional programs.
- costs - the Mendix platform hosting can be quite expensive and in certain cases it might be cost-efficient to move components that generate heavy load to the cloud. This is especially the case where components need a lot of resources for a short time or are subject to variable load.
VMs vs. Web Apps vs. Functions: How to choose
These are the main compute services. Here is a quick rule of thumb of when to use which one:
- Functions - use for short, asynchronous workloads, up to 10 minutes
- Web apps - use for either long-running or synchronous workloads where response time is crucial
- VMs - same as web apps, but where additional binaries or libraries are needed
Both functions and web apps should be preferred to VMs. With them there is no need to install e.g. Java or Node.js, and deploying the app code is enough. Also, the main concerns of scalability, elasticity, availability and more all come by default or are a single toggle away.
Where VMs and web apps are "on" all the time, even if there is no work to do (and are billed accordingly), functions are more like the Mendix sandbox. When there are no requests, the function goes to sleep. As soon as a request comes in, the function is "awoken" to serve it. This means it can be significantly cheaper to use a function, depending on how often it is used. However, just as with the Mendix sandbox, it might take a while to wake up the function, which is why functions are not a good candidate for quick calls where the response time needs to be under a second.
This can be overcome by opting for a so-called "warm start" function, meaning that one instance of the function is always running and ready to serve requests. Naturally, this comes with higher costs. For more details see https://azure.microsoft.com/nl-nl/blog/understanding-serverless-cold-start/
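The rule of thumb from this section can be written down as a small decision function. This is only a sketch in Python (chosen for brevity): the `Workload` fields and the `choose_compute` name are made up for illustration, and the 10-minute threshold is simply the limit quoted above.

```python
from dataclasses import dataclass

FUNCTION_MAX_MINUTES = 10  # limit quoted in the rule of thumb above


@dataclass
class Workload:
    max_runtime_minutes: float    # longest expected run of a single job
    needs_fast_response: bool     # caller waits synchronously for the answer
    needs_custom_binaries: bool   # e.g. an OS library or a native executable


def choose_compute(w: Workload) -> str:
    """Return 'vm', 'web app' or 'function' following the rule of thumb."""
    if w.needs_custom_binaries:
        return "vm"
    if w.needs_fast_response or w.max_runtime_minutes > FUNCTION_MAX_MINUTES:
        return "web app"
    return "function"


# A job that can run for hours versus a short asynchronous retrieval.
long_job = choose_compute(Workload(180, False, False))
short_job = choose_compute(Workload(3, False, False))
print(long_job, "/", short_job)
```

In practice cost and team experience also play a role, so treat the result as a starting point rather than a verdict.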
Enough theory for now. Let us look at some real examples.
Example A: ATS Runner
The ATS runner is a component in ATS that handles the execution of individual test cases and orchestrates the execution of test suites. We are moving it to the cloud so that we can scale it independently of the Mendix platform. It was already reaching its limits, and because of a planned new feature we expected that it would no longer perform acceptably unless it was moved to the cloud and scaled up.
We chose a web app instead of a function because:
- a single job can last up to a few hours.
- the resource needs of a job are hard to predict. While some jobs are very small, others need a lot of memory and CPU to execute, which can be prohibitively expensive. With a web app we have better control over how jobs are picked up, so that the resources of a machine are used optimally.
Example B: Model fetch
This component uses a Node.js library (MendixSDK) to access a certain API and process its data. It was implemented in the cloud because we could not run Node.js scripts in the Mendix runtime, whereas in the cloud we could use whatever language was appropriate. We chose a function here because:
- the retrieval takes a few minutes on average, so it fits within the 10-minute limit
- it can be asynchronous; we lose nothing by waiting a few more seconds
Common architecture patterns
The next step after choosing between a function and a web app is to figure out how to connect it to your Mendix app. Again, it helps to look at a concrete example, like the ATS runner.
A simple hybrid architecture for the ATS runner could look like this:
In a KISS fashion, we could take the runner component (which is a piece of Java code), wrap it in a Spring container (a server framework for Java) and host it as a web app in Azure. Communication between the main component and the runner would then happen over HTTP instead of directly on the same machine via method calls. This is the minimal work needed to move a component to the cloud.
This looks simple at first sight, until you realize that neither the internet nor the two components are 100% reliable. Some of the things that could go wrong are:
- Remote runner is unreachable - job request could be lost.
- ATS main is unreachable - logs are not delivered.
- Internet is unreliable - jobs or logs could be lost, could arrive multiple times or out of order.
These are all solvable problems. We could introduce retries to make sure messages are not lost and confirmations to make sure they are handled, and we could add some kind of sequence numbering to ensure the order and idempotency tokens to guard against duplicates. But adding all of this would complicate our code, and it has nothing to do with our business logic. It really sounds like a cross-cutting concern across our tech stack that should be addressed in a systemic way rather than by throwing more code at the problem at every integration point.
This is where the other Azure services, such as storage, come into play.
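To give an idea of how much plumbing this means, here is a minimal Python sketch of just the receiving side, with sequence numbers and idempotency tokens. The `LogMessage` and `LogReceiver` names and their fields are hypothetical and not taken from ATS, and sender-side retries and confirmations are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class LogMessage:
    job_id: str
    seq: int      # sequence number assigned by the sender, starting at 0
    token: str    # idempotency token, unique per logical message
    text: str


@dataclass
class LogReceiver:
    """Accepts messages in any order, possibly duplicated; emits each once, in order."""
    seen: set = field(default_factory=set)         # tokens already handled
    next_seq: dict = field(default_factory=dict)   # job_id -> next expected seq
    pending: dict = field(default_factory=dict)    # (job_id, seq) -> message
    delivered: list = field(default_factory=list)  # texts in delivery order

    def receive(self, msg: LogMessage) -> None:
        if msg.token in self.seen:
            return  # duplicate: already handled, ignore it
        self.seen.add(msg.token)
        self.pending[(msg.job_id, msg.seq)] = msg
        # Release every buffered message that is now next in line for this job.
        expected = self.next_seq.get(msg.job_id, 0)
        while (msg.job_id, expected) in self.pending:
            self.delivered.append(self.pending.pop((msg.job_id, expected)).text)
            expected += 1
        self.next_seq[msg.job_id] = expected


receiver = LogReceiver()
for m in [LogMessage("job-1", 1, "b", "step 2"),   # arrives too early
          LogMessage("job-1", 0, "a", "step 1"),
          LogMessage("job-1", 1, "b", "step 2")]:  # duplicate delivery
    receiver.receive(m)
print(receiver.delivered)
```

Even this toy version needs three pieces of bookkeeping per integration point, which is exactly the kind of code we would rather not maintain ourselves.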
Benefits of Azure queue and service bus
By using a queue storage or a service bus, many of the above problems can be eliminated:
- availability - Azure queues have very high availability, so messages will not be lost. The Azure storage SDKs have retry mechanisms and all kinds of error handling built in.
- guaranteed delivery - the service itself guarantees that a message will be delivered at least once. While one consumer is processing a message it is hidden from the others, so even with multiple listeners/consumers it is normally handled by only one of them. Duplicates are still possible (for example when processing takes longer than the timeout described in the next point), so consumers should tolerate receiving a message twice.
- retries - are built into the queue service. Every time a message is read (popped) from the queue, the consumer needs to delete it once it has fully processed it. If this delete signal is not received, Azure automatically makes the message available in the queue again after a configurable amount of time. This can repeat a configurable number of times. Very old messages can also be deleted automatically by using a time-to-live property.
- delivery order - service bus furthermore guarantees the order of delivery per session, where a session is a user-defined construct. In the case of ATS a session is a job. Logs from a job are delivered in order, while logs from different jobs can be delivered in any order, in parallel.
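The retry behaviour described above can be illustrated with a toy in-memory model. The `VisibilityQueue` class below is not the Azure SDK and says nothing about how Azure implements this internally; the names and the explicit `now` parameter are assumptions made so that the example runs without any cloud account.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    msg_id: int
    body: str
    visible_at: float = 0.0   # time from which the message may be received
    dequeue_count: int = 0    # how often it has been handed to a consumer


class VisibilityQueue:
    def __init__(self, visibility_timeout: float, max_dequeue_count: int):
        self.visibility_timeout = visibility_timeout
        self.max_dequeue_count = max_dequeue_count
        self._entries = {}    # msg_id -> Entry, in insertion order
        self._next_id = 0

    def send(self, body: str) -> int:
        entry = Entry(self._next_id, body)
        self._entries[entry.msg_id] = entry
        self._next_id += 1
        return entry.msg_id

    def receive(self, now: float) -> Optional[Entry]:
        for entry in list(self._entries.values()):
            if entry.visible_at > now:
                continue  # a consumer is still expected to be processing it
            if entry.dequeue_count >= self.max_dequeue_count:
                del self._entries[entry.msg_id]  # give up after too many attempts
                continue
            entry.dequeue_count += 1
            entry.visible_at = now + self.visibility_timeout
            return entry
        return None

    def delete(self, msg_id: int) -> None:
        self._entries.pop(msg_id, None)


q = VisibilityQueue(visibility_timeout=30, max_dequeue_count=3)
q.send("run job 42")
first = q.receive(now=0)      # consumer A takes the message, then crashes
hidden = q.receive(now=10)    # nothing: the message is hidden from others
retry = q.receive(now=31)     # timeout passed, the message is delivered again
q.delete(retry.msg_id)        # processing succeeded, the message is gone
```

The important design point is that the consumer, not the queue, decides when a message is done: as long as nobody deletes it, it keeps coming back.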
But wait, there is more:
- rate limiting - if a consumer has resources available (RAM and CPU) it can read more messages from the queue, and otherwise pause until resources become available. The producer can write as many messages to the queue as desired at any rate and need not concern itself with the capacity of the consumer.
- load balancing - a queue/service bus can have multiple listeners. So it can act as a self-organizing load balancer. If a consumer is reaching its limit in terms of resources, then it can pause from reading messages from the queue and let other consumers pick them up.
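Both points come down to consumers pulling work at their own pace. The sketch below shows the idea with an in-memory `deque` standing in for the queue; the `Consumer` class and its `capacity` field are hypothetical, and a real consumer would measure free RAM and CPU rather than count jobs.

```python
from collections import deque


class Consumer:
    def __init__(self, name: str, capacity: int):
        self.name = name
        self.capacity = capacity   # how many jobs fit on this machine at once
        self.running = []          # jobs currently being processed

    def poll(self, queue: deque) -> None:
        """Take messages only while there is spare capacity, otherwise pause."""
        while queue and len(self.running) < self.capacity:
            self.running.append(queue.popleft())

    def finish_one(self) -> None:
        """Mark the oldest running job as done, freeing one slot."""
        if self.running:
            self.running.pop(0)


# The producer writes five jobs without caring who will pick them up.
queue = deque(f"job-{i}" for i in range(5))
small, big = Consumer("small", 1), Consumer("big", 3)

small.poll(queue)   # takes one job and is full
big.poll(queue)     # takes three jobs and is full
snapshot = (list(small.running), list(big.running), list(queue))

small.finish_one()  # a slot frees up ...
small.poll(queue)   # ... so the waiting job is picked up
```

No component needs to know how many consumers exist or how busy they are; the queue absorbs the difference between production and consumption rates.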
Limitations of queue and service bus
But it is not all roses when working with queues. The main limitation is that working with queues can slow things down. So this approach is only adequate where response times are not the main concern and delays of a few hundred milliseconds can be tolerated. The queue especially can have considerable delay and, although the service bus is much better in this regard, both are still slower than direct HTTP communication.
Another major limitation is the message size: at most 64 KB for queues and at most 256 KB for service bus (1 MB with premium). This is not terrible, but a single image in ATS is easily larger than that. To overcome this limitation, queues and service bus are often combined with blob storage. So in our case the runner uploads files (images) to blob storage and then sends the id (url) of the uploaded file in a queue message. ATS main reads the information from the queue message and downloads the file as needed.
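The combination of blob storage and a queue can be sketched as follows. A dictionary and a list stand in for the two Azure services, and the `blob://` url scheme, the function names and the message fields are invented for this example.

```python
import json
import uuid

MAX_MESSAGE_BYTES = 64 * 1024   # queue message limit mentioned above

blob_store = {}   # stand-in for blob storage: url -> bytes
queue = []        # stand-in for the queue: list of JSON strings


def send_with_attachment(job_id: str, payload: bytes) -> None:
    """Upload the payload first, then enqueue a small message pointing to it."""
    url = f"blob://screenshots/{uuid.uuid4()}"   # hypothetical url scheme
    blob_store[url] = payload
    message = json.dumps({"job_id": job_id, "blob_url": url, "size": len(payload)})
    if len(message.encode("utf-8")) > MAX_MESSAGE_BYTES:
        raise ValueError("message too large even without the payload")
    queue.append(message)


def receive_with_attachment() -> tuple:
    """Read the next message and download the payload it references."""
    message = json.loads(queue.pop(0))
    return message["job_id"], blob_store[message["blob_url"]]


image = b"\x89PNG" + bytes(500_000)   # roughly 500 KB, far above the limit
send_with_attachment("job-1", image)
job_id, data = receive_with_attachment()
```

Uploading before enqueueing matters: if the message were sent first, a fast consumer could try to download a file that does not exist yet.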
By combining all of this together we end up with the following architecture:
This can be generalized to any Mendix app and a component running in the cloud.
In this diagram we see that the Mendix runtime sends requests to a request queue (or service bus). Additionally, if any large data is needed as part of the request, it is stored beforehand in blob storage and linked in the message via its url.
The webapp or function in the cloud reads requests from the queue and processes them.
The result is then sent to a result queue. Again, more data can be stored in blob storage, linked in the message and later retrieved from the Mendix runtime to be used in the UI, for example.
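One detail this flow implies is that results arrive on a separate queue, so the Mendix side has to match each result to the request it belongs to. A minimal way to do that is a request id carried in both messages. In the sketch below the two `deque` objects stand in for the request and result queues, and all function and field names are assumptions for illustration.

```python
import json
import uuid
from collections import deque

request_queue = deque()
result_queue = deque()
pending = {}   # request_id -> action, remembered on the Mendix side


def submit(action: str) -> str:
    """Mendix side: enqueue a request and remember it until the result arrives."""
    request_id = str(uuid.uuid4())
    pending[request_id] = action
    request_queue.append(json.dumps({"request_id": request_id, "action": action}))
    return request_id


def worker_step() -> None:
    """Cloud side: process one request and publish its result."""
    request = json.loads(request_queue.popleft())
    result_queue.append(json.dumps(
        {"request_id": request["request_id"], "status": "done"}))


def collect() -> dict:
    """Mendix side: read one result and match it to the original request."""
    result = json.loads(result_queue.popleft())
    action = pending.pop(result["request_id"])
    return {"action": action, "status": result["status"]}


submit("fetch-model")
worker_step()
outcome = collect()
print(outcome)
```

Because the two sides only share message formats, either of them can be redeployed or scaled without the other noticing.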
Finally, let me say that even though the post is talking about Azure these basic services are part of the offering for all popular cloud providers. So the architecture above can be applied to other clouds as well, probably by just replacing the labels.
I hope you liked this post and that it helps you build better and more scalable apps!