Windows EKS Nodes & Quick Scaling

Mark Bixler
4 min read · Nov 2, 2023


In a previous post, I went over the effects of using AWS Warm Pools for legacy Windows applications.

For Windows EKS nodes, we take that same concept and “take it up a notch,” leveraging Warm Pools for faster launching of EKS Windows nodes.

Prior to Warm Pools, our servers were built using UserData and took ~20 minutes to build. Because of this, we set scaling schedules to ramp up in the morning and back down in the evening, as we couldn’t respond quickly to workload demands. That meant paying for over-provisioned nodes during hours when the Auto Scaling group could otherwise have been scaled in.

The challenges of leveraging Warm Pools for EKS nodes:

  • Migrating away from UserData to SSM Automation Docs.
  • Only joining the cluster at spin up time.
  • Mixed Instance Types for Instance Capacity availability.

Migrating away from UserData to SSM Automation Docs

Our Windows EKS nodes were originally built using EC2 UserData to configure the nodes at launch time. UserData had a few drawbacks that led us to look for alternative methods for configuring our nodes.

1.) Limited payload size.

2.) No good versioning mechanism.

3.) Awkward handling of server reboots.

4.) Limited logging capabilities.

SSM Automation lets us run our various configuration steps sequentially, skip steps with conditionals, send our logs to a central CloudWatch log group, and version control both the Automation RunBook itself and the run documents it calls.
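As a rough sketch of what that buys us, here is the general shape of an SSM Automation document body: sequential steps, a conditional branch for steps we want to skip, and central CloudWatch logging on each command step. The step names, parameter names, and run document name below are illustrative, not our actual runbook.

```typescript
// Hypothetical sketch of an SSM Automation document body (schemaVersion 0.3).
// Step names, parameters, and "MyJoinClusterRunDocument" are illustrative.
const automationDocContent = {
  schemaVersion: "0.3",
  description: "Configure a Windows EKS node (illustrative sketch)",
  parameters: {
    InstanceId: { type: "String" },
    SkipClusterJoin: { type: "String", default: "false" },
  },
  mainSteps: [
    {
      name: "InstallBaseSoftware",
      action: "aws:runCommand",
      inputs: {
        DocumentName: "AWS-RunPowerShellScript",
        InstanceIds: ["{{ InstanceId }}"],
        Parameters: { commands: ["Write-Host 'install base software here'"] },
        // Central logging: every step ships its output to one log group.
        CloudWatchOutputConfig: {
          CloudWatchLogGroupName: "/ssm/windows-eks-build",
          CloudWatchOutputEnabled: true,
        },
      },
    },
    {
      // aws:branch gives us the conditionals: skip the cluster join when asked.
      name: "MaybeJoinCluster",
      action: "aws:branch",
      inputs: {
        Choices: [
          {
            NextStep: "JoinCluster",
            Variable: "{{ SkipClusterJoin }}",
            StringEquals: "false",
          },
        ],
      },
      isEnd: true,
    },
    {
      name: "JoinCluster",
      action: "aws:runCommand",
      inputs: {
        DocumentName: "MyJoinClusterRunDocument", // hypothetical run document
        InstanceIds: ["{{ InstanceId }}"],
      },
    },
  ],
};

console.log(automationDocContent.mainSteps.length); // 3
```

Because the document content is just data in our IaC repo, every change to it is versioned in source control, and SSM keeps its own document versions on each update.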

Only joining the cluster at spin up time

One of the initial struggles during the design of our Warm Pool setup was how to join a node to the cluster only once it’s in the “active” Auto Scaling group pool. To do this, we leveraged ASG lifecycle hooks.
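The lifecycle events that EC2 Auto Scaling emits for warm pool instances carry `Origin` and `Destination` fields, which is what makes this decision possible. A minimal sketch of the decision logic in the hook handler (the function name is ours, purely illustrative):

```typescript
// Sketch of the lifecycle-hook decision: only run the cluster-join step when
// the instance is headed for the active ASG pool, never on entry to the warm
// pool. Origin/Destination values come from the EC2 Auto Scaling warm pool
// lifecycle event payload ("EC2", "WarmPool", "AutoScalingGroup").
interface LifecycleDetail {
  Origin: "EC2" | "WarmPool";
  Destination: "AutoScalingGroup" | "WarmPool";
}

function shouldJoinCluster(detail: LifecycleDetail): boolean {
  // Entering the warm pool: configure the node, but do NOT join the cluster.
  if (detail.Destination === "WarmPool") return false;
  // Entering the active pool (direct launch or warm-pool promotion): join now.
  return detail.Destination === "AutoScalingGroup";
}

// Direct launch straight into the active pool -> join
console.log(shouldJoinCluster({ Origin: "EC2", Destination: "AutoScalingGroup" })); // true
// Launch into the warm pool -> configure only
console.log(shouldJoinCluster({ Origin: "EC2", Destination: "WarmPool" })); // false
// Warm instance promoted into the active pool -> join
console.log(shouldJoinCluster({ Origin: "WarmPool", Destination: "AutoScalingGroup" })); // true
```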

We created a specific SSM Run Document that runs the pre-defined AWS EKS PowerShell bootstrap script, injecting our own runtime parameters for our node groups.

We do some abstraction and create a PowerShell module, but at its core, the command looks something like this:

"C:\\Program Files\\Amazon\\EKS\\Start-EKSBootstrap.ps1" -EKSClusterName "my-cluster" -kubeletExtraArgs --max-pods=49 --node-labels=eks.amazonaws.com/nodegroup=windows,dedicated=windows2022,... all the other things

We have a separate SSM Automation RunBook that handles the transition between the Warm Pool and the active pool. Of course, instances can also build directly into the active pool, so we have conditionals for that. Using EventBridge, we can inspect the source and destination pools for an EC2 instance and set parameters accordingly. In this scenario, our main build RunBook can invoke the same SSM document to join the cluster.

To do this step successfully, we needed kubectl installed on each node so we could run a verification command confirming the node had actually joined the cluster.
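The verification itself is just string matching over `kubectl get nodes` output. A sketch of what that check looks like (the helper name is ours, not an AWS or Kubernetes API):

```typescript
// Sketch of the node-join verification: parse `kubectl get nodes` output and
// confirm our node shows up with status Ready. Illustrative helper only.
function nodeIsReady(kubectlOutput: string, nodeName: string): boolean {
  return kubectlOutput.split("\n").some((line) => {
    const cols = line.trim().split(/\s+/);
    // Column 0 is NAME, column 1 is STATUS in `kubectl get nodes` output.
    return cols[0] === nodeName && cols[1] === "Ready";
  });
}

const sample = [
  "NAME                         STATUS   ROLES    AGE   VERSION",
  "ip-10-0-1-23.ec2.internal    Ready    <none>   5m    v1.24.10",
  "ip-10-0-1-99.ec2.internal    NotReady <none>   1m    v1.24.10",
].join("\n");

console.log(nodeIsReady(sample, "ip-10-0-1-23.ec2.internal")); // true
console.log(nodeIsReady(sample, "ip-10-0-1-99.ec2.internal")); // false
```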

Mixed Instance Types for Instance Capacity availability

[Diagram: mixed instance workflow]

The final major hurdle we had to overcome was limited physical capacity in our AWS Local Zone. We currently run in the LAX Local Zone, and we run some MASSIVE instance types for reasons I won’t get into here. Because of this, there are times when we want to launch an m5.16xlarge instance and get an ICE event.

“If AWS doesn’t currently have enough available On-Demand capacity to complete your request, then you receive the following InsufficientInstanceCapacity error.” This is an ICE (Insufficient Capacity Error) event.

Auto Scaling groups without Warm Pools let you specify alternative instance types to fall back on in situations like this. You lose that ability once Warm Pools are introduced.

To solve this, we created an EventBridge rule that watches our CloudTrail logs for ICE events and filters them by a name prefix so the rule matches only our Windows EKS node groups, not every ASG. Here is a snippet of how this EventBridge pattern is defined in our IaC (Infrastructure as Code) tool, Pulumi:

const mixedInstancesEventPattern = {
  source: ['aws.ec2'],
  'detail-type': ['AWS API Call via CloudTrail'],
  detail: {
    eventSource: ['ec2.amazonaws.com'],
    eventName: ['RunInstances'],
    errorCode: ['Server.InsufficientInstanceCapacity'],
    requestParameters: {
      tagSpecificationSet: {
        items: {
          tags: {
            key: ['aws:autoscaling:groupName'],
            value: [{ prefix: 'winEksWorkers-1-24-2022-lax' }],
          },
        },
      },
    },
  },
};

This EventBridge rule and pattern trigger a Lambda that takes in a list of alternate instance types we’d like to use. From there, we perform the following steps, as shown in the diagram above:

1.) Temporarily suspend any scaling activities.

2.) Attempt to create a new EC2 On-Demand Capacity Reservation (ODCR) for an alternate instance type.

3.) If the reservation succeeds, immediately cancel it (releasing the capacity back); we now know that type is available.

3a.) If the reservation fails, loop back and try another instance type.

4.) Update the launch template to use the successful ODCR instance type and set that version as the default.

5.) Replace any “Warmed:Stopped” Warm Pool instances with the new instance type.

6.) Resume scaling activities.
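The capacity-probe loop in steps 2–3a above can be sketched as a small pure function. The `tryReserve` callback here is a stand-in for the actual CreateCapacityReservation / CancelCapacityReservation calls in our Lambda; the function and parameter names are illustrative:

```typescript
// Sketch of the probe loop from steps 2-3a: try an On-Demand Capacity
// Reservation for each candidate type; the first type whose reservation
// succeeds wins (the real Lambda immediately cancels the reservation and
// updates the launch template). tryReserve is an injected stand-in for the
// CreateCapacityReservation API call.
function findAvailableInstanceType(
  candidates: string[],
  tryReserve: (instanceType: string) => boolean,
): string | null {
  for (const instanceType of candidates) {
    if (tryReserve(instanceType)) {
      return instanceType; // reservation succeeded -> capacity exists
    }
    // reservation failed (ICE) -> loop back and try the next type (step 3a)
  }
  return null; // no candidate has capacity; leave the ASG untouched
}

// Stub: pretend only m5.12xlarge currently has capacity in the Local Zone.
const available = new Set(["m5.12xlarge"]);
const pick = findAvailableInstanceType(
  ["m5.16xlarge", "m5.12xlarge", "m5.8xlarge"],
  (t) => available.has(t),
);
console.log(pick); // "m5.12xlarge"
```

Probing with a reservation rather than an actual launch is the key design choice: it confirms capacity without paying for an instance we would immediately terminate.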

Voilà: we are now running on a different instance type.

We send ourselves Slack messages during these ICE events confirming the successful switch to the new instance type.

Thus far, the only drawback to this approach is that the default instance type we define in code may not match the updated instance type currently set as $DEFAULT in the launch template. This creates drift during any new infrastructure updates.

We have some potential solutions for this that are yet to be implemented. Since every instance type in the candidate list is “approved” for our needs, we’re OK with the setup as-is.

Summary

There you have it. We’ve taken a 20-minute EKS node launch time down to 2.5–3 minutes by leveraging AWS Warm Pools, SSM Automation, and a bit of engineering with EventBridge rules and Lambdas to add mixed instance types.


Written by Mark Bixler

Platform Architect @mindbody. Passion for automating my work and for trolling my friends