Scaling Legacy Windows Applications in AWS
It’s been quite some time since I’ve posted something on Medium, but that isn’t to say I’ve been lazy. Quite the opposite! I recently completed a project at work that I felt could be valuable to others who are in the same position I was in: we needed to scale legacy Windows applications with domain-join requirements. This post highlights how I did it and ended up with a 30-second scaling time for legacy software.
Goal: Enable Auto-Scaling for our Legacy Windows Applications.
Problem: There are many configuration steps needed before one of these instances can even be added to an ASG. I’ll list the [major] steps here:
- Install Package Managers (e.g., Chocolatey)
- Join a Domain
- Configure Logging & Logging Agents
- Install IIS & Configure per Application
- Install & Configure Security Agents
- Deploy Code
To perform the steps listed above, spin-up time for a new box is approximately 35 minutes. Of course, that’s no good for Auto-Scaling: the end user would feel the impact for a minimum of 35 minutes, and by that point the issue might be resolved or, more likely, has snowballed and is now even worse while we wait on more 35-minute hosts to come online and battle the workload.
Possible Solutions: I explored many different solutions to this problem. Most folks would probably think of going with a pre-built image that includes all of the dependencies in advance to save on that build time. I had considered this option, but the domain join of a new instance is what kept it from being viable. With the required reboot after the join and computer rename, we were still looking at 6 minutes. Although 6 minutes is a massive improvement over 35, it’s still not acceptable for an auto-scaling event. I was looking for sub-minute availability.
The Real Solution: AWS Warm Pools. You can read all about them, but essentially a Warm Pool is a separate pool of instances that sits alongside your active ASG. Warm Pool instances are fully configured, domain-joined instances that are ready to be “called into action” whenever your active ASG has a scaling event. This eliminates the need for the active ASG to build a new box from scratch. You can specify whether you want your Warm Pool instances in a “stopped” or “running” state (although I don’t really understand the need for a “running” warm pool instance that’s costing money; might as well just add it to the active ASG!). We keep our warm pool instances stopped :) Let’s hit you with a diagram right off the bat!
So what does this look like?! Well, the warm pool has requirements just like your active ASG: it wants a minimum and maximum number of instances, and it will build or destroy instances to maintain that count.
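Here’s a minimal sketch of that wiring in Pulumi and TypeScript (the tooling we use for our infrastructure). The names, sizes, and IDs are placeholders rather than our production values; the important pieces are the warmPool block, kept in the “Stopped” state, and the launch lifecycle hook that holds each new instance until configuration finishes:

import * as aws from "@pulumi/aws";

const asg = new aws.autoscaling.Group("legacy-app-asg", {
    minSize: 2,
    maxSize: 6,
    vpcZoneIdentifiers: ["subnet-xxxxxxxx"], // your private subnets
    launchTemplate: { id: "lt-xxxxxxxx", version: "$Latest" },
    // A separate pool of pre-built, domain-joined instances, kept stopped
    // so they cost nothing but storage while they wait.
    warmPool: {
        poolState: "Stopped",
        minSize: 2,
        maxGroupPreparedCapacity: 6,
    },
    // Hold each launching instance in a wait state until the RunBook
    // reports back via CompleteLifecycleAction (CONTINUE or ABANDON).
    initialLifecycleHooks: [{
        name: "instance-config-hook",
        lifecycleTransition: "autoscaling:EC2_INSTANCE_LAUNCHING",
        defaultResult: "ABANDON",
        heartbeatTimeout: 3600,
    }],
});

A nice property of this setup is that the launch hook fires both when an instance enters the warm pool and when it later moves into the active ASG, so a box gets the full configuration treatment once, while warming, and only a short path to service when it’s called up.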
When a new instance is added to the ASG, a lifecycle hook is invoked and we trigger an SSM Automation RunBook. The RunBook lets us perform all of our configuration steps in a nice, organized manner that helps with error reporting and logging. Below is a sample RunBook, written in YAML, that we’re using in our project. We create our infrastructure using Pulumi and TypeScript, and this template utilizes the handlebars package to inject parameters from our code. Parameters shown with a leading backslash, like \{{ example }}, are escaped during the handlebars run so that the {{ example }} param is injected at SSM runtime and NOT during the Pulumi run. (A sketch of the handlebars rendering step follows the RunBook.)
description: |
*Configuration for Instances Entering ASG*
---
# Steps
1. Wait for SSM Agent
2. Install Chocolatey
3. Pre Domain Join Config
4. Join Domain
5. Configure Reporting
6. Branch Logging
7. (Conditional) Configure Logging
8. Branch IIS
9. (Conditional) Install & Configure IIS
10. Install and configure Security Agents
11. Install CodeDeploy Agent
12. Deploy Latest Code
schemaVersion: '0.3'
assumeRole: '\{{ AutomationAssumeRole }}'
parameters:
AutomationAssumeRole:
type: String
default: "{{ assumeRoleArn }}"
description: (Required) The ARN of the role that allows automation to perform the actions on your behalf.
InstanceId:
type: String
description: (Required) ID of the EC2 instance to configure
configLogging:
type: String
description: (Required) Determine if Logstash should be Updated
default: "{{ configLogstash }}"
configIIS:
type: String
description: (Required) Determine if IIS Needs to be Configured
default: "{{ configIIS }}"
mainSteps:
## STEP 1 ##########################################################
- name: "Wait_for_SSM_Agent"
description: SSM Agent Needs to be Ready
action: aws:waitForAwsResourceProperty
timeoutSeconds: 3600
onFailure: 'step:Return_Failure'
onCancel: 'step:Return_Failure'
inputs:
Service: ssm
Api: DescribeInstanceInformation
InstanceInformationFilterList:
-
key: InstanceIds
valueSet: ['\{{ InstanceId }}']
PropertySelector: "$..PingStatus"
DesiredValues:
- Online
isCritical: 'true'
nextStep: Install_Chocolatey
## STEP 2 ##########################################################
- name: Install_Chocolatey
description: Download & Install Chocolatey from S3
action: 'aws:runCommand'
onFailure: 'step:Return_Failure'
onCancel: 'step:Return_Failure'
maxAttempts: 3
inputs:
DocumentName: 'ChocoOfflineInstall'
InstanceIds:
- '\{{ InstanceId }}'
Parameters:
sourceInfo: "{\"path\":\"https://{{ exampleIacFilesBucket }}.s3-us-west-2.amazonaws.com/choco\"}"
nextStep: Pre_Domain_Join_Config
## STEP 3 ##########################################################
- name: Pre_Domain_Join_Config
description: Run Config Steps Prior to Joining Domain
action: 'aws:runCommand'
onFailure: 'step:Return_Failure'
onCancel: 'step:Return_Failure'
inputs:
DocumentName: 'examplePreDomainJoinConfig'
InstanceIds:
- '\{{ InstanceId }}'
nextStep: Join_Domain
## STEP 4 ##########################################################
- name: Join_Domain
description: Join Instance to Domain
action: 'aws:runCommand'
onFailure: 'step:Return_Failure'
onCancel: 'step:Return_Failure'
inputs:
DocumentName: 'exampleDomainJoin'
InstanceIds:
- '\{{ InstanceId }}'
OutputS3BucketName: exampleSsmRunbookLogs
OutputS3KeyPrefix: 'instanceConfigRunbook/join-domain'
nextStep: Configure_Reporting
## STEP 5 ##########################################################
- name: Configure_Reporting
description: Install & Configure Reporting Agents
action: 'aws:runCommand'
onFailure: 'step:Return_Failure'
onCancel: 'step:Return_Failure'
maxAttempts: 3
inputs:
DocumentName: 'exampleConfigureReporting'
InstanceIds:
- '\{{ InstanceId }}'
Parameters:
reportingEnvironment: '{{ reportingEnvParameter }}'
sourceInfo: "{\"path\":\"https://{{ exampleIacFilesBucket }}.s3-us-west-2.amazonaws.com/reporting\"}"
nextStep: Branch_Logging
## STEP 6 ##########################################################
- name: Branch_Logging
action: aws:branch
inputs:
Choices:
- NextStep: Configure_Logging
Variable: '\{{ configLogging }}'
StringEquals: "true"
- NextStep: Branch_IIS
Variable: '\{{ configLogging }}'
StringEquals: "false"
## STEP 7 ##########################################################
- name: Configure_Logging
description: Update the Logging Configuration
action: 'aws:runCommand'
onFailure: 'step:Return_Failure'
onCancel: 'step:Return_Failure'
maxAttempts: 3
inputs:
DocumentName: 'exampleConfigureLogging'
InstanceIds:
- '\{{ InstanceId }}'
Parameters:
sourceInfo: "{\"path\":\"https://{{ exampleIacFilesBucket }}.s3-us-west-2.amazonaws.com/logging\"}"
env: {{ env }}
OutputS3BucketName: exampleSsmRunbookLogs
OutputS3KeyPrefix: 'instanceConfigRunbook/configure-logging'
nextStep: Branch_IIS
## STEP 8 ##########################################################
- name: Branch_IIS
action: aws:branch
inputs:
Choices:
- NextStep: Install_Configure_IIS
Variable: '\{{ configIIS }}'
StringEquals: "true"
- NextStep: Configure_Security_Agents
Variable: '\{{ configIIS }}'
StringEquals: "false"
## STEP 9 ##########################################################
- name: Install_Configure_IIS
description: Install and Configure IIS
action: 'aws:runCommand'
onFailure: 'step:Return_Failure'
onCancel: 'step:Return_Failure'
maxAttempts: 1
inputs:
DocumentName: 'exampleConfigureIIS'
InstanceIds:
- '\{{ InstanceId }}'
Parameters:
sourceInfo: "{\"path\":\"https://{{ exampleIacFilesBucket }}.s3-us-west-2.amazonaws.com/iis\"}"
siteName: '{{ siteName }}'
OutputS3BucketName: exampleSsmRunbookLogs
OutputS3KeyPrefix: 'instanceConfigRunbook/install-configure-iis'
nextStep: Configure_Security_Agents
## STEP 10 #########################################################
- name: Configure_Security_Agents
description: Install & Configure Security Agents
action: 'aws:runCommand'
onFailure: 'step:Return_Failure'
onCancel: 'step:Return_Failure'
maxAttempts: 3
inputs:
DocumentName: 'exampleSecurityAgentsInstall'
InstanceIds:
- '\{{ InstanceId }}'
OutputS3BucketName: exampleSsmRunbookLogs
OutputS3KeyPrefix: 'instanceConfigRunbook/configure-security-agents'
nextStep: Install_CodeDeploy_Agent
## STEP 11 #########################################################
- name: Install_CodeDeploy_Agent
description: Install CodeDeploy Agent
action: 'aws:runCommand'
onFailure: 'step:Return_Failure'
onCancel: 'step:Return_Failure'
maxAttempts: 3
inputs:
DocumentName: AWS-ConfigureAWSPackage
InstanceIds:
- '\{{ InstanceId }}'
Parameters:
action:
- Install
installationType:
- In-place update
name:
- AWSCodeDeployAgent
nextStep: Deploy_Code
## STEP 12 #########################################################
- name: Deploy_Code
description: Deploy Code to new Instances
action: 'aws:runCommand'
onFailure: 'step:Return_Failure'
onCancel: 'step:Return_Failure'
timeoutSeconds: 1800
inputs:
DocumentName: 'exampleCodeDeployDeployment'
InstanceIds:
- '\{{ InstanceId }}'
Parameters:
instanceId: '\{{ InstanceId }}'
asgName: '{{ asgName }}'
s3DeploymentBucket: '{{ s3DeploymentBucket }}'
codeDeployName: '{{ codeDeployName }}'
codeDeployRoleArn: '{{ codeDeployRoleArn }}'
nextStep: Return_Complete
####################################################################
- name: Return_Complete
description: Send Completed LifecycleHook
action: aws:executeAwsApi
isEnd: true
inputs:
Service: autoscaling
Api: CompleteLifecycleAction
LifecycleActionResult: CONTINUE
AutoScalingGroupName: {{ asgName }}
InstanceId: '\{{ InstanceId }}'
LifecycleHookName: {{ lchName }}
####################################################################
- name: Return_Failure
description: Send Failure LifecycleHook
action: aws:executeAwsApi
isEnd: true
inputs:
Service: autoscaling
Api: CompleteLifecycleAction
LifecycleActionResult: ABANDON
AutoScalingGroupName: {{ asgName }}
InstanceId: '\{{ InstanceId }}'
LifecycleHookName: {{ lchName }}
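For context, here’s roughly what the Pulumi side of that looks like. This is a simplified sketch with placeholder names and values: we render the YAML template with handlebars, which strips the backslash escapes and leaves those params for SSM, then register the result as an SSM Automation document.

import * as fs from "fs";
import * as aws from "@pulumi/aws";
import Handlebars from "handlebars";

// Plain {{ ... }} params are injected now, during the Pulumi run;
// escaped \{{ ... }} params survive as literal {{ ... }} for SSM runtime.
const template = Handlebars.compile(
    fs.readFileSync("runbooks/instance-config.yaml.hbs", "utf8")
);
const runbookYaml = template({
    assumeRoleArn: "arn:aws:iam::123456789012:role/example-automation-role",
    exampleIacFilesBucket: "example-iac-files",
    asgName: "legacy-app-asg",
    lchName: "instance-config-hook",
    configLogstash: "true",
    configIIS: "true",
    // ...remaining params omitted for brevity...
});

// Register the rendered YAML as an SSM Automation document.
new aws.ssm.Document("instance-config-runbook", {
    documentType: "Automation",
    documentFormat: "YAML",
    content: runbookYaml,
});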
More Issues! Now that we have fully configured, domain-joined boxes sitting in a “stopped” state in our Warm Pool, ready to be called into service by the active ASG, what do we do about stale code?!
One of the drawbacks of Warm Pools is that the instances are NOT part of CodeDeploy deployments to the main active ASG. This can lead to stale code if a deployment has taken place between a warm pool instance spinning up and being called into service.
I solved this with a series of lifecycleHooks and lambda events.
The first step: when a server is first built using the RunBook referenced above, we create a unique CodeDeploy deployment group in the application, using the instance name as the group name. Then, when the instance first spins up, we look up the last successful code revision of the main deployment group and run a CodeDeploy deployment of that revision against the instance’s single-instance group, so the new warm pool instance matches the active ASG instances.
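Here’s a sketch of that first-boot catch-up logic using the AWS SDK for JavaScript v3. In our project this lives behind the exampleCodeDeployDeployment document from Step 12; the application name, main group name, and role ARN below are hypothetical placeholders, and the real version has more error handling.

import {
    CodeDeployClient,
    CreateDeploymentGroupCommand,
    ListDeploymentsCommand,
    GetDeploymentCommand,
    CreateDeploymentCommand,
} from "@aws-sdk/client-codedeploy";

const cd = new CodeDeployClient({});

export async function catchUpInstance(appName: string, instanceName: string) {
    // 1. One single-instance deployment group, named after the instance
    //    (assumes the group doesn't already exist).
    await cd.send(new CreateDeploymentGroupCommand({
        applicationName: appName,
        deploymentGroupName: instanceName,
        serviceRoleArn: "arn:aws:iam::123456789012:role/example-codedeploy-role",
        ec2TagFilters: [{ Key: "Name", Value: instanceName, Type: "KEY_AND_VALUE" }],
    }));

    // 2. Find the main group's most recent successful deployment
    //    (ListDeployments returns newest-first in practice; verify for your account).
    const { deployments } = await cd.send(new ListDeploymentsCommand({
        applicationName: appName,
        deploymentGroupName: "main-asg-group",
        includeOnlyStatuses: ["Succeeded"],
    }));
    if (!deployments?.length) return;
    const last = await cd.send(new GetDeploymentCommand({ deploymentId: deployments[0] }));

    // 3. Replay that revision against the new instance's private group.
    await cd.send(new CreateDeploymentCommand({
        applicationName: appName,
        deploymentGroupName: instanceName,
        revision: last.deploymentInfo?.revision,
    }));
}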
When a new CodeDeploy deployment against the main ASG completes successfully, an EventBridge rule is triggered and we invoke a Lambda that “starts” all of our warm pool instances and deploys to each of their uniquely named deployment groups.
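And here’s roughly how that EventBridge wiring looks in Pulumi. The application name is a placeholder, and example-warm-pool-refresh-lambda stands in for our Lambda that starts the stopped warm pool instances and fans out one deployment per instance group:

import * as aws from "@pulumi/aws";

// Look up the (hypothetical) refresh Lambda defined elsewhere in the stack.
const refreshLambda = aws.lambda.Function.get(
    "warm-pool-refresh", "example-warm-pool-refresh-lambda");

// Fires when a deployment to the main application succeeds.
const deploySucceeded = new aws.cloudwatch.EventRule("deploy-succeeded", {
    eventPattern: JSON.stringify({
        "source": ["aws.codedeploy"],
        "detail-type": ["CodeDeploy Deployment State-change Notification"],
        "detail": {
            "state": ["SUCCESS"],
            "application": ["example-code-deploy-app"],
        },
    }),
});

new aws.cloudwatch.EventTarget("warm-pool-refresh-target", {
    rule: deploySucceeded.name,
    arn: refreshLambda.arn,
});

new aws.lambda.Permission("allow-eventbridge-invoke", {
    action: "lambda:InvokeFunction",
    function: refreshLambda.name,
    principal: "events.amazonaws.com",
    sourceArn: deploySucceeded.arn,
});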
Why didn’t you just do a deploy when the box enters the ASG?
Great question, but our deployments take 10–15 minutes PER BOX, and that was, again, not a viable option for scaling activities.
So now, if you’re still following along and not bored, we have Warm Pool instances that stay up-to-date with our active ASG instances.
But wait, what happens in the rare case that a scaling event takes place DURING a CodeDeploy of the main deployment group?!
Another great question! I’ve solved this by invoking a Lambda when an instance moves from the Warm Pool to the active pool, which checks whether its code matches the active boxes. If it does, the instance is put right into service. If not, we DO perform a CodeDeploy deployment first. In this extremely rare case, we’d rather have GOOD code on the new box than a speedy box with errors.
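A sketch of that comparison, under the same assumptions as the earlier SDK example: because every instance has its own deployment group, we can compare the last successful revision of the instance’s group against the main group’s and only redeploy on a mismatch.

import {
    CodeDeployClient,
    ListDeploymentsCommand,
    GetDeploymentCommand,
} from "@aws-sdk/client-codedeploy";

const cd = new CodeDeployClient({});

// Identify a deployment group's last successful revision by S3 key + ETag.
async function lastRevision(appName: string, groupName: string) {
    const { deployments } = await cd.send(new ListDeploymentsCommand({
        applicationName: appName,
        deploymentGroupName: groupName,
        includeOnlyStatuses: ["Succeeded"],
    }));
    if (!deployments?.length) return undefined;
    const d = await cd.send(new GetDeploymentCommand({ deploymentId: deployments[0] }));
    const s3 = d.deploymentInfo?.revision?.s3Location;
    return s3 && `${s3.key}@${s3.eTag}`;
}

export async function isCurrent(appName: string, instanceName: string) {
    const [main, mine] = await Promise.all([
        lastRevision(appName, "main-asg-group"), // hypothetical main group name
        lastRevision(appName, instanceName),
    ]);
    return main !== undefined && main === mine; // false => deploy before serving
}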
It is worth mentioning that in the above scenario, AWS will query boxes in the active ASG and provide a final fail-safe to update any instances with out-of-date code.
PHEW! This was a lot. We now have instances that can be FULLY built and ready for action in roughly 30 SECONDS!!!
So now that we have instances built, what happens when they terminate?!
We solved this with another RunBook, invoked when an ASG termination lifecycleHook is triggered!
Let’s take a look at that. Steps we perform:
- We clear the instance’s tags, as we use a 3rd-party load balancer that queries for certain ASG tags to add AWS instances to the proper load-balancing pool
- Remove instance from Active Directory
- Remove from Load Balancer
- Clean up those uniquely named CodeDeploy deployment groups (a sketch of that cleanup Lambda follows the RunBook below).
description: |
*Configuration for Instances Terminating from ASG*
---
# Steps
1. Get AD Management Instance ID
2. Wait for SSM Agent
3. Remove F5 Tags
4. Remove AD Object
5. Remove F5 Node
6. Cleanup CodeDeploy Group
schemaVersion: '0.3'
assumeRole: '\{{ AutomationAssumeRole }}'
parameters:
AutomationAssumeRole:
type: String
default: "{{ assumeRoleArn }}"
description: (Required) The ARN of the role that allows automation to perform the actions on your behalf.
InstanceId:
type: String
description: (Required) ID of the EC2 instance being terminated
AdMgmtName:
default: "{{ adMgmtName }}"
type: String
description: Tag Name of Active Directory Management Instance
mainSteps:
## STEP 1 ##########################################################
- name: Describe_Management_Instance
description: Get AD MGMT Box Instance ID
action: aws:executeAwsApi
timeoutSeconds: 60
onFailure: 'step:Return_Failure'
onCancel: 'step:Return_Failure'
inputs:
Service: ec2
Api: DescribeInstances
Filters:
- Name: tag:Name
Values: ['\{{ AdMgmtName }}']
outputs:
- Name: InstanceIds
Selector: "$.Reservations..Instances..InstanceId"
Type: StringList
nextStep: Wait_for_SSM_Agent
## STEP 2 ##########################################################
- name: Wait_for_SSM_Agent
description: SSM Agent Needs to be Ready
action: aws:waitForAwsResourceProperty
timeoutSeconds: 120
onFailure: 'step:Return_Failure'
onCancel: 'step:Return_Failure'
inputs:
Service: ssm
Api: DescribeInstanceInformation
InstanceInformationFilterList:
-
key: InstanceIds
valueSet: ['\{{ Describe_Management_Instance.InstanceIds }}']
PropertySelector: "$..PingStatus"
DesiredValues:
- Online
isCritical: 'true'
nextStep: Remove_LB_Tags
## STEP 3 #########################################################
- name: Remove_LB_Tags
description: Set Current LB Tags to Null
action: 'aws:createTags'
onFailure: 'step:Return_Failure'
onCancel: 'step:Return_Failure'
maxAttempts: 1
inputs:
ResourceType: EC2
ResourceIds:
- '\{{ InstanceId }}'
Tags:
- Key: t_asg
Value: 'null'
nextStep: Remove_AD_Object
## STEP 4 ##########################################################
- name: Remove_AD_Object
description: Remove Terminated Host Object from Active Directory
action: 'aws:runCommand'
onFailure: 'step:Return_Failure'
onCancel: 'step:Return_Failure'
maxAttempts: 1
inputs:
DocumentName: 'exampleAdObjectRemoval'
InstanceIds:
- '\{{ Describe_Management_Instance.InstanceIds }}'
Parameters:
InstanceId: '\{{ InstanceId }}'
env: {{ env }}
app: {{ app }}
OutputS3BucketName: exampleSsmRunbookLogs
OutputS3KeyPrefix: 'instanceTerminationRunbook/remove-ad-object'
nextStep: LB_Node_Removal
## STEP 5 ##########################################################
- name: LB_Node_Removal
description: Remove Node from Load Balancer
action: 'aws:invokeLambdaFunction'
onFailure: 'step:Return_Failure'
onCancel: 'step:Return_Failure'
maxAttempts: 1
timeoutSeconds: 120
inputs:
FunctionName: 'example-lb-node-removal-lambda'
InputPayload:
instance_id: '\{{ InstanceId }}'
lb_path: {{ lbPath }}
lb_pool: {{ lbPool }}
service_account: {{ serviceAccount }}
service_port: {{ servicePort }}
env: {{ env }}
nextStep: CodeDeploy_Cleanup
## STEP 6 ##########################################################
- name: CodeDeploy_Cleanup
description: Cleanup Unique Instance Deployment Group
action: 'aws:invokeLambdaFunction'
onFailure: 'step:Return_Failure'
onCancel: 'step:Return_Failure'
maxAttempts: 1
timeoutSeconds: 120
inputs:
FunctionName: 'example-code-deploy-instance-removal-lambda'
Payload: '{"applicationName":"{{ codeDeployName }}", "instanceId":"\{{ InstanceId }}"}'
nextStep: Return_Complete
####################################################################
- name: Return_Complete
description: Send Completed LifecycleHook
action: aws:executeAwsApi
isEnd: true
inputs:
Service: autoscaling
Api: CompleteLifecycleAction
LifecycleActionResult: CONTINUE
AutoScalingGroupName: {{ asgName }}
InstanceId: '\{{ InstanceId }}'
LifecycleHookName: {{ lchName }}
####################################################################
- name: Return_Failure
description: Send Failure LifecycleHook
action: aws:executeAwsApi
isEnd: true
inputs:
Service: autoscaling
Api: CompleteLifecycleAction
LifecycleActionResult: ABANDON
AutoScalingGroupName: {{ asgName }}
InstanceId: '\{{ InstanceId }}'
LifecycleHookName: {{ lchName }}
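The CodeDeploy cleanup Lambda from Step 6 is tiny; here’s a sketch, assuming the deployment group is simply named after the instance (in our case the name is derived from the instance ID, and that derivation is elided here):

import {
    CodeDeployClient,
    DeleteDeploymentGroupCommand,
} from "@aws-sdk/client-codedeploy";

const cd = new CodeDeployClient({});

// Invoked by the CodeDeploy_Cleanup step with
// {"applicationName": "...", "instanceId": "..."}.
export const handler = async (event: { applicationName: string; instanceId: string }) => {
    const deploymentGroupName = event.instanceId; // hypothetical: derive the real group name here
    await cd.send(new DeleteDeploymentGroupCommand({
        applicationName: event.applicationName,
        deploymentGroupName,
    }));
};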
The Active Directory removal was a tricky one. We have “always-on” management instances, each running the SSM agent. From one of those boxes we pull down an AD credential from SSM Parameter Store that has domain privileges to remove computer objects, and we inject those credentials into a PowerShell doc to remove the computer object.
Write-Host 'Removing AD Object...'
# ...we do a bunch of stuff here to generate the proper computer name from our AWS instance ID...
$adComputer = "$svc_name-$uid".ToUpper()
$ssmParamName = "/AD/ServiceAccount/sa.example"
$adPwd = (Get-SSMParameterValue -Name "$ssmParamName" -WithDecryption $true).Parameters.Value
$adUser = "domain\sa.example"
$securePwd = $adPwd | ConvertTo-SecureString -AsPlainText -Force
$adCredential = New-Object System.Management.Automation.PSCredential($adUser, $securePwd)
Get-ADComputer $adComputer | Remove-ADComputer -Credential $adCredential -Confirm:$false
Of course, the code shown here is all sample code to get you started. We have more detailed error checking and handling.
So, to recap, here are the EXTRA AWS resources needed to Auto-Scale Legacy Windows Apps:
- WARM POOLS
- SSM Automation RunBooks
- LifecycleHooks
- EventBridge Rules
- Lambdas to handle those events
If you have any questions on all the work behind the scenes to complete this, I’d be happy to share more code examples on accomplishing all that we did.
We now have Auto-Scaling for 10+ legacy Windows applications, the ones they said could never be scaled, and we do it in under a minute!!