Disaster Recovery in AWS

By Scott Ross

In this post we are going to cover considerations a unit implementing AWS should be aware of when moving into the cloud.  Developing a disaster recovery (DR) plan for the cloud encompasses the same considerations as an on-premise DR plan, but there are obvious differences between on-premise and cloud environments.  These differences usually manifest themselves in questions like: who is responsible, how does a unit implement a DR solution, and how does one's mindset change when thinking about DR in the cloud?

It is true that DR in the cloud is more robust than on campus (another blog post…), but with this power comes more responsibility.  More tools are at the disposal of your development/operations team, and those tools are more flexible and dynamic. There are more architectural patterns, and some of those patterns differ dramatically from the on-premise solutions we have a long history with.  And, of course, you are now dependent on more factors – networking, Amazon itself, and many, many others.

Let’s start by reintroducing what disaster recovery is, and what it is not.

Disaster Recovery (DR) is defined as the process, policies, and procedures that are related to preparing for recovery or continuation of technology infrastructure, which are vital to an organization after a natural or human induced crisis.

DR is not backups.
DR complements other high availability (HA) services, but while HA deals with disaster prevention, DR is for those times when prevention has failed.

Disaster Recovery planning is always a trade-off between Recovery Time Objective (RTO) and Recovery Point Objective (RPO) vs. cost/complexity.  At a very basic level, DR cannot be done without input from your business partners – they will need to be aware of Risks/Rewards, and thus will drive appropriate IT strategy for disaster recovery (in another post, we will discuss ways you can categorize your services for DR).

RTO, RPO and Data Replication are key conceptual terms when dealing with DR.  Here is a brief introduction:

Recovery Point Objective (RPO) – the maximum targeted period in which data might be lost from an IT service due to a major incident.

Recovery Time Objective (RTO) – the targeted duration of time and service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.

Data Replication – sharing information so as to ensure consistency between redundant resources.

 

AWS Basics 

  • AWS Regions
    • 9 Regions worldwide
    • Completely independent with different teams and infrastructure
  • AWS Availability Zones (AZ)
    • Each region contains 1 or more AZs
    • Physically separated, but in the same geographical location
    • Shared teams and software infrastructure
  • US-EAST-1 (region)
    • Located in Virginia
    • 5 availability zones (as of October 2016)
  • AWS uses dynamic resource allocation
    • Pay for resources as you go
    • Create/destroy resources quickly using the dashboard or various CLI tools

Example AWS Disasters

  • Single-Resource Disaster

What is it?  A single resource stops functioning … (EBS, ELB, EC2..)

How to prepare?  Make sure that no single resource is a point of failure.  Use AMIs and autoscaling to help you make stateless instances (or use Docker!).  Configure RAID for volumes.  Use AWS managed services (like RDS).
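For instance, a minimal boto3 sketch of that setup, with the AMI, subnets, and names below as placeholders, might look like this:

import boto3

autoscaling = boto3.client('autoscaling', region_name='us-east-1')

# Launch configuration based on a pre-baked, stateless AMI (placeholder ID).
autoscaling.create_launch_configuration(
    LaunchConfigurationName='myapp-lc',
    ImageId='ami-12345678',
    InstanceType='t2.small',
)

# Auto Scaling group spanning two subnets, so a failed instance is replaced
# automatically and no single instance is a point of failure.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName='myapp-asg',
    LaunchConfigurationName='myapp-lc',
    MinSize=2,
    MaxSize=4,
    VPCZoneIdentifier='subnet-aaaa1111,subnet-bbbb2222',
    LoadBalancerNames=['myapp-elb'],
    HealthCheckType='ELB',
    HealthCheckGracePeriod=300,
)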

  • Single-AZ Disaster

What is it?  A whole AZ goes down

How to prepare?  Build your system so that it's spread across multiple AZs and can survive the loss of any single AZ.  Connect subnets in different AZs to your ELB and turn on Multi-AZ for RDS.
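As a small illustration, turning on Multi-AZ for an existing RDS instance is a single boto3 call (the instance identifier is a placeholder):

import boto3

rds = boto3.client('rds', region_name='us-east-1')

# Convert an existing instance to Multi-AZ; the change is applied during the
# next maintenance window unless ApplyImmediately is set to True.
rds.modify_db_instance(
    DBInstanceIdentifier='myapp-db',
    MultiAZ=True,
    ApplyImmediately=False,
)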

  • Single-service Disaster

What is it?  A whole AWS service (such as RDS) goes down across the entire region.

How to prepare?  First, resist the urge to use AWS resources for everything (but remember this is a cost/complexity question as well).  Be ready to recreate your system in a different region (by using automation).  Invest in the upfront preparation to do this (the VPC setup, networking, etc., that need to be in place to turn on a region).

  • Whole-Region disaster

What is it?  A whole AWS region goes down taking all the applications running on it with it. 

How to prepare? Implement a cross-region DR methodology.  Take snapshots of your instances and copy them to different regions.  Use automation tools (CloudFormation, Ansible, Chef, etc.) to define your application stack.  Copy AMIs to a different region.  Understand the AWS services you are using and their ability to be used in different regions (RDS, for example).
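For example, copying an AMI and an EBS snapshot into a recovery region with boto3 might look roughly like this (all IDs are placeholders):

import boto3

# The copy calls are made from the destination (recovery) region.
ec2_dr = boto3.client('ec2', region_name='us-west-2')

ec2_dr.copy_image(
    Name='myapp-ami-dr-copy',
    SourceImageId='ami-12345678',
    SourceRegion='us-east-1',
)

ec2_dr.copy_snapshot(
    SourceRegion='us-east-1',
    SourceSnapshotId='snap-0123456789abcdef0',
    Description='DR copy of myapp data volume',
)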

Disaster Recovery Patterns 

  • Backup and Restore
    • Advantage
      • Simple to get started
      • Very cost effective
    • Disadvantage
      • Likely not real time
      • Downtime will occur
    • Preparation Phase
      • Take backups of current systems (and schedule these backups)
      • Use S3 as the storage mechanism
      • Describe procedure to restore
    • Process…
      • Retrieve backups from S3
      • Bring up required infrastructure
        • EC2 instances with prepared AMIs, load balancing, etc.
      • Restore system from backup
      • Switch over to the new system
    • Objectives…
      • RTO: As long as it takes to bring up infrastructure and restore system from backups
      • RPO: From last backup
  • “Pilot Light” in different regions
    • Advantages:
      • Reduces RTO and RPO
      • Resources just need to be turned on
    • Disadvantage:
      • Cost
    • Preparation
      • Enable replication of data across regions
      • Automation of services in backup region
      • Switch over DNS to other region when downtime occurs
  • Fully working low-capacity standbys
    • Advantages:
      • Reduces RTO and RPO
      • Resources are already running and just need to be scaled up
      • Can handle a portion of production traffic
    • Disadvantage:
      • Cost
      • All necessary components are running 24/7
    • Preparation
      • Enable replication of data across regions
      • Automation of services in backup region
      • Switch over DNS to the other region when downtime occurs (see the Route 53 sketch after this list)
  • Multi-Site hot backup
    • Same as the low-capacity standby, but fully scaled…
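One common building block for the “switch over DNS” step in the pilot-light and standby patterns is Route 53 failover routing: a health check watches the primary endpoint, and a secondary record serving the recovery region takes over when the primary is unhealthy. A rough boto3 sketch, with the hosted zone, health check, and endpoint names as placeholders:

import boto3

route53 = boto3.client('route53')

# PRIMARY answers while its health check passes; SECONDARY is served only
# when the primary is reported unhealthy.
route53.change_resource_record_sets(
    HostedZoneId='Z1EXAMPLE',
    ChangeBatch={'Changes': [
        {'Action': 'UPSERT',
         'ResourceRecordSet': {
             'Name': 'app.example.com.', 'Type': 'CNAME', 'TTL': 60,
             'SetIdentifier': 'primary', 'Failover': 'PRIMARY',
             'HealthCheckId': 'abcdef01-2345-6789-abcd-ef0123456789',
             'ResourceRecords': [{'Value': 'prod-elb.us-east-1.elb.amazonaws.com'}]}},
        {'Action': 'UPSERT',
         'ResourceRecordSet': {
             'Name': 'app.example.com.', 'Type': 'CNAME', 'TTL': 60,
             'SetIdentifier': 'secondary', 'Failover': 'SECONDARY',
             'ResourceRecords': [{'Value': 'dr-elb.us-west-2.elb.amazonaws.com'}]}},
    ]},
)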

Tooling and Consistency

There is lots of tooling in the industry to help implement HA and DR.  Here, we are going to cover some of the tools used on campus to build infrastructure as code and the automation that's critical to successful DR plans.  People on campus know these tools:

  • AWS Cloudformation
  • Configuration Management: Ansible, Chef, Puppet
  • Continuous Integration/deployment: Jenkins
  • Containerization: Docker
  • Preferred Languages / AWS SDK: Ruby, Python

For more, see https://confluence.cornell.edu/display/CLOUD/Cloud+Service+Matrix
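As a small illustration of why infrastructure as code matters for DR, the same CloudFormation template can be launched in a recovery region with a few lines of boto3 (the template file, stack name, and parameters here are placeholders):

import boto3

# Point the CloudFormation client at the recovery region and launch the
# same stack definition used in production.
cfn = boto3.client('cloudformation', region_name='us-west-2')

with open('myapp-stack.yaml') as template:
    cfn.create_stack(
        StackName='myapp-dr',
        TemplateBody=template.read(),
        Parameters=[{'ParameterKey': 'Environment', 'ParameterValue': 'dr'}],
        Capabilities=['CAPABILITY_IAM'],
    )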

AWS services DR checklists

Finally, these are some basic checklists that might help you as you discover the best strategies to implement DR.  

RDS DR Check List:

  • Set up production instances in a multi-AZ architecture
  • Enable RDS daily backups
  • Create a cross-region read replica in the recovery region of your choice (remember to consider your application in this! Data without an app doesn't work!); a boto3 sketch follows this checklist
  • Maintain DB settings in both regions (better yet – script those settings! Use infrastructure as code!)
  • Make sure you have the appropriate alerting/monitoring set up, especially to monitor sync errors between the regions
  • Schedule drills!  Create failovers!  Practice!
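The boto3 sketch referenced above, with the identifiers and source ARN as placeholders:

import boto3

# Create the replica from the recovery region; for cross-region replicas the
# source instance must be referenced by its ARN.
rds_dr = boto3.client('rds', region_name='us-west-2')

rds_dr.create_db_instance_read_replica(
    DBInstanceIdentifier='myapp-db-replica',
    SourceDBInstanceIdentifier='arn:aws:rds:us-east-1:123456789012:db:myapp-db',
    SourceRegion='us-east-1',
)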

Basic Checklist:

  • Design for failure
  • Place services in multiple availability zones
  • Use a de-coupled architecture to reduce single points of failure
  • Use load balancers, autoscaling, and health monitoring rules for HA
  • Use CloudFormation templates
  • Maintain up-to-date AMIs
  • Back up critical data sets to S3
  • Run “warm” servers in other regions for recovery
  • Have a written plan
  • Test that plan often!

Some Takeaways – 

  1. Design DR (& HA) into your architecture when moving to the cloud. You can “lift and shift”, but you will likely not be taking advantage of the cloud and will still be susceptible to the same disaster risks as on premise.
  2. Take advantage of what AWS offers you and use those as your building blocks.
  3. However, understand the impact of relying on these services and building blocks. Don’t treat AWS as a ‘black box’.
  4. Exercise your DR solution!!!! (that’s one of the many benefits of the cloud; we can do this now)

Configure Jenkins to use Cornell Shibboleth Authentication

by Brett Haranin

Introduction

At RAIS, Jenkins has become integral to our development workflow.  Developers use it throughout the day to view the output of CI build/test jobs and to deploy application changes.  The basic Jenkins user management tools worked great when only one or two developers were using it; however, as we grew our team and usage of Jenkins, we wondered if there might be a way to integrate with Cornell central authentication.  We're happy to report that there is a SAML plugin for Jenkins that works well with the Cornell Shibboleth service.

This post will focus on the “how” of setting up Shibboleth with Jenkins.  For a deeper look at Shibboleth and how it works, see Shawn’s post here: https://blogs.cornell.edu/cloudification/2016/07/11/using-cornell-shibboleth-for-authentication-in-your-custom-application

Getting Started – Basic Test IdP Setup

Add and configure plugin

The first step is to install the Jenkins SAML plugin under Manage Jenkins -> Manage Plugins. The plugin information page is available here: https://wiki.jenkins-ci.org/display/JENKINS/SAML+Plugin. Note, v0.6 or higher is required for compatibility with Cornell Shibboleth.


Next, let’s configure the plugin. Go to Manage Jenkins -> Configure Global Security. Scroll to “Access Control” as shown here:

[Screenshot: Access Control section of Configure Global Security]

For this step, the IdP metadata field should be filled with the Cornell IdP test metadata available here: https://shibidp-test.cit.cornell.edu/idp/shibboleth.  Just copy it and paste it in the text box.

Fill in the Cornell Shibboleth attributes that best map to the Display Name, Group and Username attributes — the attributes we used are:

displayName: urn:oid:2.16.840.1.113730.3.1.241
edupersonprimaryaffiliation (for group): urn:oid:1.3.6.1.4.1.5923.1.1.1.5
uid (netid): urn:oid:0.9.2342.19200300.100.1.1

Finally, if you are using matrix/project based authorization (we recommend it!) ensure that usernames are lowercase netids.  Also, make sure that you have at least one user (presumably yours, if you are doing administration) with administrative rights to Jenkins.  After saving these settings, you will no longer be able to login with Jenkins usernames/passwords.

Save and Test

To save these settings, click “Apply”.  Then, navigate to your Jenkins instance in a fresh browser (perhaps Incognito/Private Browsing).  You should be redirected to a CUWebAuth screen for the test shibboleth instance.

[Screenshot: CUWebAuth TEST INSTANCE login screen]

After logging in, you should be redirected back to Jenkins and your session should be mapped to the user in Jenkins that matches your netid.  There are some notable limitations to the Shibboleth test system: newer employees (hired in the last several months) will not yet be synced to the TEST directory (i.e., they won’t be able to login), and the login screen will show “TEST INSTANCE” as pictured above.  In the next section, we’ll move to the PROD system for full functionality.

Next Steps – Move to Production IdP

Now that you’ve implemented and tested against the TEST instance of Shibboleth, the next step is to register your application with Cornell Identity Management so that you can use the PROD login systems.

Generate Metadata

First, you’ll need to output the metadata for your Jenkins SAML service provider.  To do that, go back to Manage Jenkins -> Configure Global Security.  Then click “Service Provider Metadata” (highlighted below).

[Screenshot: Service Provider Metadata link in Configure Global Security]

This will output a block of XML that looks like this, which is the metadata for your Jenkins SAML service provider:

[Screenshot: example service provider metadata XML]

Register metadata with Cornell Identity Management

Next, we need to register this metadata with Identity Management (IDM).

Go to: https://shibrequest.cit.cornell.edu/

Fill out the initial page of contact information and choose the scope of netid roles (Staff, Faculty, etc) that should be allowed to use the tool.  Note – ultimately Jenkins will authorize users based on their netid — the settings here are simply another gate to prevent folks that aren’t in specified roles or groups from logging in at all (i.e., their assertion will never even be sent to Jenkins).  You can safely leave this blank if you don’t have specific restrictions you want to apply at the CUWebAuth level.

On the second page, specify that the metadata has not been published to InCommon, then paste your metadata into the provided box:

[Screenshot: metadata submission form]

Click Next and submit your request.  Submitting the request will open a ticket with Identity Management and you should receive confirmation of this via email.  Once your application is fully registered with Cornell Shibboleth, you will receive confirmation from IDM staff (normally a quick process ~1 day).

Tell Jenkins to use PROD IdP

Finally, once you receive confirmation from IDM that your application is registered, you’ll need to update Jenkins to use the production IDP metadata.

Go here to get the PROD Shibboleth metadata: https://shibidp.cit.cornell.edu/idp/shibboleth

Then, in Jenkins, return to Manage Jenkins -> Configure Global Security.  Paste the metadata in the IdP metadata block (everything else can stay the same):

[Screenshot: IdP metadata field]

Save, then test that you are able to login.  This time you should see the regular CUWebAuth screen, rather than the “TEST INSTANCE” version.

Advanced Setup Options/Notes

  • It is possible to configure your endpoint metadata to request additional attributes from the Cornell directory.  If you would like to map a different value for group and then use Jenkins group permissions, you can work with IDM to get the desired attribute included in SAML assertions sent to your Jenkins endpoint, then specify those attributes on the Configure Global Security screen.
  • The underlying SAML library can be very sensitive (with good reason) about any mismatch between the stated URL in a user’s browser, and the URL that the webserver believes it is running at.  For example, if you terminate SSL at an ELB, a user may be visiting https://yourjenkins/jenkins/, but the webserver will be listening on port 80 and using the http scheme (i.e., http://yourjenkins/jenkins/).  This manifests with an error like “SAML message intended destination endpoint did not match recipient endpoint”.  Generally, the fix for this is to tell the Tomcat connector that it is being proxied (proxyPort and scheme attributes).  More here: http://beckje01.com/blog/2013/02/03/saml-matching-endpoints-with-tomcat/

That’s It

Now your team should be able to login to Jenkins with their Cornell NetID and Password.  Additionally, if using DUO, access will be two-factor, which is a great improvement.

For more information about Cornell Shibboleth, see the Confluence page here: https://confluence.cornell.edu/display/SHIBBOLETH/Shibboleth+at+Cornell+Page

Docker + Puppet = Win!

by Shawn Bower

On many of our Cloudification projects we use a combination of Docker and Puppet to achieve infrastructure as code. We use a Dockerfile to create the infrastructure: all the packages required to run the application, along with the application code itself. We run puppet inside the container to put down environment-specific configuration. We also use a tool called Rocker that adds some handy directives for use in our Dockerfile.  Most importantly, Rocker adds a directive called MOUNT which is used to share volumes between builds.  This allows us to mount local disk space, which is ideal for secrets that we do not want copied into the Docker image.  Rocker has to be cool, so they use a default filename of Rockerfile.   Let's take a look at a Rockerfile for one of our PHP applications, dropbox:

[Image: dropbox Rockerfile]

 

This image starts from our standard PHP image, which is kept up to date and patched weekly by the Cloud Services team.  From there we enable a couple of Apache modules that are needed by this application.  Then the application is copied into the directory /var/www/.

Now we mount a local directory that contains our ssh key and encryption keys.   After that we go into the puppet setup.  For our projects we use a masterless puppet setup which relies on librarian-puppet.  The advantage is that we do not need to run a puppet server and we can configure the node at the time we build the image.  For librarian-puppet we need a Puppetfile; for this project it looks like this:

[Image: dropbox Puppetfile]

The Puppetfile lists all the modules that you wish puppet to have access to, along with the git path to those modules.  In our case we have a single module for the dropbox application.  Since the dropbox module is stored in a private GitHub repository, we will use the ssh key we mounted earlier to access it.  In order to do this we will need to add github.com to our known_hosts file.  Running the command 'librarian-puppet install' will read the Puppetfile and install the module into /modules.  We can then use puppet to apply the module to our image.  We can control the environment-specific config to install using the "--environment" flag; you can see in our Rockerfile that the variable is templated out with "{{ .environment }}".  This allows us to specify the environment at build time.  After puppet is run we clean up some permission issues, then copy in our image startup script.  Finally we specify the ports that should be exposed when this image is run.  The build is run with a command like "rocker -var environment=development".
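The Rockerfile itself was shown as a screenshot above; a stripped-down, hypothetical version following the steps just described might look roughly like this (the base image name, mounted paths, module name, and ports are placeholders rather than the project's actual values):

# Hypothetical base image name; the real one is the standard, weekly-patched PHP image
FROM dtr.cucloud.net/cs/php:latest

# Enable the Apache modules this application needs
RUN a2enmod rewrite headers

# Copy the application code into the image
COPY . /var/www/

# Rocker-only directive: share the ssh key and eyaml keys with the build
# without copying them into the final image
MOUNT ./ssh:/root/.ssh
MOUNT ./keys:/keys

# Trust github, then pull the puppet modules listed in the Puppetfile
RUN ssh-keyscan github.com >> /root/.ssh/known_hosts && librarian-puppet install --path /modules

# Apply the environment-specific configuration; {{ .environment }} is a
# Rocker template variable supplied at build time
RUN puppet apply -e 'include dropbox' --modulepath /modules \
    --hiera_config /hiera.yaml --environment {{ .environment }}

# Fix permissions, add the startup script, and expose the web ports
RUN chown -R www-data:www-data /var/www
COPY start.sh /start.sh
CMD ["/start.sh"]
EXPOSE 80 443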

It is outside the scope of this article to detail how puppet is used; you can find details on puppet here. The puppet module is laid out like this:

[Image: dropbox puppet module layout]

The files directory is used to store static files, hiera-data is used to store our environment-specific config, manifests stores the puppet manifests, spec is for spec tests, templates is for storing dynamically generated files, and tests is for tests that are run to check for compilation errors.  Under hiera-data we will find an eyaml (encrypted yaml) file for each environment.  For instance, let us look at the one for dev:

[Image: development environment eyaml file]

You can see that the file format is that of a typical yaml file, with the exception of the fields we wish to keep secret.  These are encrypted by the hiera-eyaml plugin.  Earlier in the Rockerfile we mounted a "keys" folder which contains the private key to decrypt these secrets when puppet runs.  In order for hiera-eyaml to work correctly we have to adjust the hiera config; we store the following in our puppet project:

[Image: hiera.yaml configuration]

The backends setting is the order in which to prefer files; in our case we want to give precedence to eyaml.  Under the eyaml config we have to specify where the data files live as well as where to find the encryption keys.  When we run the puppet apply we have to specify the path to this config file with the "--hiera_config" flag.
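A hiera.yaml along the lines just described might look like this sketch (the datadir and key paths are illustrative, not the project's actual values):

---
:backends:
  - eyaml          # give precedence to encrypted yaml
  - yaml
:hierarchy:
  - "%{::environment}"
  - common
:yaml:
  :datadir: /modules/dropbox/hiera-data
:eyaml:
  :datadir: /modules/dropbox/hiera-data
  :pkcs7_private_key: /keys/private_key.pkcs7.pem
  :pkcs7_public_key: /keys/public_key.pkcs7.pem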

With this process we can use the same basic infrastructure to build out multiple environments for the dropbox application.  Using the hiera-eyaml plugin we can store the secrets in our puppet repository safely in GitHub, as they are encrypted.  Using Rocker we can keep our keys out of the image, which limits the exposure of secrets if the image were to be compromised.  Now we can either build this image on the host it will run on or push it to our private registry for later distribution.  Given that the image contains secrets like the database password, you should give careful consideration to where the image is stored.

How to run Jenkins in ElasticBeanstalk

by Shawn Bower

The Cloud Services team in CIT maintains Docker images for common pieces of software like Apache, Java, Tomcat, etc.  One of the images we maintain is a Cornellized Jenkins image.  This image contains Jenkins with the Oracle client and Cornell OID baked in.  One of the easiest ways to get up and running in AWS with this Jenkins instance is to use Elastic Beanstalk, which will manage the infrastructure components.  Using Elastic Beanstalk you don't have to worry about patching, as it will manage the underlying OS of your EC2 instances.  The Cloud Services team releases a patched version of the Jenkins image on a weekly basis; if you want to stay current, you just need to kick off a new deploy in Elastic Beanstalk.  Let's walk through the process of getting this image running on Elastic Beanstalk!

A.) Save Docker registry credentials to S3

INFO:

Read about using private Docker repos with Elastic Beanstalk.

We need to make our DTR credentials available to Elastic Beanstalk, so automated deployments can pull the image from the private repository.

  1. Create an S3 bucket to hold Docker assets for your organization – we use cu-DEPT-dockercfg
  2. Log in to Docker: docker login dtr.cucloud.net
  3. Upload the credentials from the local file ~/.docker/config.json to the S3 bucket as cu-DEPT-dockercfg/.dockercfg

    Unfortunately, Elastic Beanstalk expects an older version of this file, named .dockercfg.  The formats are slightly different; you can read about the differences here.

    For now, you'll need to manually create .dockercfg and upload it to the S3 bucket as cu-DEPT-dockercfg/.dockercfg (a sketch of this conversion follows).
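A rough Python sketch of that conversion and upload (the paths and bucket name follow the examples above; double-check the resulting file against the format documentation before relying on it):

import json
import boto3

# The newer ~/.docker/config.json nests registry credentials under an
# "auths" key; the older .dockercfg format is just that inner mapping.
with open('/home/youruser/.docker/config.json') as f:
    config = json.load(f)

dockercfg = config.get('auths', {})

# Upload the converted file to the bucket created in step 1.
s3 = boto3.client('s3')
s3.put_object(
    Bucket='cu-DEPT-dockercfg',
    Key='.dockercfg',
    Body=json.dumps(dockercfg),
)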

B.) Create IAM Policy to Read The S3 Bucket

    1. Select Identity and Access Management from the AWS management console.
    2. Select Policies.
    3. Select “Create Policy”.
    4. Select “Create Your Own Policy”.
    5. Create a policy named “DockerCFGReadOnly”; see the example policy provided.

Below is an example policy for reading from an S3 bucket.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1466096728000",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::cu-DEPT-dockercfg"
            ]
        },
        {
            "Sid": "Stmt1466096728001",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:HeadObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::cu-DEPT-dockercfg/.dockercfg"
            ]
        }
    ]
}

 

C.) Setup the Elastic Beanstalk environment

  1. Create a Dockerrun.aws.json file. Here's an example:

    {
      "AWSEBDockerrunVersion": "1",
      "Image": {
        "Name": "dtr.cucloud.net/cs/jenkins:latest"
      },
      "Ports": [
        {
          "ContainerPort": "8080"
        }
      ],
      "Authentication": {
        "Bucket": "cu-DEPT-dockercfg",
        "Key": ".dockercfg"
      },
      "Volumes": [
        {
          "HostDirectory": "/var/jenkins_home",
          "ContainerDirectory": "/var/jenkins_home"
        },
        {
          "HostDirectory": "/var/run/docker.sock",
          "ContainerDirectory": "/var/run/docker.sock"
        }
      ]
    }
    

    The Authentication section refers to the Docker registry credentials that were saved to S3.

    The Image section refers to the Docker image that was pushed to the private registry (dtr.cucloud.net).

  2. We will also need to do some setup on the instance using .ebextensions.  Create a folder called “.ebextensions” and inside that folder create a file called “instance.config”.  Add the following to the file:

    container_commands:
      01-jenkins-user:
        command: useradd -u 1000 jenkins || echo 'User already exist!'
      02-jenkins-user-groups:
        command: usermod -aG docker jenkins
      03-jenkins-home:
        command: mkdir /var/jenkins_home || echo 'Directory already exist!'
      04-changeperm:
        command: chown jenkins:jenkins /var/jenkins_home
    

     

  3. Finally, create a zip file with the Dockerrun.aws.json file and the .ebextensions folder:
    zip -r jenkins-stalk.zip Dockerrun.aws.json .ebextensions/ 
    

 

 

D.) Setup Web Server Environment

  1. Choose the Docker platform and a load-balancing, auto-scaling environment.
  2. Select the local zip file we created earlier (jenkins-stalk.zip) as the “Source” in the application version section.
  3. Set an appropriate environment name; for example, you could use jenkins-prod.
  4. Complete the configuration details.

    NOTE: There are several options beyond the scope of this article.  We typically configure the following: [Screenshot: deployment configuration options]

  5. Complete the Elastic Beanstalk wizard and launch.  If you are working with a standard Cornell VPC configuration, make sure the ELB is in the two public subnets while the EC2 instances are in the private subnets.
  6. NOTE: You will encounter additional AWS features like security groups, etc.  These topics are beyond the scope of this article.  If presented with a check box for launching inside a VPC, you should check it.

    The container will not start properly the first time. Don’t panic.  
     
    We need to attach the IAM Policy we built earlier to the instance role used by Elastic Beanstalk.

  7. Select Identity & Access Management from the AWS management console.
  8. Select “Roles”, then select “aws-elasticbeanstalk-ec2-role”.
  9. Attach the “DockerCFGReadOnly” policy to the role.
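If you prefer to script this step, the same attachment can be done with boto3 (the account ID in the policy ARN is a placeholder):

import boto3

iam = boto3.client('iam')

# Attach the customer-managed DockerCFGReadOnly policy to the instance role
# that Elastic Beanstalk assigns to its EC2 instances.
iam.attach_role_policy(
    RoleName='aws-elasticbeanstalk-ec2-role',
    PolicyArn='arn:aws:iam::123456789012:policy/DockerCFGReadOnly',
)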

 

E.) Re-run the deployment in Elastic Beanstalk.  You can just redeploy the current version.

 

  1. Now find the URL of your Jenkins environment.
  2. Launch Jenkins.

SUCCESS!

 

F.) (optional) Running docker commands inside Jenkins

The Jenkins image comes with docker preinstalled, so you can run docker builds and deploys from Jenkins.  In order to use it, we need to make a small tweak to the Elastic Beanstalk configuration.  This is because we keep the docker version inside the image patched and on the latest commercially supported release, while Elastic Beanstalk currently supports docker 1.9.1.  To get things working, we need to add an environment variable that tells the docker CLI to use an older docker API.  First, go to the configuration page and select the cog icon under Software Configuration.

[Screenshot: Software Configuration section]

Now we need to add a new environment variable, DOCKER_API_VERSION, and set its value to 1.21.

[Screenshot: adding the DOCKER_API_VERSION environment variable]

That is it! Now you will be able to use the docker CLI in your Jenkins jobs.

 

Conclusion

Within a few minutes you can have a managed Jenkins environment hosted in AWS.
There are a few changes you may want to consider for this environment.

  • Changing the autoscaling group to a minimum of 1 and a maximum of 1 makes sense, since the Jenkins state data is stored on a local volume.  Having more than one instance in the group would not be useful.
  • Also, because the state data, including job configuration, is stored on a local volume, you will want to make sure to back up the EBS volume for this instance.  You could also look into a NAS solution such as Elastic File System (EFS) to store state for Jenkins; this would require a modification to the /var/jenkins_home path.
  • It is strongly encouraged that an HTTPS (SSL) listener is used for the Elastic Load Balancer (ELB) and that the HTTP listener is turned off, to avoid sending credentials in plain text.

 

The code used in this blog is available at https://github.com/CU-CloudCollab/jenkins-stalk. Please feel free to use and enhance it.

If you have any questions or issues please contact the Cloud Services team at cloud-support@cornell.edu

DevOps: Failure Will Occur

by Shawn Bower

The term DevOps is thrown around so much that it is hard to pin down its meaning.  In my mind, DevOps is about a culture shift in the IT industry.  It is about breaking down silos, enhancing collaboration, and challenging fundamental design principles.  One principle that has been turned on its head by the DevOps revolution is the “no single point of failure” design principle. This principle asserts simply that no single part of a system can stop the entire system from working. For example, in a financial system the database server is a single point of failure: if it crashes, we cannot continue to serve clients in any fashion.  In DevOps we accept that failure is the norm, and we build our automation with that in mind.  In AWS we have many tools at our disposal, like auto scaling groups, elastic load balancers, multi-AZ RDS, DynamoDB, S3, etc.  When architecting for the cloud, keeping these tools in mind is paramount to your success.

When architecting a software system there are a lot of factors to balance. We want to make sure our software is working and performant as well as cost effective.  Let's look at a simple example of building a self-healing website that requires very little infrastructure and can be done at low cost.

The first piece of infrastructure we will need is something to run our site.  If it's a small site, we could easily run it on a t2.nano in AWS, which would cost less than 5 dollars a month.  We will want to launch this instance with an IAM profile that includes the policy AmazonEC2RoleforSSM.  This will allow us to send commands to the EC2 instance.  We will also want to install the SSM agent; for full details please see: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-ssm-agent.html.  Once we have our site up, we will want to monitor its health. At Cornell you can get free access to the Pingdom monitoring tool.  Using Pingdom you can monitor your site's endpoint from multiple locations around the world and get alerted if your site is unreachable.  If you don't already have a Pingdom account, please send an email to cloud-support@cornell.edu.  So now that we have our site running and a Pingdom account, let's set up an uptime monitor.
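For illustration, launching such an instance with an SSM-enabled instance profile might look like this with boto3 (the AMI, subnet, and profile name are placeholders):

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

# Launch a small instance with an instance profile that carries the
# AmazonEC2RoleforSSM policy, so SSM can later send commands to it.
ec2.run_instances(
    ImageId='ami-12345678',
    InstanceType='t2.nano',
    MinCount=1,
    MaxCount=1,
    SubnetId='subnet-aaaa1111',
    IamInstanceProfile={'Name': 'ssm-managed-instance'},
)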


We are doing great!  We have a site, we are monitoring it, and we will be alerted to any downtime.   We can now take this one step further and programmatically react to Pingdom alerts using their custom webhook notifier.  We will have to build an endpoint for Pingdom to send the alert to.  We could use Lambda and API Gateway, which is a good choice for many reasons.  If we want, we could start even simpler by creating a simple Sinatra app in Ruby.

[Image: pingdom-webhook Sinatra application code]

This is a very simple bit of code that could be expanded on.  It creates an endpoint called “/webhook” which first looks for an api-key query parameter.  This application should be run using SSL/TLS, as it sends the key in clear text.  That key is compared against an environment variable that should be set before the application is launched.  This shared key is a simple security mechanism, only in place to stop a random person from hitting the endpoint.  For this example it is good enough, but it could be vastly improved upon.  Next we look at the data that Pingdom has sent; for this example we will only react to DOWN alerts.  If we have an alert in the DOWN state, then we will query a table in DynamoDB that tells us how to react to this alert.  The schema looks like:

[Image: DynamoDB table schema for Pingdom checks]

  • check_id – This is the check id generated by Pingdom
  • type – the plugin type to use to respond to the Pingdom alert.  The only implemented plugin is SSM, which uses Amazon's SSM to send a command to the EC2 host.
  • instance_id – This is the instance id of the ec2 machine running our website
  • command – This is the command we want to send to the machine

We will use the type from our Dynamo table to respond to the down alert.  The sample code I have provided only has one type, which uses Amazon's SSM service to send commands to the running EC2 instance.  The plugin code looks like:
[Image: ssm.rb plugin code]

This function takes the data passed in and sends the command from our Dynamo table to the instance.  The full sample code can be found at https://github.com/CU-CloudCollab/pingdom-webhook.  Please feel free to use and improve this code.  Now that we have a simple webhook app, we will need to deploy it to an instance in AWS.  That instance will have to use an IAM profile that allows it to read from our Dynamo table as well as send SSM commands.  Again we can use a t2.nano, so our cost at this point is approximately 10 dollars a month.
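The plugin itself is Ruby (see the repository for the real code), but the underlying call it wraps is essentially SSM's send_command. A boto3 equivalent looks roughly like this, with the instance ID and command standing in for the values read from the DynamoDB table:

import boto3

ssm = boto3.client('ssm', region_name='us-east-1')

# Send the remediation command from the DynamoDB item to the instance
# running the website (e.g. restart the web server).
ssm.send_command(
    InstanceIds=['i-0123456789abcdef0'],
    DocumentName='AWS-RunShellScript',
    Parameters={'commands': ['sudo service nginx restart']},
)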

We need to make Pingdom aware of our new webhook endpoint.  To do that, navigate to “Integrations” and click “Add integration.”

[Screenshot: Pingdom Add Integration page]

The next form will ask for information about your endpoint.  You will have to provide the DNS name for this service.  While you could just use the IP of the machine, it's highly encouraged to use a real host name with SSL.

[Screenshot: Pingdom integration settings form]

Once you have added the integration, it can be used by any of the uptime checks.  Find the check you wish to use and click the edit button.

[Screenshot: editing an uptime check]

Then scroll to the bottom of the settings page and you will see a custom hooks section.  Select your hook and you are all done!

[Screenshot: custom hooks section of the check settings]

This is a simple and cost-effective solution to provide self-healing for web applications.  We should always expect that failure will occur and look for opportunities to mitigate its effects.  DevOps is about taking a holistic approach to your application: looking at the infrastructure side, as we did in this blog post, but also looking at the application itself.  For example, move to application architectures that are stateless.  Most importantly, automate everything!