by Eric Grysko
Introduction
In Student Services IT, we develop applications that support the student experience. Class Roster, classes.cornell.edu, was launched in late 2014 after several months of development in coordination with the Office of the University Registrar. It was deployed on-premises and faced initial challenges handling load just after launch. Facing limited options to scale, we provisioned for peak load year-round, despite predictable cyclical load following the course-enrollment calendar.
By mid-2015, Cornell had signed a contract with Amazon Web Services, and Cornell’s Cloudification Team was inviting units to collaborate and pilot use of AWS. Working with the team was refreshing, and Student Services IT dove in. We consumed training materials, attended re:Invent, threw away our old way of doing things, and began thinking cloud and DevOps.
By late 2015, we were starting on the next version of Class Roster – “Scheduler”. Displaying class enrollment status (open, waitlist, closed) with near real-time data meant we couldn’t rely on long cache lifetimes, and new scheduling features were expected to grow peak concurrent usage significantly. We made the decision: Class Roster would be our unit’s first high-profile migration to AWS.
Key Ingredients
Cloud and DevOps meant new thinking for our unit, and AWS provided both new opportunities and many new technologies to consider. We didn’t want to simply “lift and shift” our current environment, but rather consider new ingredients for our new environment. Our current ingredient list includes:
- Ansible – Our automation engine for provisioning instances, maintaining security groups, configuring instances, deploying our apps, creating AMIs and launch configs, and more. We leverage Ansible’s dynamic inventory and keep a YAML properties file with each account’s settings.
- Jenkins – One instance in each account responsible for executing our Ansible playbooks as well as backup jobs.
- Performance
  - ELB (“Classic”) with Route53 failover records to a highly available sorry service.
  - On-Demand instances and an Auto Scaling Group.
- Security
  - Dual Account Strategy – We decided to keep separate accounts for non-prod and prod. We can lock prod down quite tightly without complex IAM policy documents to separate non-prod resources.
  - Public/Private Subnets – Pretty standard architecture: one VPC in each account, each spanning two AZs, with a public and a private subnet in each AZ.
  - NAT Gateway – Our web servers in our private subnets need to fetch package dependencies on app deploy.
  - Ansible Vault – Credential store for our app passwords.
- Monitoring & Logging
  - CloudWatch Alarms – Alarms for CPU usage and instance health. Alarms notify SNS topics with a standard Cornell list as well as select SMS recipients (a sketch of one such alarm follows this list).
  - Pingdom, Papertrail – Third-party monitoring and log management.
- Backup
  - RDS MySQL – RDS is a wonderful thing. We use automated snapshots.
  - MongoDB backups – mongodump executed via a Jenkins scheduled job and stored on S3, with a lifecycle policy to expire backups after 60 days (the lifecycle rule is sketched after this list).
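To give a concrete flavor of the alarm setup, here is a minimal boto3 sketch of a CPU alarm that notifies an SNS topic. The alarm name, ASG name, threshold, and topic ARN are placeholders rather than our production values, and in practice we drive this through Ansible.

```python
# Sketch: a CloudWatch CPU alarm that notifies an SNS topic.
# The alarm name, ASG name, threshold, and topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="class-roster-web-cpu-high",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "class-roster-web-asg"}],
    Statistic="Average",
    Period=300,               # 5-minute periods
    EvaluationPeriods=2,      # sustained for 10 minutes
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:class-roster-alerts"],
)
```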
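Similarly, the 60-day expiry on the MongoDB backup bucket is just one S3 lifecycle rule. A sketch, with a hypothetical bucket name and key prefix:

```python
# Sketch: expire MongoDB dump objects after 60 days via an S3 lifecycle rule.
# Bucket name and key prefix are hypothetical.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="class-roster-backups",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-mongodumps",
                "Filter": {"Prefix": "mongodump/"},
                "Status": "Enabled",
                "Expiration": {"Days": 60},
            }
        ]
    },
)
```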
Challenges
Along the way, there were a number of small to medium challenges.
1. Central system connectivity
Our biggest connectivity hurdle was to Cornell’s IBM-hosted PeopleSoft environment. The Cloudification Services Team facilitated our discussion, and we had holes punched from our NAT Gateway’s EIP to encrypted listeners. For services still on-premises, we connect back to campus via a VPN Gateway.
2. Sorry Service
Our unit had long ago set up a simple statically hosted environment for the hardware load balancers to fail over to in the event our front end became unavailable. Moving to AWS meant we needed a new way to do this. We ultimately spun up a simple, highly available “StaticWeb” service, which has proved useful in all sorts of other ways, but it did mean a larger project scope.
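For the curious, the failover piece is a pair of Route53 record sets: a PRIMARY alias to the ELB and a SECONDARY alias to StaticWeb. The boto3 sketch below uses placeholder zone IDs and DNS names; our actual records are managed via Ansible.

```python
# Sketch: Route53 failover records. PRIMARY aliases the ELB,
# SECONDARY aliases the "StaticWeb" sorry service.
# All hosted zone IDs and DNS names are placeholders.
import boto3

route53 = boto3.client("route53")

def failover_record(identifier, role, alias_zone_id, alias_dns_name):
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "classes.cornell.edu.",
            "Type": "A",
            "SetIdentifier": identifier,
            "Failover": role,                 # "PRIMARY" or "SECONDARY"
            "AliasTarget": {
                "HostedZoneId": alias_zone_id,
                "DNSName": alias_dns_name,
                "EvaluateTargetHealth": True,  # fail over when the primary target is unhealthy
            },
        },
    }

route53.change_resource_record_sets(
    HostedZoneId="Z_EXAMPLE_ZONE",
    ChangeBatch={
        "Changes": [
            failover_record("primary", "PRIMARY", "Z_ELB_ZONE",
                            "class-roster-elb.example.elb.amazonaws.com."),
            failover_record("secondary", "SECONDARY", "Z_STATICWEB_ZONE",
                            "staticweb.example.cornell.edu."),
        ]
    },
)
```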
3. Ansible Module Maturity
While Ansible is a very active project, at the time of deployment some cloud modules had notable shortcomings or bugs:
- ec2_elb_lb – included bugs for specific paths in health checks.
- ec2_tag – was unable to tag all the resources we needed. Since Ansible is Python-based, we created a simple module to fill the gap (a sketch follows this list).
- rds – which is based on boto, did not support gp2 volumes. We used awscli to manually convert volume types.
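That gap-filling module was only a short Python script. The sketch below shows the general shape of such a module; it is simplified, uses boto3 rather than the boto of the day, and the argument names are illustrative rather than our exact interface.

```python
#!/usr/bin/python
# Sketch of a minimal custom Ansible module that tags an arbitrary EC2/VPC
# resource by ID. Simplified; the real module differs in details.
import boto3
from ansible.module_utils.basic import AnsibleModule

def main():
    module = AnsibleModule(
        argument_spec=dict(
            resource_id=dict(required=True, type="str"),
            tags=dict(required=True, type="dict"),
            region=dict(required=True, type="str"),
        ),
        supports_check_mode=True,
    )

    ec2 = boto3.client("ec2", region_name=module.params["region"])
    tags = [{"Key": k, "Value": v} for k, v in module.params["tags"].items()]

    if not module.check_mode:
        # create_tags works against most EC2/VPC resource types
        # (subnets, route tables, NAT gateways, ...), not just instances.
        ec2.create_tags(Resources=[module.params["resource_id"]], Tags=tags)

    module.exit_json(changed=True, resource_id=module.params["resource_id"], tags=tags)

if __name__ == "__main__":
    main()
```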
4. Development Pipeline/Mindset Change
Using an Auto Scaling Group meant AMIs, which meant golden images. Deploying a “quick fix” to prod now meant a new AMI, a new launch config, and an updated ASG.
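In API terms, the rollout for even a one-line fix looks roughly like the boto3 sketch below (instance IDs, names, and the security group are placeholders; in practice our Ansible playbooks drive the equivalent steps):

```python
# Sketch: the "quick fix" rollout. Bake an AMI from a configured instance,
# create a new launch configuration, and point the ASG at it.
# All names and IDs below are placeholders.
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

version = time.strftime("%Y%m%d-%H%M")

# 1. Bake a golden image from the freshly configured build instance.
image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",
    Name=f"class-roster-web-{version}",
)
ec2.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])

# 2. Create a new launch configuration referencing the new AMI.
autoscaling.create_launch_configuration(
    LaunchConfigurationName=f"class-roster-web-{version}",
    ImageId=image["ImageId"],
    InstanceType="c4.large",
    SecurityGroups=["sg-0123456789abcdef0"],
)

# 3. Point the ASG at the new launch configuration; instances pick it up
#    as they are replaced.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="class-roster-web-asg",
    LaunchConfigurationName=f"class-roster-web-{version}",
)
```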
5. Stateless
Using an ASG with instances coming and going really meant thinking as stateless as possible.
- Long Jobs – To avoid running long-running processing jobs on ASG or Spot instances, our app’s job processor needed modifications so it could be disabled by environment variables injected via user data (sketched after this list).
- Session Store – On-premises we used long-lived VMs and filesystem-based sessions. In AWS, Class Roster now stores sessions in MongoDB.
- CUWebAuth – Cornell’s custom auth module makes use of a stateful cache. This prevented us from disabling stickiness entirely, but did allow us to reduce the TTL to <30 minutes.
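As an illustration of the long-jobs change, the processor just honors a flag set in user data; the variable name below is hypothetical:

```python
# Sketch: the job processor checks a flag injected via EC2 user data, so
# auto-scaled / spot instances can skip long-running jobs entirely.
# The variable name is hypothetical.
import os

def job_processor_enabled() -> bool:
    # User data on ASG/spot instances exports JOB_PROCESSOR_ENABLED=false;
    # long-lived on-demand instances leave it unset (enabled by default).
    return os.environ.get("JOB_PROCESSOR_ENABLED", "true").lower() != "false"

def maybe_run_jobs(run_pending_jobs):
    if not job_processor_enabled():
        return  # this instance only serves web traffic
    run_pending_jobs()
```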
Planning for Load
Development occurred on t2.micro instances, which didn’t provide any sense of what might be possible for handling load. AWS’s utility billing meant we could provision a test environment truly identical to prod at a fraction of the cost – run benchmarks and tear it down.
We spun up a spot m4.xlarge instance with enhanced networking and used Siege with a long list of pain-point pages and actions (high CPU, low cache) against the test environment – repeatedly.
Some specific areas tested:
- Instance Family – The choice was between c4 and m4. Testing revealed CPU constraints long before memory, so c4 it was.
- Instance Type – We then tested c4.large versus its bigger brothers, from c4.2xlarge all the way up to c4.8xlarge. Performance tests revealed near-linear scaling. Given no advantage to vertical scaling and a preference for horizontal scaling, we stuck to c4.large.
- Availability Zones – With our DBs living in one AZ, testing revealed a noticeable impact if we placed web servers in another AZ. While cross-AZ latency is very low, it was noticeable.
Ultimately we chose to run two on-demand c4.large instances and an ASG with spot instances, using CloudWatch Alarms and scheduled events as needed. Since pre-enroll events are rigidly scheduled, we can scale out for a 2-3 hour window and then scale in without interruption to users.
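A pre-enroll window then reduces to a pair of scheduled scaling actions, roughly like the boto3 sketch below (group name, sizes, and times are illustrative only):

```python
# Sketch: scale the ASG out for a pre-enroll window and back in afterwards.
# Group name, sizes, and times are illustrative only.
from datetime import datetime, timezone
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Scale out shortly before pre-enroll opens...
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="class-roster-web-asg",
    ScheduledActionName="pre-enroll-scale-out",
    StartTime=datetime(2016, 11, 7, 11, 30, tzinfo=timezone.utc),
    MinSize=2,
    MaxSize=8,
    DesiredCapacity=6,
)

# ...and back in once the 2-3 hour rush is over.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="class-roster-web-asg",
    ScheduledActionName="pre-enroll-scale-in",
    StartTime=datetime(2016, 11, 7, 15, 0, tzinfo=timezone.utc),
    MinSize=2,
    MaxSize=8,
    DesiredCapacity=2,
)
```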
Happy Cloud Launch
In Spring 2016, Class Roster migrated to AWS and launched “Scheduler”. We’ve now been cloud-happy for more than six months. Recently, we had our busiest single day ever, with more than 23k user sessions and 117k page views. No slowdowns; 100% available.
Discussions are now beginning on the next major set of features for Class Roster. Usage will grow again. I’m excited to continue to leverage a range of AWS offerings!
Eric Grysko is a Software Engineer in Student Services IT, AWS Certified, and looking forward to Class Roster in 2017.