The Cornell “Standard” AWS VPC 2.0

By Paul Allen

In a previous post, I described the standard VPC configuration we use for Cornell AWS accounts requiring network connectivity back to the campus network. This post shares minor updates to that configuration. Differences from the original are:

  • Using AWS Direct Connect instead of a VPN to establish network connectivity between campus and AWS VPCs. Our current primary Direct Connect link is 1 Gbps, and our secondary connection is 100 Mbps.
  • Continued allocation of a /22 CIDR block (1024 addresses) to the VPC, but no longer allocating all of those addresses to subnets within the VPC. This allows for future customization of the VPC without having to vacate and delete /24 subnets as was necessary for VPC customization with the original design.
  • Reducing the size of the four subnets to /26 CIDR blocks (64 addresses) instead of /24 CIDR blocks (256 addresses). This allows the flexibility described above, while still allowing /24 subnets to be created as part of VPC customizations.
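
For illustration only (the CIDR range, VPC ID, and availability zones below are placeholders, not an actual Cornell allocation), carving a /22 VPC into four /26 subnets with the AWS CLI looks roughly like this:

$ aws ec2 create-vpc --cidr-block 10.92.0.0/22
$ aws ec2 create-subnet --vpc-id vpc-0abc123 --cidr-block 10.92.0.0/26 --availability-zone us-east-1a
$ aws ec2 create-subnet --vpc-id vpc-0abc123 --cidr-block 10.92.0.64/26 --availability-zone us-east-1b
$ aws ec2 create-subnet --vpc-id vpc-0abc123 --cidr-block 10.92.0.128/26 --availability-zone us-east-1a
$ aws ec2 create-subnet --vpc-id vpc-0abc123 --cidr-block 10.92.0.192/26 --availability-zone us-east-1b

The remaining space in the /22 (10.92.1.0/24 through 10.92.3.0/24 in this example) stays unallocated for later customization.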

Cornell Standard VPC in AWS version 2.0

Amazon AppStream 2.0 for Neuroscience Research

By Bertrand Reyna-Brainerd

It will come as news to few of you that the pace of advancement in computer technology has exceeded all expectations.  In 2013, 83.8% of American households owned at least one personal computer.[1]  In 1984, when the US census first began to gather data on computer ownership, that number was only 8.2%.  At that time, a debate was raging about the scope of personal computing.  Would every household need to own a computer, or would they remain a specialized tool for researchers and IT professionals?  Even the mere existence of that debate seems inconceivable today, and yet the era of personal computing may already be drawing to a close.  This is not due to a decline in the need for or utility of computing resources, but to the explosive growth of cloud computing.  Amazon Web Services’ AppStream 2.0 is a significant step toward “cloudification,” and toward a future in which all forms of computation are increasingly delocalized.

Our laboratory, which labors under the title of “The Laboratory for Rational Decision Making,” applies a variety of approaches in its search for patterns in human behavior and the brain activity that underlies them.  In 2016, David Garavito, a J.D./PhD graduate student at Cornell, Joe DeTello, a Cornell undergraduate, and Valerie Reyna, PhD, began a study on the effects of risk perception on sports-related concussions.  Joe is a former high school athlete who has suffered a number of concussions himself during play, and is therefore able to provide personal, valuable insights into how these injuries occur and why they are so common.

Bertrand Reyna-Brainerd
David Garavito
Valerie Reyna, PhD
Joe DeTello

The team had previously conducted research for this project by physically traveling to participating high schools and carrying a stack of preconfigured laptops.  This is not an uncommon way of conducting research in the field, but it imposes several limitations, such as the cost of travel, the number of students who can be instructed in a room at one time, and the number of laptops that can be purchased, carried, and maintained.  What’s more, the type of information required for the research could easily be collected online: they only needed the typical sort of demographic data that has long been gathered using online polls, along with measures such as memory accuracy and reaction time.  However, most of this legwork was performed by software designed to operate only in a physical, proctor-to-subject environment.

 

At that point, Joe and David came to me and asked if we could find a better solution.

 

Marty Sullivan, Cloud Engineer

I was not initially sure if this would be possible.  For our data to be comparable with data gathered on one laptop at a time, our applications would have to be reprogrammed from scratch to run on the web.  However, I agreed to research the problem and consulted with the IT@Cornell Cloudification Team. Marty Sullivan, Cloud Engineer, helped us discover that Amazon Web Services (AWS) and Cornell were testing AWS’s then-unreleased product: AppStream 2.0.  Amazon AppStream 2.0 is a fully featured platform for streaming software, specifically desktop applications, through a web browser. The most attractive feature is that end users are not required to install any software; they can simply access it from their preferred personal device.

 

Figure 1 – Screenshot of our assessment software running on an ordinary desktop.

Figure 2 – The same software running on AppStream.

Figure 3 – AppStream 2.0 Application Selection Screen

 

The AppStream 2.0 environment has a number of advantages over a traditional remote desktop approach.  When a user connects to our system, they are presented with an option to choose an application. After selecting an application, only that application and its descendants are displayed to the user.  This avoids several of the usual hassles of virtual computing, such as having to sandbox users away from the full Windows desktop environment.  Instead, the system is ready to go at launch, and the only barriers to entry are access to a web browser and an activated URL.
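
As a rough sketch of what generating such an activated URL looks like (the stack, fleet, and user names here are placeholders, not our production configuration), the AWS CLI can mint a time-limited streaming link:

$ aws appstream create-streaming-url --stack-name concussion-study --fleet-name concussion-fleet --user-id participant-001 --validity 3600

The returned URL can then be sent to a participant, who only needs a browser to open it.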

Once the virtual environment has been prepared and the necessary software has been installed and configured, the applications are packaged into an instance that can then be delivered live over the web to any number of different devices. Macs, PCs, tablets, and any other device with a web browser can be used to view and interact with your applications!  This package periodically resets to its original configuration, much as Deep Freeze operates in several of Cornell’s on-campus computer laboratories.  While periodic resets would be inconvenient in other contexts, this behavior is ideal for scientific work, as it ensures that the choices made by one participant do not influence the environment of the next.
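
Roughly speaking, that packaging step amounts to building an image and serving it from a fleet attached to a stack. A minimal sketch, with hypothetical names and a deliberately small fleet size (not our production settings):

$ aws appstream create-fleet --name concussion-fleet --image-name concussion-image --instance-type stream.standard.medium --compute-capacity DesiredInstances=2
$ aws appstream create-stack --name concussion-study
$ aws appstream associate-fleet --fleet-name concussion-fleet --stack-name concussion-study
$ aws appstream start-fleet --name concussion-fleet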

It is difficult to overstate the potential of this kind of technology.  Its advantages are significant and far-reaching:

  • Scalability: Because Amazon’s virtual machine handles most of the application’s processing demands, the system can “stretch” to accommodate peak load periods in ways that a local system cannot. To put it another way, if a user can run the AppStream web application, they can run the software associated with it.  This means that software that would otherwise be too large or too demanding to run on a phone, tablet, or Chromebook can now be accessed through AppStream 2.0.
  • Ease of use: End users do not have to install new software to connect. Applications that would normally require installing long dependency chains, compilation, or complicated configuration procedures can be prepared in advance and delivered to end users in their final, working state.  AppStream 2.0 is also more user friendly than traditional remote desktop approaches: all a user needs to do is follow a web link.  It is not necessary to create separate accounts for each user, although there is a mechanism for doing so.
  • Security: AppStream 2.0 can selectively determine what data to destroy or preserve between sessions. Because the virtual machine running our software is not directly accessible to users, this improves both security and ease of use (a rare pairing in information technology) over a traditional virtual approach.
  • Cross-platform access: Because applications are streamed from a native Windows platform, compatibility layers (such as WINE), partition utilities (such as Apple’s BootCamp), or virtual desktop and emulation software (such as VirtualBox or VMWare) are not needed to access Windows software on a non-Windows machine. This also avoids the performance cost and compatibility issues involved in emulation.  Another advantage is that development time can be reduced if an application is designed to be delivered through AppStream.
  • Compartmentalized deployment: Packaging applications into virtual instances enforces consistency across environments, which aids in collaborative development. In effect, AppStream instances can perform the same function as Docker, Anaconda, and other portable environment tools, but with a lower technical barrier for entry.

 

Disadvantages:

  • Latency: No matter how optimized an online delivery system may be, streaming an application over the web will always impose additional latency compared to running the software locally. Software that depends on precise timing to operate, such as video games (or, in our case, response-time assessments), may be negatively affected.  In our case, we have not experienced latency above a barely perceptible 50 ms.  Other users have reported delays of as much as 200 ms, varying by region.
  • Compatibility: Currently, AppStream 2.0 offers only Windows Server 2012 instances, and access to hardware-accelerated graphics such as OpenGL is not supported. This is expected to change in time.  As for the software itself, getting an application running within the AppStream environment often requires additional configuration steps beyond an ordinary install procedure.
  • Cost: To use AppStream 2.0 you must pay a recurring monthly fee based on usage. This includes costs associated with bandwidth, processing, and storage.  Expect a bill on the order of tens to low hundreds of U.S. dollars per month, though your costs will differ depending on use.

 

Ultimately, the impact of AppStream 2.0 will be determined by its adoption.  The largest hurdle the product will need to clear is its cost, but its potential is enormous.  If Amazon leverages its position to purchase and deploy hardware at low rates, and passes those savings on to its customers, it will be able to provide a better product at a lower price than the PC market.  And if Amazon does not do this, another corporation will, and the era of personal computing as we have known it will come to an end.

[1] U.S. Census Bureau. (2013). Computer and Internet Use in the United States: 2013. Retrieved March 29, 2017, from https://www.census.gov/history/pdf/2013computeruse.pdf.

Redshift at Vet College

Data, Vet College Style
Using AWS Redshift, AWS Batch, NodeJS, and Docker to manage data in the cloud
Team Members: Craig Riecke, Daniel Sheehan, and Scott Ross (sr523)

In 2015, the Animal Health Diagnostic Center at Cornell University changed its laboratory information management system, implementing VetView as a replacement for the legacy system, UVIS.  With limited time, budget, and resources, our team was forced to deploy our main business intelligence tool, Tableau, directly over VetView’s operational data model, which used Oracle.  While this worked initially, giving our business staff access to data, we quickly realized that we were outgrowing the constraints inherent in running Tableau against an operational data model.  These constraints included speed and performance issues, business logic embedded in SQL, duplication of SQL across various Tableau dashboards, and data siloed in that operational model.  At the end of the day, we needed something faster, with less complexity for our report writers, that could serve as a building block for the future.

This blog post is the story of how the VMIT development team moved data to Redshift, with a discussion of Redshift’s capabilities and the tooling used to move the data.

Redshift:

Pros:

  • It is affordable to start: I am not saying that it is affordable, period. Redshift can get expensive if you shove all your data into it, especially if your data is growing fast. That said, compared to its on-premises and appliance-based alternatives, Redshift is incredibly cheap to start with, at $0.25 per hour or about $180 per month.  We also had a 30-day free trial.
  • It is mostly PostgreSQL: Redshift has a different query execution engine, but its interface looks like Postgres.  This means there are rich client tools, and connectivity is available via both JDBC and ODBC.
  • It plays nice with Tableau:  Because of the above, it plays very nicely with Tableau and other BI tools.   This is incredibly important for our group because we are targeting BI tools.
  • Scalability: it’s horizontally scalable. Need it to go faster? Just add more nodes…
  • AWS:  It’s part of the AWS ecosystem.  So while other tools, like Google’s BigQuery, are interesting, we have settled on AWS as our platform for infrastructure.

Cons:

  • Loading data can be tricky: The process we use to load data into Redshift requires extracting the data from our source database, creating a CSV file, loading that CSV file to S3, and then using the COPY command to get the data into Redshift (a sketch of the last two steps appears below).  Parallel processing is very important in this step to reduce the time the ETLs take.  It does take time to master this process.
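
Here is a sketch of those last two steps for a single table; the bucket, table, role ARN, and cluster endpoint below are invented for illustration:

$ aws s3 cp accessions.csv s3://vet-dw-etl/nightly/accessions.csv
$ psql "host=vet-dw.abc123.us-east-1.redshift.amazonaws.com port=5439 dbname=dw user=etl" -c "COPY staging.accessions FROM 's3://vet-dw-etl/nightly/accessions.csv' IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' CSV IGNOREHEADER 1;"

Splitting the extract into multiple compressed files lets COPY load them in parallel across slices, which is where the parallelism mentioned above pays off.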

Things to think about:

  • A distkey/sortkey is required for optimal performance.  This was a new concept for our developer group, and it took us some time to think through (and will take more time as we move into the future); see the sketch after this list.
  • Compute and storage are tied together.  There are some interesting talks about how this is an anti-pattern for data warehousing.  For us, this doesn’t matter at the moment.
  • Multi-AZ is painful
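
To make the distkey/sortkey point concrete, here is what the choice looks like in a table definition (hypothetical table and columns, not our actual schema):

$ psql "host=vet-dw.abc123.us-east-1.redshift.amazonaws.com port=5439 dbname=dw user=etl" -c "CREATE TABLE dw.test_results (accession_id BIGINT, test_code VARCHAR(16), result_value VARCHAR(256), resulted_at TIMESTAMP) DISTKEY (accession_id) SORTKEY (resulted_at);"

Rows sharing a distkey land on the same slice, which keeps joins on that column local, while the sortkey speeds range filters such as date windows.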

Decision:
At the end of the day, we were basically comparing Redshift to other RDBMS solutions hosted in AWS RDS.  We settled on Redshift more because of our future needs than our current needs. And so far, we love it.

Setting up Redshift:

One of the tenets of our developer group when building solutions is to implement infrastructure as code.  In that vein, here are the steps to build the Redshift cluster from its JSON configuration:

Steps:

  1. Install the aws cli
  2. Run this command: $ aws redshift create-cluster --cli-input-json file://cluster.json
  3. Create users / schemas

The output of the above should give you the endpoint of your cluster. Use that information to connect your SQL tool to the new Redshift cluster. Amazon recommends SQL Workbench/J, but other tools, such as DbVisualizer, work just as well.
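
For example (the cluster identifier, schema, and credentials below are placeholders), the endpoint can also be pulled back later from the CLI, and step 3 is just ordinary SQL run against that endpoint:

$ aws redshift describe-clusters --cluster-identifier vet-dw --query "Clusters[0].Endpoint"
$ psql "host=vet-dw.abc123.us-east-1.redshift.amazonaws.com port=5439 dbname=dw user=admin" -c "CREATE SCHEMA staging; CREATE USER etl PASSWORD 'ChangeMe1'; GRANT ALL ON SCHEMA staging TO etl;"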

* All of these steps are stored in our git repository on Bitbucket.  We could automate this using Jenkins, but in this case we really don’t need Redshift creation to be automated.  However, if we have to rebuild this cluster, we will rebuild it from these scripts.  We do not use the AWS console to make changes on the fly to systems.

Writing ETL Process:

Here were our requirements for an ETL tool:

  • It had to be maintainable as code (visual coding environments were eliminated, including our Orion Health Rhapsody interoperability engine)
  • It needed to be able to do parallelism
  • It needed to work with Docker
  • Preferably the tool/language fit our current ecosystem
  • We also wanted to handle secrets gracefully.  

So we landed on NodeJS, and specifically TypeScript, to do our ETL.  NodeJS is already our API language of choice, it is easy to run in a Docker container, and we had code available for reuse.  TypeScript adds types to JavaScript and makes writing code more bearable than plain JavaScript.

Again, this was stored in our git repository.  We wrap the entire ETL process in a Docker container.

Using AWS Batch and Docker for the production pipeline:

The final piece of the puzzle is to set up a pipeline that gets data into Redshift using the ETL process.  Here we decided to use a newer service from AWS called “Batch.” Batch has these concepts baked in:

  • Jobs represent a unit of work, which are submitted to Job Queues, where they reside, and are prioritized, until they are able to be attached to a compute resource.
  • Job Definitions specify how Jobs are to be run. While each job must reference a Job Definition, many parameters can be overridden, including vCPU, memory, Mount points and container properties.
  • Job Queues store Jobs until they are ready to run. Jobs may wait in the queue while dependent Jobs are being executed or waiting for system resources to be provisioned.
  • Compute Environments include both Managed and Unmanaged environments. Managed compute environments enable you to describe your business requirements (instance types, min/max/desired vCPUs etc.) and AWS will launch and scale resources on your behalf. Unmanaged environments allow you to launch and manage your own resources such as containers.
  • Scheduler evaluates when, where and how to run Jobs that have been submitted to a Job Queue. Jobs run in approximately the order in which they are submitted as long as all dependencies on other jobs have been met.

To use it, we just give Batch a Docker image stored in AWS ECR (this started an internal debate about how we use ECR vs. the CU DTR vs. Docker Hub; something for another day).
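
Getting the image into ECR is only a few commands. A sketch, with a made-up account ID and repository name (and using the newer get-login-password flow rather than whatever the CLI offered at the time):

$ aws ecr get-login-password | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
$ docker build -t vet-dw-etl .
$ docker tag vet-dw-etl:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/vet-dw-etl:latest
$ docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/vet-dw-etl:latest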

To kick off a Batch job from the AWS CLI for a one-time run:

$ aws batch submit-job --job-definition vet-dw-etl-sb --job-queue normal --job-name vet-dw-etl-sb-2016-01-10

To create the Batch compute environment, job queue, and job definition on AWS:

$ aws batch create-compute-environment --cli-input-json file://compute-environment-normal.json
$ aws batch create-job-queue --cli-input-json file://job-queue-normal.json
$ aws batch register-job-definition --cli-input-json file://vet-dw-etl-sb-job-definition.json

The job definition JSON references our Docker image in Amazon’s container registry; that image contains our ETL code.

For now, we run this nightly, but we have the option of running it at any interval.  Our ETL process takes about 40 minutes for 30 million rows.
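
One sketch of how a nightly trigger can be wired up without a server is a CloudWatch Events schedule that submits the Batch job (the rule name, ARNs, and cron expression below are placeholders, and a plain cron job calling submit-job works just as well):

$ aws events put-rule --name vet-dw-etl-nightly --schedule-expression "cron(0 6 * * ? *)"
$ aws events put-targets --rule vet-dw-etl-nightly --targets '[{"Id":"1","Arn":"arn:aws:batch:us-east-1:123456789012:job-queue/normal","RoleArn":"arn:aws:iam::123456789012:role/events-invoke-batch","BatchParameters":{"JobDefinition":"vet-dw-etl-sb","JobName":"vet-dw-etl-nightly"}}]'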

**If anyone is interested in seeing the JSON files we use for creating Batch jobs or our Redshift environment, please let us know.  Ditto for any code.