Tuesday, March 15, 2016

Jenkins setup for the Astrocompute project

MonetDB is an open-source relational database management system, developed by the Database Architectures group of CWI, the Dutch national research center for mathematics and computer science. It is used for database research, but it is also widely used in production for a broad range of applications. One of them is astronomy. The fact that MonetDB is a column store makes it very well suited to the large analytical workloads of modern astronomy. So it is not surprising that MonetDB was chosen to be part of the software pipeline of the LOFAR radio telescope.

But work on the successor of LOFAR has already started. The Square Kilometre Array (SKA) telescope is going to be built in South Africa and Australia, and when it becomes operational it is expected to create more data than the current internet produces in a year. To make this possible, a lot of new software is needed. The SKA organization is working on this, in cooperation with AWS. Last year, a grants program was announced. Since we were already working on similar problems, we thought that we could make a contribution. Fortunately our proposal was accepted.

Our idea was to work on the IT infrastructure around the astronomical software. The software that will be used to process the data generated by the telescope is not one single program, but a large collection of programs that all depend on different software components, provided by many different developers and organizations. And all of these programs and components are under constant development. So there is a real need to test every component, not only in isolation, but also as part of the total software infrastructure. The amount of data that has to be processed is so large that you cannot afford to halt the data processing because one component failed, for example because it was never tested with the version of a dependency that is used in the production environment.

A little bit of information about our project can be found on our website. Integration testing is not a new concept in IT; there are a number of products available to set up a continuous integration infrastructure. We chose to work with Jenkins, one of the most widely used tools available. We have been running a continuous integration setup on AWS for several months now, testing hundreds of commits to several source code repositories. But for testing the software of a project at the scale of SKA, a much more complicated infrastructure is needed.

In most cases, the CI infrastructure is reasonably static. For example, MonetDB is tested every night, covering all relevant branches on several different operating systems and a selection of hardware. The infrastructure for a project like SKA is expected to be much more dynamic. It is likely that the amount of testing would change frequently, simply because it would include software from a lot of different sources. So ideally you would automatically add or remove testing machines, based on the amount of changes that have to be tested. Running this in a cloud environment, such as Amazon Web Services, makes a lot of sense.

Typically you would have one Jenkins master machine and a number of Jenkins slave machines. The master monitors the source code repositories for changes and then schedules the actual tests to run on one of the available slave machines. But before a machine can be used as a slave to run Jenkins jobs, it has to be registered on the Jenkins master and it needs to be configured. For example, to test compilation of MonetDB, you need a number of packages to be installed: a C compiler and the relevant development libraries. And because MonetDB supports several client libraries, you would also need Python, Ruby and some other languages. So in order to support automatic scaling of the Jenkins slaves, you need to automate this process, otherwise every change in the number of slaves would require human intervention.

To configure machines automatically, you could write your own custom shell scripts. But nowadays there are several solutions available for this problem, most notably Puppet and Chef. These tools provide everything you need to do the complex configuration required for the Jenkins infrastructure. In order to set up a complete Jenkins environment on AWS, you would like to use standard services that integrate easily with one of these configuration languages. Luckily such a service exists: AWS OpsWorks.
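
For example, the build and test dependencies mentioned above could be installed on a slave with a Chef recipe like the sketch below. The package names are only an illustration; the real list depends on the operating system and on which MonetDB features and client libraries are being tested:

  # Hypothetical example: install build and test dependencies on a Jenkins slave.
  # The actual package list depends on the platform and the MonetDB configuration.
  %w(gcc bison openssl-devel libxml2-devel pcre-devel python ruby).each do |pkg|
    package pkg do
      action :install
    end
  end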

Although AWS OpsWorks was designed for hosting websites, it can also be used for other purposes. It has the option of grouping similar instances together and applying different settings to each of these groups, called layers. And it supports running configuration scripts, using Chef, at several steps during the lifetime of an instance. The properties of each instance are exposed in a Chef object, called a data bag. The configuration files, together called a Chef cookbook, are downloaded to each instance from an S3 bucket. And for each lifecycle event, such as starting or destroying an instance, you can select the configuration file to run; in Chef terminology this is called a recipe.
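
For example, a recipe can inspect these data bags to find out on which instance it is running and which layers exist in the stack. The snippet below is only a sketch of such a lookup, using the data bag names that OpsWorks exposes to Chef 12 recipes:

  # Sketch: query the OpsWorks data bags from within a recipe
  self_instance = search("aws_opsworks_instance", "self:true").first
  stack_layers  = search("aws_opsworks_layer")

  Chef::Log.info("Running on #{self_instance['hostname']} (#{self_instance['os']})")
  stack_layers.each do |layer|
    Chef::Log.info("Stack contains layer: #{layer['name']}")
  end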

By running the correct recipe on both the Jenkins master and the available Jenkins slaves, we can set up the system in such a way that the Jenkins master is allowed to log in to all Jenkins slave machines using SSH. Below is a snippet from the Chef recipe that configures the SSH settings on a Jenkins slave:

  # Get the settings of the instance we are running on from the OpsWorks data bags
  instance = search("aws_opsworks_instance", "self:true").first

  # Read the SSH public key of the Jenkins master, shipped as a file in the cookbook
  ssh_public_key_file = File.join(Chef::Config[:file_cache_path],
      'cookbooks/aws_chef_jenkins/files/default/id_rsa_test.pub')
  ssh_public_key = File.read(ssh_public_key_file)

  # On Amazon Linux, authorize that key for the default ec2-user account
  if instance['os'] == 'Amazon Linux 2015.09'
    append_if_no_line "ec2-user_key" do
      path "/home/ec2-user/.ssh/authorized_keys"
      line ssh_public_key
    end
  end

In the first line, we get an object with the settings of the current machine from the available data bags. Then an SSH public key is loaded into a variable from a file. This public key is then appended to the authorized_keys file of the user that the Jenkins master will log in as.

On the Jenkins master, we need to add credentials to the Jenkins configuration. For this we use the Jenkins community cookbook, available at https://supermarket.chef.io/cookbooks/jenkins. This allows us to automatically add private keys from our custom cookbook. The secretsobject variable is loaded from a file that contains private information, such as passwords. This file is not checked in to the source code repository, but added to the cookbook prior to uploading it to S3.

  # For every SSH key listed in the secrets file, read the private key from the
  # cookbook and register it as a credential on the Jenkins master
  secretsobject['jenkins_users']['ssh_keys'].each do |sshkey|
    ssh_private_key_file = File.join(Chef::Config[:file_cache_path],
      'cookbooks/aws_chef_jenkins/files/default', sshkey['keyname'])
    ssh_private_key = File.read(ssh_private_key_file)

    jenkins_private_key_credentials sshkey['name'] do
      id          sshkey['id']
      username    sshkey['username']
      description "account #{sshkey['username']} on slave"
      private_key ssh_private_key
    end
  end

When the configuration is finished, we can view the list of available credentials in the Jenkins UI.

The next step is to register the Jenkins slaves on the master machine. For this, the Jenkins cookbook contains the relevant resource definition:

  # Register an SSH slave node on the Jenkins master, reachable at the private IP
  # of the slave instance taken from the OpsWorks data bags; the credentials id
  # refers to the credential added above
  jenkins_ssh_slave 'amzn-slave' do
    description 'Run test suites on amazon linux'
    remote_fs   '/home/ec2-user'
    labels      ['amazon-linux-2015.09']
    host        instance['private_ip']
    user        'ec2-user'
    credentials '5cf1b49d-e886-421e-910d-01a97eba4ce1'
  end

After the entire Chef recipe has finished, we have a fully functional Jenkins setup.

A more complete example of a cookbook to set up Jenkins is available on GitHub. Of course we do not want to set up the OpsWorks stack manually. To do this automatically we use the AWS CloudFormation service. Here we define all the parts of the OpsWorks stack and how they relate to each other. The CloudFormation service will then make sure that everything is created in the right order.
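
To give an idea of what that looks like, the fragment below sketches how an OpsWorks stack and one custom layer could be declared in a CloudFormation template. It is heavily abbreviated, and the resource names, parameters, bucket URL and recipe name are placeholders rather than the actual template:

  {
    "Resources": {
      "JenkinsStack": {
        "Type": "AWS::OpsWorks::Stack",
        "Properties": {
          "Name": "jenkins-ci",
          "ServiceRoleArn": { "Ref": "ServiceRoleArn" },
          "DefaultInstanceProfileArn": { "Ref": "InstanceProfileArn" },
          "ConfigurationManager": { "Name": "Chef", "Version": "12" },
          "UseCustomCookbooks": true,
          "CustomCookbooksSource": {
            "Type": "s3",
            "Url": "https://s3.amazonaws.com/example-bucket/aws_chef_jenkins.tar.gz"
          }
        }
      },
      "JenkinsMasterLayer": {
        "Type": "AWS::OpsWorks::Layer",
        "Properties": {
          "StackId": { "Ref": "JenkinsStack" },
          "Name": "jenkins-master",
          "Shortname": "jenkins-master",
          "Type": "custom",
          "EnableAutoHealing": true,
          "AutoAssignElasticIps": false,
          "AutoAssignPublicIps": true,
          "CustomRecipes": { "Setup": [ "aws_chef_jenkins::master" ] }
        }
      }
    }
  }

Because the layer references the stack with a Ref, CloudFormation knows that the stack has to be created before the layer.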

AWS CloudFormation really promotes the idea of infrastructure-as-code. By provisioning our infrastructure from source files, we gain several advantages. One of them is versioning: by checking our files into a source code repository, we can see how our infrastructure has changed over time. This is very useful when solving problems in a complex setup with lots of components. It also makes it possible to set up several identical stacks, where you would use one for production and another for testing. You can then develop your infrastructure the same way as you develop your code.

An example of such a CloudFormation template is available on GitHub. This is not a fully functional continuous integration pipeline for something as complex as the SKA software infrastructure. But it aims to illustrate the sort of problems such a system would encounter and to give an idea of how to solve these challenges. The SKA project is an extraordinary scientific endeavour that requires the best software engineering to produce exa-scale astronomical data. With this work we hope to contribute to reaching that goal.