Custom EC2 hostnames and DNS entries

I’ve been doing some work with EC2 recently. I wanted to be able to bring up a server using Ansible, pre-configured with a hostname and a valid, working FQDN.

There are a few complexities to this. Unless you’re using Elastic IP addresses, EC2 instances get a new public IP address when they’re stopped and started, so I needed to ensure that the FQDN’s DNS entry is updated whenever the host’s address changes.

A common way of doing this is to bake out an AMI, pre-configured with a script that runs on boot to talk to the DNS server and create/update the entry. But you still need a way of passing the desired hostname in when you launch the instance for the first time, and you end up with your security keys baked into an AMI, making it difficult to rotate them. And custom AMIs are fiddly – I’d prefer to use the official ones from Ubuntu or Amazon so I don’t have to bake out a new AMI on every OS point release.

I ended up with an approach that uses a combination of cloud-init, an IAM instance role and Route 53: cloud-init sets the hostname and writes a boot script, which grabs temporary credentials and sets the DNS entry.

EC2 supports a thing called IAM instance roles, which lets an instance grab temporary credentials for a role, so it can access AWS resources without hardcoded access tokens. It does this by fetching credentials from an internal HTTP metadata service, but if you use awscli or the other official libraries, they’ll do this for you, unless you provide credentials explicitly.
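
If you’re curious what that looks like under the hood, you can ask the metadata service for the credentials yourself from the instance. This is just an illustrative sketch – awscli and the official SDKs do the equivalent for you:

# List the role attached to this instance, then fetch its temporary
# credentials (AccessKeyId, SecretAccessKey, Token, Expiration).
ROLE=$(curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/)
curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/${ROLE}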

In this case, we grant just enough permission to be able to update a specific zone on Route 53. I chose to put all my server DNS entries in their own zone to isolate them, but you don’t have to do that. I made a role called ‘set-internal-dns’ and gave it a policy document like this:

{
  "Statement":[
    {
      "Action":[
        "route53:ChangeResourceRecordSets",
        "route53:GetHostedZone",
        "route53:ListResourceRecordSets"
      ],
      "Effect":"Allow",
      "Resource":[
        "arn:aws:route53:::hostedzone/<ZONE_ID_HERE>"
      ]
    }
  ]
}
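
For reference, setting the role and its instance profile up with awscli looks roughly like this. It’s a sketch rather than exactly what I ran: route53-policy.json is the document above, ec2-trust.json is a standard trust policy allowing ec2.amazonaws.com to assume the role, and the policy name is arbitrary.

# Create the role with a trust policy that lets EC2 assume it, then attach
# the Route 53 document above as an inline policy.
aws iam create-role --role-name set-internal-dns \
  --assume-role-policy-document file://ec2-trust.json
aws iam put-role-policy --role-name set-internal-dns \
  --policy-name route53-upsert --policy-document file://route53-policy.json

# The instance profile is what actually gets attached to the EC2 instance.
aws iam create-instance-profile --instance-profile-name set-internal-dns
aws iam add-role-to-instance-profile --instance-profile-name set-internal-dns \
  --role-name set-internal-dns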

Next, I wrote an Ansible task to boot a machine with that role attached, passing a user-data string containing the cloud-init config.

- name: Launch instance
  ec2:
    keypair: "{{ keypair }}"
    region: "{{ region }}"
    zone: "{{ az }}"
    image: "{{ image }}"
    instance_type: "m3.medium"
    vpc_subnet_id: "{{ vpc_subnet_id }}"
    assign_public_ip: true
    group: ['ssh_external']
    exact_count: 1
    count_tag:
      Name: "{{ item.hostname }}"
    instance_tags:
      Name: "{{ item.hostname }}"
      role: "{{ item.role }}"
      environment: "{{ item.environment }}"
    volumes:
      - device_name: /dev/sda1
        volume_size: 30
        device_type: gp2
        delete_on_termination: true
    wait: true
    instance_profile_name: set-internal-dns
    user_data: "{{ lookup('template', 'templates/user_data_route53_dns.yml.j2') }}"
  with_items:
    - hostname: "computer1"
      fqdn: "computer1.{{ domain }}"
      role: "computation"
      environment: "production"

Ansible expects the user_data property to be a string, so we render the template into one with the template lookup plugin.

cloud-init has the lowest documentation-quality-to-software-usefulness ratio I think I’ve ever seen. In combination with EC2 (and presumably other cloud services?), it allows you to pass in configuration settings, packages to install, files to upload and much more, all through a handy YAML file. But all the useful documentation about the supported settings is completely hidden away or just placeholder text, except for a huge example config.

Our user_data_route53_dns.yml.j2 template file is below. If you’re not using Ansible, the bits in the curly brackets are templated variables being set by the task above.

#cloud-config

# Set the hostname and FQDN
hostname: "{{ item.hostname }}"
fqdn: "{{ item.fqdn }}"
# Set our hostname in /etc/hosts too
manage_etc_hosts: true

# Our script below depends on this:
packages:
  - awscli

# Write a script that executes on every boot and sets a DNS entry pointing to
# this instance. This requires the instance having an appropriate IAM role set,
# so it has permission to perform the changes to Route53.
write_files:
  - content: |
      #!/bin/sh
      FQDN=`hostname -f`
      ZONE_ID="{{ zone_id }}"
      TTL=300
      SELF_META_URL="http://169.254.169.254/latest/meta-data"
      PUBLIC_DNS=$(curl ${SELF_META_URL}/public-hostname 2>/dev/null)

      cat << EOT > /tmp/aws_r53_batch.json
      {
        "Comment": "Assign AWS Public DNS as a CNAME of hostname",
        "Changes": [
          {
            "Action": "UPSERT",
            "ResourceRecordSet": {
              "Name": "${FQDN}.",
              "Type": "CNAME",
              "TTL": ${TTL},
              "ResourceRecords": [
                {
                  "Value": "${PUBLIC_DNS}"
                }
              ]
            }
          }
        ]
      }
      EOT

      aws route53 change-resource-record-sets --hosted-zone-id ${ZONE_ID} --change-batch file:///tmp/aws_r53_batch.json
      rm -f /tmp/aws_r53_batch.json
    path: /var/lib/cloud/scripts/per-boot/set_route53_dns.sh
    permissions: '0755'

We’re installing the script into cloud-init’s per-boot scripts directory rather than anywhere else because cloud-init will run it on the first boot, once it has been written and awscli installed. If we put it in rc.d, for example, we’d still have to tell cloud-init to go and run it on the first boot, so this is just one less thing to mess up. I’m already feeling pretty bad about writing JSON in a shell script in a YAML file.

When you boot the instance you should be able to tail /var/log/cloud-init-output.log and see confirmation from awscli that the DNS change is pending. It can take 10-60 seconds to become available.
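
If you want to confirm the change has gone through, you can poll Route 53 for the change status, or just resolve the record. A quick sketch – the change ID comes from the output in the cloud-init log, and computer1.example.com stands in for whatever FQDN you set:

# change-resource-record-sets returns a ChangeInfo Id; poll it until the
# status flips from PENDING to INSYNC.
aws route53 get-change --id <CHANGE_ID_HERE>

# Or resolve the record directly.
dig +short computer1.example.com CNAME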

We’re using a CNAME to the EC2 public DNS name because I still want split-horizon DNS to work – if you look that name up from inside your EC2/VPC network you’ll get the internal IP address.
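
You can see the split-horizon behaviour by resolving the public DNS name from inside and outside AWS – same name, different answers. The hostname here is made up; the real one comes from the instance metadata:

# From outside AWS this returns the public IP; from inside the same
# region/VPC it returns the private IP.
dig +short ec2-203-0-113-10.eu-west-1.compute.amazonaws.com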

Computers.

Surprises, small and big

OK, look, I know think pieces about Apple belong on Medium, but just watch the video of the iPhone launch in 2007. There’s the bit where Steve demos ‘slide to unlock’ (15:14) and an audible intake of breath goes through the audience. And then again, later, with ‘pinch to zoom’ (33:22).

(It’s also a really funny presentation! I forgot how light-hearted they used to be.)

The first iPhone astounded me, because it felt like something from 2-3 years in the future. At the time, multitouch screens were reserved for tabletop projection surfaces (Microsoft’s original Surface project, for example), and while it seemed clear that it was going to be an important form of interaction in the future, every other device on the market had a physical keyboard or a stylus.

I didn’t even know multitouch technology was capable of being shrunk to that size, for that price. Apple had managed not only to make it happen, but to orchestrate its supply chain into producing millions of them, leaving the rest of the phone industry dumbfounded. And thanks to its exclusivity contracts, it would take years before anyone truly caught up. Incredible.

Satellite View of GCHQ Bude

It’s summer 2013, and the Snowden leaks are in full swing. All the geeks will tell you that they expected it all along, but they’re lying. No-one expected it to be that fierce, that pervasive, that explicit.

The culmination of it, for me, was when we learnt that GCHQ intercepts and stores all transatlantic network communications through their intercept station in Bude, Cornwall.

The story isn’t clear at first. All of it? That can’t possibly be right. But yes, it turns out to be 21 petabytes a day, held in a rolling three-day buffer – around 60+ petabytes at any one time.

Storing it I can fathom – just about. You need lots of space, lots of disks, lots of money, and I guess they get a volume discount. But searching it in real-time too? Huh.

This is how people make things now

It’s always good seeing behind the scenes of stuff you love. More so if there’s good engineering involved, so I enjoyed this pair of videos from Spotify. Lots of overlap with Creativity, Inc. too, unsurprisingly. This is how people make things now.

(part 1, part 2)

PaperLater

This week we launched a thing I’ve been working on for the last few months, alongside a brilliant little team of colleagues and freelancers.

PaperLater lets you save the good bits of the web to print, so you can enjoy them away from the screen. If you’ve used something like Instapaper, Pocket or Readability before, it’s a bit like that, but in print.

It’s a really nice scale of product to have built. We solved some gnarly technical problems (automated layout/typography, single copy print production, content extraction), but it’s distilled into what’s really quite a small web app. There’s only a handful of pages, and we’ve tried to make the whole experience feel really light and easy. Time will tell us whether we’ve got that right, but I’m proud of that.

It’s nice to realise that we’re getting better at launching things. Little things make it easier: knowing to get nice photos shot before launch; having a customer support system to bolt into; having an existing framework for legal documents; and so on.

I’m also getting comfortable with patterns and tools that reduce the number of things I need to think about, and let me concentrate on building the thing at hand. I’m never changing my text editor again, for example. That’s a good feeling, and it’s only taken a decade.

I think of PaperLater a bit like podcasts. I don’t really listen to podcasts, not because I don’t like them, but because there just isn’t a podcast-shaped hole in my life. But there is a PaperLater-shaped hole, and we built it because our hunch is there’s one in other people’s lives too. If there’s one in yours, I hope you enjoy it.

World-building

I liked this post by Sam Stokes about ‘What Programming is Like’.

I usually describe programming as like world-building. You imagine a cartoon world with its own laws of physics. You build the objects that inhabit that world, and rules that govern it. And then you get real people to come and play in it. You watch what they do, and see whether the right things happen. And then you fiddle with it all, and go round again.

Actually, that sounds really tedious.

Tracing errors in client-side JavaScript applications

ARTHR, Newspaper Club’s online layout tool, is what the cool kids call a ‘rich web application’. It’s a Backbone.js application that dynamically renders templated JavaScript views.

As you should do with any grown-up piece of software, we put in an error logging system, so if someone manages to trigger an error, we get notified and can try to fix it.

We’re using the window.onerror callback, which takes three arguments: the error message, the URL of the script that triggered it, and the line number. On some browsers there are two more parameters: the column (useful for heavily minified code) and the error object, but that’s a fairly recent addition to the spec and not widely supported.

It quickly became obvious that some people have pretty odd browser setups. We spent a while trying to track down errors which turned out not to be in our code, or in any code on the page at all. These turned out to come from two sources: JS injected by HTTP proxies, and JS injected by browser extensions.

We fixed the problems with JS injected by HTTP proxies by running the whole application over SSL/TLS. The performance impact is negligible and a whole class of errors disappeared immediately.

And we fixed the problems caused by browser extensions by ignoring all script URLs outside of our domain and that of our CDN. They’ll still cause errors that’ll be visible in the console, but our code won’t trap and log them.

We ended up with a window.onerror function that looks a bit like this:

<script>
  var errorSent = false;

  window.onerror = function (message, file, line, column, error) {
    // If the error has occurred in a file beyond our control, we don't
    // handle the error. That's up to your crazy browser extensions.
    // This isn't a particularly robust check.
    var domainRegexp = new RegExp("^https?://[^.]+\\.(newspaperclub|cloudfront)\\.");
    if (!domainRegexp.test(file)) {
      return false;
    }

    if (!errorSent) {
      setTimeout(function () {
        var data, stack = '()@' + file + ':' + line;

        if (ARTHR && ARTHR.log) {
          // Add the column of the exception to the stack, if available.
          if (column) {
            stack = stack + '#' + column;
          }

          // This is what we log.
          data = {
            message: message,
            stack: stack
          };

          // If the browser supports the new error parameter, try to unpack
          // the stack and message and pass it along to the logged data too.
          if (error) {
            if (typeof error === "string") {
              data.error = error;
            } else if (error instanceof Error) {
              data.error = error.name;
              data.error_message = error.message;
              data.error_stack = error.stack;
            } else {
              // Fallback to just logging whatever we've got.
              data.error = error;
            }
          }

          ARTHR.log.error(data);
        }
      }, 10);

      setTimeout(function () {
        if (ARTHR && ARTHR.GlobalNotificationView) {
          var notification = new ARTHR.GlobalNotificationView({
            title: "Sorry, Something Went Wrong",
            description: "<p>We're sorry, something has gone wrong with ARTHR.</p><p>The Newspaper Club team has been notified, but if the problem persists, please <a href=\"http://www.newspaperclub.com/about/contact\">contact us directly</a>, and we'll try and work out what's going wrong. Otherwise, please refresh the page and ARTHR will reload with the latest changes.</p><p><a href='#' class='button' onclick='javascript:window.location.reload(true)'>Restart ARTHR</a></p>",
            noticeType: "error",
            fullScreen: true,
            disableKeyboardClose: true,
            permanent: true
          });

          notification.render().display();
        }
      }, 10);

      errorSent = true;
    }

    return true;
  };
</script>

There are a few things going on here. Firstly, we only report the first exception the browser throws, so we don’t get swamped. Secondly, we’re calling our own ARTHR.log function, which switches between local and remote logging depending on the environment. In production it logs to the server over AJAX, so an entry appears in our logging system and, in some situations, an email is sent out to the team.

The code that displays the error to the user, and the code that sends the log message, are both executed using setTimeout with a short interval (10ms). This makes them run asynchronously, and ensures that the failure of one to execute (due to a bug or an odd situation) doesn’t prevent the other from running.

We still get a class of errors that are difficult to trace. Dumping the entire state of the application might be helpful here, including a snapshot of the DOM, all the bound events and so on. That might be straightforward, but I’ve not looked into it.

It’s often better to catch exceptions deeper in the code, rather than letting window.onerror handle them, but for unknown unknowns this is a useful tool in the debugging arsenal. If you’ve got your own version of this, or there’s a much smarter way of handling JS error tracking, I’d love to hear it.

An Act of Rebellion

I choose to believe that somewhere out there there’s a freelance book designer on a mission. Their goal is to fight the homogeneity of modern cover design, and roll back the oppression forced upon them by the Big Publishers.

But you can’t slip an act of rebellion past just any old client. You have to pick your moment and find one so incompetent that they’ve never looked at a bookshelf before.

Designs of the Year Covers

From Ben.

(I had a look at my bookshelf at home, and the only two books that do this are French and Spanish. Is this a foreign thing?)