100% Integration Reliability

This topic discusses reasons you might not be able to connect to Zencoder and how to ensure reliable integration.

Overview

Zencoder is an essential software dependency for most of our customers. And while we aim for 100% uptime, there may be times when you can't connect to Zencoder:

  • Our service might be affected by problems at an upstream provider (e.g. Amazon Web Services)
  • We occasionally need to perform system maintenance that requires temporary downtime
  • You have exceeded your API rate limit
  • etc.

When this happens and Zencoder is down, your application will typically get a '503 Service Unavailable' response from Zencoder, but you could get a different error (like a 500). If you have exceeded your API rate limit, you will get a '403 Rate Limit Exceeded' response.

The good news: since video encoding is an asynchronous process, you can build your application so that it never experiences downtime or problems related to our availability. If you do this, the worst-case scenario is that your jobs take a bit longer, but no errors occur. We highly recommend that you do this.

To put it more strongly, if you care about reliability, you should follow this approach to integration - for Zencoder, or for any critical API that you integrate with.

Reliable app integration

  1. Include a Secondary URL as a backup in case upload to your primary location fails.
  2. If you get a non-successful response code from Zencoder - basically, something other than a 200 or 201 - don't fail the job. A response code of 503 doesn't mean that your video can't be processed. It just means that Zencoder is temporarily unavailable.
  3. If you get a connection error when trying to connect to Zencoder, do the same thing.
  4. Similarly, wrap your API requests in a timeout. We recommend a 30 second timeout; Zencoder usually responds in less than a second, so 30 seconds is usually plenty of time.
  5. In all three of these cases - if you get a non-successful response code, can't connect, or the API request times out - flag the job as 'pending'.
  6. Periodically, resubmit any jobs in the 'pending' state. You could use cron to do this every minute, for instance.

Once the jobs are resubmitted, everything behaves as normal. This way, a failed job submission only makes the job take a little longer rather than causing trouble for your application or your users.
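
As a rough illustration of steps 2 through 5, here is a minimal sketch that uses only the Ruby standard library. The helper name state_after_submission and the RETRYABLE_ERRORS constant are hypothetical, not part of any Zencoder library; the block you pass in is assumed to perform the API request and return the HTTP status code.

```ruby
require "timeout"
require "socket"

# Timeouts plus the connection-level failures a raw HTTP client may raise.
RETRYABLE_ERRORS = [
  Timeout::Error, SocketError,
  Errno::ECONNRESET, Errno::ETIMEDOUT, Errno::ECONNREFUSED, Errno::EHOSTDOWN
].freeze

# Hypothetical helper: runs the block (assumed to make the API request and
# return the HTTP status code) and returns the state the job should move to.
def state_after_submission(timeout_seconds: 30)
  code = Timeout.timeout(timeout_seconds) { yield }
  # Anything other than 200 or 201 leaves the job waiting for a retry.
  [200, 201].include?(code.to_i) ? "transcoding" : "pending"
rescue *RETRYABLE_ERRORS
  # Could not connect, or the request timed out: retry later.
  "pending"
end
```

With this shape, a 503, a refused connection, and a timeout all end in the same place: the job is marked 'pending' and picked up by the next periodic resubmission.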

Pseudocode

OK, so this isn't pseudocode - it's Ruby. But Ruby is pretty easy to read.

  1. Imagine a Videos table that includes these columns. (It will obviously have more, including columns to store a Zencoder job ID and a Zencoder output file ID.)
    create_table :videos do |t|
      t.string  :state
      t.integer :lock_version
      t.index   :state
    end
  2. A Video should include a state machine with the following states:
    • pending (not yet submitted to Zencoder)
    • submitting (currently submitting to Zencoder)
    • transcoding (successfully submitted to Zencoder)
    • finished (Zencoder finished transcoding, and the job is done)
    • failed (Zencoder was unable to transcode the video)
  3. When a new video is ingested, save the video in the 'submitting' state and trigger a background job to submit the video to Zencoder.
    # got a new video!
    video = Video.new(params)
    video.state = "submitting"
    video.save!
    submit_to_zencoder(video)

    You really should background the submit_to_zencoder method. In Ruby, using DelayedJob, this might look like this:

    delay.submit_to_zencoder(video)

    But we'll stick with our submit_to_zencoder(video) method for example purposes.

  4. The submit_to_zencoder function looks something like this. This should be run asynchronously, in the background.
    def submit_to_zencoder(video)
      begin
        # `attributes` stands for your job settings (input URL, outputs, etc.).
        # 30_000 corresponds to the recommended 30 second timeout.
        response = Zencoder::Job.create(attributes, :timeout => 30_000)

        # to_i makes the comparison work whether the code is an integer
        # or a string.
        if response.code.to_i == 201
          video.state = "transcoding"
        else
          video.state = "pending"
        end

        video.save!

      # Rescue any connection error. Our plugin abstracts these as
      # Zencoder::HTTPError.
      #
      # If you're not using the Zencoder plugin, this includes things
      # like Errno::ECONNRESET, Errno::ETIMEDOUT, Errno::ECONNREFUSED,
      # Errno::EHOSTDOWN, and SocketError.
      rescue Timeout::Error, Zencoder::HTTPError
        video.state = "pending"
        video.save!
      end
    end
  5. Every so often - e.g. every minute - try to resubmit jobs that are in the 'pending' state.
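    def resubmit_pending_jobs
      Video.where(:state => "pending").find_each do |video|
        begin
          # Claim the video first; save! raises if another process
          # updated the record since it was loaded.
          video.state = "submitting"
          video.save!

          submit_to_zencoder(video)
        rescue ActiveRecord::StaleObjectError
          # Another process already claimed this video; skip it.
        end
      end
    end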
    

    By adding a 'lock_version' column to the videos table, we introduce optimistic locking. This means that if the record is updated between the Video.where query and video.save!, the save raises ActiveRecord::StaleObjectError and the job is not submitted to Zencoder. This prevents the job from being submitted to Zencoder twice by accident. You could use pessimistic locking, database-level locking, or some other locking method to accomplish the same thing.
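
    To see why a version column prevents double submission, here is a small in-memory sketch of the idea. It does not use ActiveRecord: VideoStore and StaleRecordError are hypothetical stand-ins for the videos table and ActiveRecord::StaleObjectError, written only to show the version check.

```ruby
# Stand-in for ActiveRecord::StaleObjectError in this sketch.
class StaleRecordError < StandardError; end

# Hypothetical in-memory store that mimics a lock_version column.
class VideoStore
  Row = Struct.new(:state, :lock_version)

  def initialize
    @rows = {}
    @mutex = Mutex.new
  end

  def insert(id, state)
    @rows[id] = Row.new(state, 0)
  end

  # Returns a copy, so each caller holds its own snapshot of the row.
  def find(id)
    @rows.fetch(id).dup
  end

  # Writes only if the snapshot's lock_version still matches the stored
  # one, then bumps the version; otherwise the snapshot is stale.
  def save!(id, snapshot)
    @mutex.synchronize do
      current = @rows.fetch(id)
      raise StaleRecordError if current.lock_version != snapshot.lock_version
      snapshot.lock_version += 1
      @rows[id] = snapshot.dup
    end
  end
end

store = VideoStore.new
store.insert(1, "pending")

# Two workers read the same pending video before either one saves.
worker_a = store.find(1)
worker_b = store.find(1)

submitted = []
[[:a, worker_a], [:b, worker_b]].each do |name, video|
  begin
    video.state = "submitting"
    store.save!(1, video)
    submitted << name # only reached by the worker whose save succeeded
  rescue StaleRecordError
    # Another worker claimed this video first; skip it.
  end
end
```

    Both workers try to claim the video, but the second save is rejected because the stored version has moved on, so only one of them would go on to submit the job.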

    It's that easy…

    All things considered, this is a pretty simple approach to ensuring 100% integration reliability between Zencoder and your application. It's a few more steps than just naively submitting a job; but it ensures that no matter what happens - whether it's an occasional timeout, or unexpected downtime at Zencoder, or scheduled maintenance - your app runs reliably.