29 Sep 2012

Upstart: Job getting stuck in the start/killed state

We’re using upstart to manage the processes running on our machines, and since the haproxy package only came packaged with an init.d script, we wanted to upstartify it.

When defining an upstart script you need to decide on an expect stanza, which tells upstart whether, and how many times, the process you’re launching is going to fork.

If you do not specify the expect stanza, Upstart will track the life cycle of the first PID that it executes in the exec or script stanzas. However, most Unix services will "daemonize", meaning that they will create a new process (using fork(2)) which is a child of the initial process. Often services will "double fork" to ensure they have no association whatsoever with the initial process.
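To make the fork counts concrete, here is a minimal Ruby sketch of a process that forks zero, one or two times before settling down. The helper names `launch` and `descend` are made up for illustration and have nothing to do with upstart itself. It only shows that the PID of the first process and the PID of the process that ends up doing the work differ as soon as there is a fork.

```ruby
# Illustrative sketch only: `launch` and `descend` are made-up helper names.

# Fork `remaining` more times, then report the PID of the last process.
def descend(remaining, wio)
  if remaining.zero?
    wio.write(Process.pid.to_s) # this is the process doing the real work
  else
    fork do
      descend(remaining - 1, wio)
      exit!(0)
    end
    # The parent returns straight away, as a daemonizing service would.
  end
  wio.close
end

# Start a pretend service that forks `forks` times after its first process.
# Returns [first_pid, final_pid].
def launch(forks)
  rio, wio = IO.pipe
  first_pid = fork do # stands in for the first PID that upstart executes
    rio.close
    descend(forks, wio)
    exit!(0)
  end
  wio.close
  final_pid = Integer(rio.read) # blocks until every writer has closed
  rio.close
  Process.wait(first_pid)
  [first_pid, final_pid]
end

[0, 1, 2].each do |forks|
  first_pid, final_pid = launch(forks)
  puts "forks=#{forks} first=#{first_pid} final=#{final_pid}"
end
```

With zero forks the two PIDs are the same. With one or two forks they differ, which is why upstart needs to be told how many forks to follow.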

There is a table in the upstart cookbook, under the 'Implications of Misspecifying expect' section, which explains what will happen if we specify this incorrectly:

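Expect Stanza Behaviour (rows: number of forks, columns: specification of expect stanza)

| Forks | no `expect`         | `expect fork`     | `expect daemon` |
|-------|---------------------|-------------------|-----------------|
| 0     | Correct             | start hangs       | start hangs     |
| 1     | Wrong pid tracked † | Correct           | start hangs     |
| 2     | Wrong pid tracked   | Wrong pid tracked | Correct         |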
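One way to read the table is as a comparison between the number of forks the process actually performs and the number the stanza tells upstart to follow. The sketch below encodes only that reading, for the nine cells shown. The names `EXPECTED_FORKS` and `expect_outcome` are hypothetical, and the rule is inferred from the table rather than taken from upstart's source.

```ruby
# Hypothetical helper that reproduces the cookbook table for 0..2 forks.
# Number of forks each stanza tells upstart to follow (inferred from the table).
EXPECTED_FORKS = { 'none' => 0, 'fork' => 1, 'daemon' => 2 }.freeze

def expect_outcome(actual_forks, stanza)
  expected = EXPECTED_FORKS.fetch(stanza)
  case actual_forks <=> expected
  when 0  then 'Correct'
  when -1 then 'start hangs'       # upstart waits for a fork that never comes
  else         'Wrong pid tracked' # the process forked more than upstart followed
  end
end

# A process that forks once, configured with `expect daemon`:
puts expect_outcome(1, 'daemon') # prints "start hangs"
```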

When we were defining our script we went for expect daemon instead of expect fork. We had also mistyped the arguments to the haproxy script, which meant it failed to start, and the job ended up in the start/killed state.

From what we could tell, upstart had a handle on a PID which didn’t actually exist, and when we ran stop haproxy the command seemed to succeed but didn’t actually do anything.
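A quick way to confirm that a PID really doesn't exist is to send it signal 0, which checks for the process without delivering anything. This is a small sketch with a made-up helper name, `pid_exists?`. It assumes you already know the PID that upstart reports for the job.

```ruby
# Hypothetical helper: returns true if a process with this PID exists.
def pid_exists?(pid)
  Process.kill(0, pid) # signal 0 only performs the existence/permission check
  true
rescue Errno::ESRCH
  false                # no such process
rescue Errno::EPERM
  true                 # the process exists but belongs to another user
end

puts pid_exists?(Process.pid) # the current process certainly exists
```

If this returns false for the PID upstart is holding on to, the job is tracking a process that is no longer there.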

Phil pointed us to a neat script written by Clint Byrum which spins up and then kills loads of processes in order to exhaust the PID space until a process with the PID upstart is tracking exists and can be re-attached and killed.

It’s available on his website but that wasn’t responding for a period of time yesterday so I’ll repeat it here just in case:

#!/usr/bin/env ruby1.8

class Workaround
  def initialize target_pid
    @target_pid = target_pid

    first_child
  end

  def first_child
    pid = fork do
      Process.setsid

      rio, wio = IO.pipe

      # Keep rio open
      until second_child rio, wio
        print "\e[A"
      end
    end

    Process.wait pid
  end

  def second_child parent_rio, parent_wio
    rio, wio = IO.pipe

    pid = fork do
      rio.close
      parent_wio.close

      puts "%20.20s" % Process.pid

      if Process.pid == @target_pid
        wio << 'a'
        wio.close

        parent_rio.read
      end
    end
    wio.close

    begin
      if rio.read == 'a'
        true
      else
        Process.wait pid
        false
      end
    ensure
      rio.close
    end
  end
end

if $0 == __FILE__
  pid = ARGV.shift
  raise "USAGE: #{$0} pid" if pid.nil?
  Workaround.new Integer pid
end

We can save that into a Ruby script, run it with the PID that upstart is tracking as its argument, and the world of upstart will get back into a good place again!
