Upstart: Job getting stuck in the start/killed state
We’re using Upstart to handle the processes running on our machines, and since the haproxy package only came packaged with an init.d script we wanted to make it upstartified.
When defining an Upstart job you can include an expect stanza, which tells Upstart whether or not the process you’re launching is going to fork.
If you do not specify the expect stanza, Upstart will track the life cycle of the first PID that it executes in the exec or script stanzas. However, most Unix services will "daemonize", meaning that they will create a new process (using fork(2)) which is a child of the initial process. Often services will "double fork" to ensure they have no association whatsoever with the initial process.
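To make the "double fork" concrete, here is a minimal Ruby sketch (not taken from haproxy or Upstart) in which a launched process forks twice and reports the PIDs involved. The method name `double_fork_pids` is hypothetical and exists only for illustration. The point is that the long-lived process ends up with a different PID from the one that was first executed, which is why Upstart needs to be told how many forks to follow.

```ruby
# Illustrative sketch: simulate a service that double forks and report
# the PID at each stage. `double_fork_pids` is a made-up helper name.
def double_fork_pids
  rio, wio = IO.pipe
  launched_pid = fork do               # stands in for the first PID Upstart executes
    rio.close
    fork do                            # first fork
      Process.setsid                   # start a new session, detached from the parent
      intermediate_pid = Process.pid
      fork do                          # second fork: the daemon would live here
        wio.puts "#{intermediate_pid} #{Process.pid}"
        wio.close
        exit! 0
      end
      wio.close
      exit! 0                          # intermediate process exits straight away
    end
    wio.close
    exit! 0                            # the launched process exits too
  end
  wio.close
  Process.wait launched_pid
  # read returns once every copy of the write end has been closed
  intermediate_pid, daemon_pid = rio.read.split.map { |s| Integer(s) }
  rio.close
  [launched_pid, intermediate_pid, daemon_pid]
end

launched_pid, intermediate_pid, daemon_pid = double_fork_pids
puts "launched: #{launched_pid}, intermediate: #{intermediate_pid}, daemon: #{daemon_pid}"
```

With no expect stanza Upstart would follow the "launched" PID here, even though the process that keeps running is the "daemon" one.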
There is a table in the Upstart Cookbook, under the 'Implications of Misspecifying expect' section, which explains what happens if we specify this incorrectly:
Specification of Expect Stanza

Forks | no expect | expect fork | expect daemon |
---|---|---|---|
0 | Correct | start hangs | start hangs |
1 | Wrong pid tracked † | Correct | start hangs |
2 | Wrong pid tracked | Wrong pid tracked | Correct |
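One way to read the table above is as a simple rule: each stanza implies a number of forks, and the outcome depends on whether the process forks exactly that many times, fewer, or more. The Ruby sketch below encodes that reading. The names `EXPECTED_FORKS` and `expect_outcome` are hypothetical, and the rule is only my interpretation of the nine cells in the table, not something the cookbook states in this form.

```ruby
# Number of forks each stanza tells Upstart to follow
# (:none means no expect stanza was given).
EXPECTED_FORKS = { :none => 0, :fork => 1, :daemon => 2 }

# Returns the table cell for a process that forks `actual_forks` times
# (0, 1 or 2) when the job is declared with `stanza`.
def expect_outcome(actual_forks, stanza)
  expected = EXPECTED_FORKS.fetch(stanza)
  if actual_forks == expected
    "Correct"
  elsif actual_forks < expected
    "start hangs"          # Upstart keeps waiting for a fork that never happens
  else
    "Wrong pid tracked"    # Upstart stops following before the final process
  end
end

# A service that forks once but is declared `expect daemon`:
puts expect_outcome(1, :daemon)
```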
When we were defining our script we went for expect daemon instead of expect fork, and we had also mistyped the arguments to the haproxy script. As a result haproxy failed to start and the job ended up stuck in the start/killed state.
From what we could tell, Upstart had a handle on a PID which didn’t actually exist, and when we ran stop haproxy the command seemed to succeed but didn’t actually do anything.
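If you want to confirm that the PID Upstart is holding on to really is gone, sending signal 0 is a common way to check. The Ruby sketch below assumes you already know the PID in question (for example from the job's status output). The helper name `pid_alive?` is made up for this example.

```ruby
# Returns true if a process with the given PID currently exists.
# Signal 0 performs the existence and permission checks without
# actually delivering a signal.
def pid_alive?(pid)
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false   # no such process
rescue Errno::EPERM
  true    # the process exists but belongs to another user
end

# The current process certainly exists:
puts pid_alive?(Process.pid)
```

If this returns false for the PID Upstart is tracking, you are in the situation described above.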
Phil pointed us to a neat script written by Clint Byrum which forks loads of short-lived processes to cycle through the PID space until one of them is given the PID Upstart is tracking, at which point Upstart can re-attach to it and kill it.
It’s available on his website but that wasn’t responding for a period of time yesterday so I’ll repeat it here just in case:
#!/usr/bin/env ruby1.8
class Workaround
  def initialize target_pid
    @target_pid = target_pid
    first_child
  end

  def first_child
    pid = fork do
      Process.setsid
      rio, wio = IO.pipe
      # Keep rio open
      until second_child rio, wio
        print "\e[A"
      end
    end
    Process.wait pid
  end

  def second_child parent_rio, parent_wio
    rio, wio = IO.pipe
    pid = fork do
      rio.close
      parent_wio.close
      puts "%20.20s" % Process.pid
      if Process.pid == @target_pid
        wio << 'a'
        wio.close
        parent_rio.read
      end
    end
    wio.close
    begin
      if rio.read == 'a'
        true
      else
        Process.wait pid
        false
      end
    ensure
      rio.close
    end
  end
end

if $0 == __FILE__
  pid = ARGV.shift
  raise "USAGE: #{$0} pid" if pid.nil?
  Workaround.new Integer pid
end
We can save that as a Ruby script, run it with the PID that Upstart is tracking as its only argument, and the world of Upstart will get back into a good place again!