We use Resque for processing our background jobs in Rails. We initially implemented Resque to send a few emails, but have since expanded our use greatly.
As we’ve come to rely on Resque for more of our app, we occasionally see long-running jobs get stuck. We know the job has finished (an email was sent or a file was generated), but the job holds on to the worker and clogs our queue. Here is our method for killing the stuck jobs and getting our queue going again.
- Log into the machine that has the stuck worker and retrieve a list of the currently running Resque processes.
$ ps -e -o pid,command | grep '[r]esque'
 4332 resque-1.20.0: Waiting for *
25859 resque-1.20.0: Waiting for *
24130 resque-1.20.0: Forked 24201 at 1345203664
24201 resque-1.20.0: Processing email_job since 1345203664
- We want to send the SIGUSR1 signal to the parent process. This tells the parent to kill its child immediately and carry on with other jobs. The full list of signals Resque workers respond to, and what each one does, is at https://github.com/defunkt/resque/#signals.
- To figure out which process is the parent, look for the one that says “Forked N at T”, where N is the PID of the child that is currently processing the stuck job and T is the Unix timestamp of the fork. In the listing above, the parent is 24130 and its child is 24201.
- Not sure which number is SIGUSR1? Use
$ kill -l
 1) SIGHUP     2) SIGINT     3) SIGQUIT    4) SIGILL
 5) SIGTRAP    6) SIGABRT    7) SIGEMT     8) SIGFPE
 9) SIGKILL   10) SIGBUS    11) SIGSEGV   12) SIGSYS
13) SIGPIPE   14) SIGALRM   15) SIGTERM   16) SIGURG
17) SIGSTOP   18) SIGTSTP   19) SIGCONT   20) SIGCHLD
21) SIGTTIN   22) SIGTTOU   23) SIGIO     24) SIGXCPU
25) SIGXFSZ   26) SIGVTALRM 27) SIGPROF   28) SIGWINCH
29) SIGINFO   30) SIGUSR1   31) SIGUSR2
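- So on this machine, 30 is SIGUSR1. Signal numbers vary by platform (on Linux, SIGUSR1 is 10), so check the list on your own system, or skip the lookup and send the signal by name with kill -USR1. Finally, send it to the parent process.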
$ kill -30 24130
- This will kill the child process immediately and allow the worker to get back to work.
All of this information is in the Resque README, but it still took me some time to figure it out, so I thought I would write a quick post on how we’re handling this.
If you’ve seen similar issues with stuck long-running jobs and have been able to prevent them, please let us know.
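The manual steps above could be sketched as a small shell helper. This is only an illustration, not something from the Resque README: the function name `stuck_resque_parents` and the 600-second threshold are hypothetical, and it assumes the process titles look exactly like the `ps` output shown earlier (“Forked N at T”, with T as a Unix timestamp). It only prints candidates; sending the signal is left to you.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list Resque parents whose child has been running too long.
# Reads `ps -e -o pid,command` style output on stdin and prints one line per
# match in the form: parent_pid child_pid age_in_seconds
# Usage: stuck_resque_parents MIN_AGE_SECONDS [NOW_EPOCH]
stuck_resque_parents() {
  local min_age="$1"
  # NOW_EPOCH defaults to the current time; it is overridable for testing.
  local now="${2:-$(date +%s)}"
  awk -v min_age="$min_age" -v now="$now" '
    # Assumed title format: <pid> resque-<version>: Forked <child> at <timestamp>
    $2 ~ /^resque-/ && $3 == "Forked" && $5 == "at" {
      age = now - $6
      if (age >= min_age) print $1, $4, age
    }
  '
}

# Example (echoes the kill commands instead of running them):
#   ps -e -o pid,command | stuck_resque_parents 600 |
#     while read -r parent child age; do echo kill -USR1 "$parent"; done
```

Sending the signal by name (`kill -USR1`) sidesteps the number lookup, which matters if the same script runs on machines with different signal numbering. Dropping the `echo` in the example would actually send the signal, so it is worth reviewing the printed list first.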