Dynamic Configuration of Java Thread Pools, and Points to Watch Out For

Preface

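Java thread pools are a perennial interview staple, but while monitoring a real application I noticed something I had previously overlooked, so I decided to revisit the topic. This post briefly covers dynamic configuration of Java thread pools and a few easily overlooked points in how thread pools are used and understood. Once these are clear, you should be able to reason more accurately about thread pool tuning. Let me start with the question posed in the subtitle: once a thread pool has grown beyond its core thread count, will the number of threads shrink back after a while if the pool keeps receiving a steady volume of tasks?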

Note: this article is based on Java 1.8.

Dynamic configuration

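The core properties of a thread pool (that is, the constructor parameters) are: corePoolSize (core thread count), maximumPoolSize (maximum thread count), keepAliveTime (how long an idle thread is kept before it is reclaimed), workQueue (the blocking queue that holds tasks), threadFactory (the thread factory), and handler (the RejectedExecutionHandler, i.e. the rejection policy). Going through the setters in the source, the ones that can be changed at runtime are corePoolSize, keepAliveTime, maximumPoolSize, the rejection handler, threadFactory, and allowCoreThreadTimeOut (via the allowCoreThreadTimeOut(boolean) method). Of these, the ones that affect how threads run, and are therefore worth tuning, are mainly corePoolSize, maximumPoolSize, and keepAliveTime. These methods turned out to be simpler than I expected, so here is the code: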

/**
 * Sets the thread factory used to create new threads.
 *
 * @param threadFactory the new thread factory
 * @throws NullPointerException if threadFactory is null
 * @see #getThreadFactory
 */
public void setThreadFactory(ThreadFactory threadFactory) {
    if (threadFactory == null)
        throw new NullPointerException();
    this.threadFactory = threadFactory;
}

/**
 * Sets a new handler for unexecutable tasks.
 *
 * @param handler the new handler
 * @throws NullPointerException if handler is null
 * @see #getRejectedExecutionHandler
 */
public void setRejectedExecutionHandler(RejectedExecutionHandler handler) {
    if (handler == null)
        throw new NullPointerException();
    this.handler = handler;
}


/**
 * Sets the core number of threads.  This overrides any value set
 * in the constructor.  If the new value is smaller than the
 * current value, excess existing threads will be terminated when
 * they next become idle.  If larger, new threads will, if needed,
 * be started to execute any queued tasks.
 *
 * @param corePoolSize the new core size
 * @throws IllegalArgumentException if {@code corePoolSize < 0}
 * @see #getCorePoolSize
 */
public void setCorePoolSize(int corePoolSize) {
    if (corePoolSize < 0)
        throw new IllegalArgumentException();
    int delta = corePoolSize - this.corePoolSize;
    this.corePoolSize = corePoolSize;
    if (workerCountOf(ctl.get()) > corePoolSize)
        interruptIdleWorkers();
    else if (delta > 0) {
        // We don't really know how many new threads are "needed".
        // As a heuristic, prestart enough new workers (up to new
        // core size) to handle the current number of tasks in
        // queue, but stop if queue becomes empty while doing so.
        int k = Math.min(delta, workQueue.size());
        while (k-- > 0 && addWorker(null, true)) {
            if (workQueue.isEmpty())
                break;
        }
    }
}

/**
 * Sets the policy governing whether core threads may time out and
 * terminate if no tasks arrive within the keep-alive time, being
 * replaced if needed when new tasks arrive. When false, core
 * threads are never terminated due to lack of incoming
 * tasks. When true, the same keep-alive policy applying to
 * non-core threads applies also to core threads. To avoid
 * continual thread replacement, the keep-alive time must be
 * greater than zero when setting {@code true}. This method
 * should in general be called before the pool is actively used.
 *
 * @param value {@code true} if should time out, else {@code false}
 * @throws IllegalArgumentException if value is {@code true}
 *         and the current keep-alive time is not greater than zero
 *
 * @since 1.6
 */
public void allowCoreThreadTimeOut(boolean value) {
    if (value && keepAliveTime <= 0)
        throw new IllegalArgumentException("Core threads must have nonzero keep alive times");
    if (value != allowCoreThreadTimeOut) {
        allowCoreThreadTimeOut = value;
        if (value)
            interruptIdleWorkers();
    }
}


/**
 * Sets the maximum allowed number of threads. This overrides any
 * value set in the constructor. If the new value is smaller than
 * the current value, excess existing threads will be
 * terminated when they next become idle.
 *
 * @param maximumPoolSize the new maximum
 * @throws IllegalArgumentException if the new maximum is
 *         less than or equal to zero, or
 *         less than the {@linkplain #getCorePoolSize core pool size}
 * @see #getMaximumPoolSize
 */
public void setMaximumPoolSize(int maximumPoolSize) {
    if (maximumPoolSize <= 0 || maximumPoolSize < corePoolSize)
        throw new IllegalArgumentException();
    this.maximumPoolSize = maximumPoolSize;
    if (workerCountOf(ctl.get()) > maximumPoolSize)
        interruptIdleWorkers();
}


/**
 * Sets the time limit for which threads may remain idle before
 * being terminated.  If there are more than the core number of
 * threads currently in the pool, after waiting this amount of
 * time without processing a task, excess threads will be
 * terminated.  This overrides any value set in the constructor.
 *
 * @param time the time to wait.  A time value of zero will cause
 *        excess threads to terminate immediately after executing tasks.
 * @param unit the time unit of the {@code time} argument
 * @throws IllegalArgumentException if {@code time} less than zero or
 *         if {@code time} is zero and {@code allowsCoreThreadTimeOut}
 * @see #getKeepAliveTime(TimeUnit)
 */
public void setKeepAliveTime(long time, TimeUnit unit) {
    if (time < 0)
        throw new IllegalArgumentException();
    if (time == 0 && allowsCoreThreadTimeOut())
        throw new IllegalArgumentException("Core threads must have nonzero keep alive times");
    long keepAliveTime = unit.toNanos(time);
    long delta = keepAliveTime - this.keepAliveTime;
    this.keepAliveTime = keepAliveTime;
    if (delta < 0)
        interruptIdleWorkers();
}
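Because setMaximumPoolSize rejects a maximum below the current corePoolSize (the first check in its body), the order of calls matters when both sizes change at runtime. Below is a minimal sketch, assuming you want to apply a new (core, max) pair in one go; `PoolResizer` and `resize` are hypothetical names, not JDK API. It raises the maximum first when growing and lowers the core first when shrinking, so no intermediate state has max below core. Java 8's setCorePoolSize does not itself compare against the maximum; later JDKs add that check, and this ordering is safe either way.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PoolResizer {
    /**
     * Hypothetical helper: applies a new (core, max) pair so that every
     * intermediate state keeps maximumPoolSize >= corePoolSize.
     */
    static void resize(ThreadPoolExecutor pool, int newCore, int newMax) {
        if (newCore < 0 || newMax <= 0 || newMax < newCore) {
            throw new IllegalArgumentException("require 0 <= core <= max and max > 0");
        }
        if (newMax >= pool.getMaximumPoolSize()) {
            // Growing max: widen the upper bound first, then move core.
            pool.setMaximumPoolSize(newMax);
            pool.setCorePoolSize(newCore);
        } else {
            // Shrinking max: lower core first, otherwise setMaximumPoolSize
            // would throw when newMax is below the old core size.
            pool.setCorePoolSize(newCore);
            pool.setMaximumPoolSize(newMax);
        }
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 4, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>(100));
        resize(pool, 8, 16);
        System.out.println(pool.getCorePoolSize() + "/" + pool.getMaximumPoolSize()); // 8/16
        resize(pool, 1, 2);
        System.out.println(pool.getCorePoolSize() + "/" + pool.getMaximumPoolSize()); // 1/2
        // A shorter keep-alive takes effect at once: idle workers are interrupted.
        pool.setKeepAliveTime(5, TimeUnit.SECONDS);
        pool.shutdown();
    }
}
```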

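Looking at these setters, apart from setCorePoolSize, which may immediately start some extra workers when the core size is raised, they essentially just assign the field and then decide whether to interrupt idle threads (by calling interruptIdleWorkers). So the heart of dynamic thread pool tuning is sending an interrupt to idle threads. Let's look at that method; it is also simple: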

/**
 * Common form of interruptIdleWorkers, to avoid having to
 * remember what the boolean argument means.
 */
private void interruptIdleWorkers() {
    interruptIdleWorkers(false);
}

/**
 * Interrupts threads that might be waiting for tasks (as
 * indicated by not being locked) so they can check for
 * termination or configuration changes. Ignores
 * SecurityExceptions (in which case some threads may remain
 * uninterrupted).
 *
 * @param onlyOne If true, interrupt at most one worker. This is
 * called only from tryTerminate when termination is otherwise
 * enabled but there are still other workers.  In this case, at
 * most one waiting worker is interrupted to propagate shutdown
 * signals in case all threads are currently waiting.
 * Interrupting any arbitrary thread ensures that newly arriving
 * workers since shutdown began will also eventually exit.
 * To guarantee eventual termination, it suffices to always
 * interrupt only one idle worker, but shutdown() interrupts all
 * idle workers so that redundant workers exit promptly, not
 * waiting for a straggler task to finish.
 */
private void interruptIdleWorkers(boolean onlyOne) {
    final ReentrantLock mainLock = this.mainLock;
    mainLock.lock();
    try {
        for (Worker w : workers) {
            Thread t = w.thread;
            if (!t.isInterrupted() && w.tryLock()) {
                try {
                    t.interrupt();
                } catch (SecurityException ignore) {
                } finally {
                    w.unlock();
                }
            }
            if (onlyOne)
                break;
        }
    } finally {
        mainLock.unlock();
    }
}

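It loops over all workers and tries to lock each one (Worker extends AQS). If the lock succeeds, the worker is idle, so it is sent an interrupt and then unlocked. The logic is simple and there isn't much to discuss, but that in itself shows how well the thread pool is designed: a simple operation achieves an effective adjustment.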
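To see why an interrupt is enough to make an idle worker re-check the configuration, here is a small standalone sketch (the class and method names are hypothetical). One thread blocks in `LinkedBlockingQueue.poll` with a 60-second timeout, the same call shape getTask uses when `timed` is true, and the main thread interrupts it after 200 ms. The poll ends with an InterruptedException long before the timeout would have expired.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class InterruptWakeDemo {
    /** Returns how many milliseconds a keep-alive style poll stayed blocked before the interrupt freed it. */
    static long blockedMillis(long keepAliveSeconds, long interruptAfterMillis) throws InterruptedException {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        long[] elapsed = new long[1];
        Thread idleWorker = new Thread(() -> {
            long start = System.nanoTime();
            try {
                // Same call shape as getTask() when timed == true; the queue stays empty.
                queue.poll(keepAliveSeconds, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                // getTask() catches this, sets timedOut = false and re-runs its checks.
            }
            elapsed[0] = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        });
        idleWorker.start();
        Thread.sleep(interruptAfterMillis);
        idleWorker.interrupt(); // what interruptIdleWorkers() does to each idle worker
        idleWorker.join();      // join() also makes elapsed[0] visible to this thread
        return elapsed[0];
    }

    public static void main(String[] args) throws InterruptedException {
        long ms = blockedMillis(60, 200);
        System.out.println("poll returned after ~" + ms + " ms instead of 60000 ms");
    }
}
```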

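To understand how dynamic configuration affects the pool's state, we need to be clear about how the Worker objects, which manage the threads doing the actual work, run. Looking at the source, if you only care about how threads run, there isn't much to it: just two methods, runWorker and getTask. Below is their source, plus a diagram I "borrowed" from the Meituan tech blog. That Meituan article is quite good (it was even their most-read article of 2020), but it does not consider the problem discussed in this post, which surprised me a little.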

// Worker thread execution logic
final void runWorker(Worker w) {
    Thread wt = Thread.currentThread();
    Runnable task = w.firstTask;
    w.firstTask = null;
    w.unlock(); // allow interrupts
    boolean completedAbruptly = true;
    try {
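        // In a nutshell: keep taking tasks from the blocking queue, run each task obtained,
        // and stop running once no task can be obtained (usually because the wait timed out)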
        while (task != null || (task = getTask()) != null) {
            w.lock();
            // If pool is stopping, ensure thread is interrupted;
            // if not, ensure thread is not interrupted.  This
            // requires a recheck in second case to deal with
            // shutdownNow race while clearing interrupt
            if ((runStateAtLeast(ctl.get(), STOP) ||
                 (Thread.interrupted() &&
                  runStateAtLeast(ctl.get(), STOP))) &&
                !wt.isInterrupted())
                wt.interrupt();
            try {
                beforeExecute(wt, task);
                Throwable thrown = null;
                try {
                    // Run the task
                    task.run();
                } catch (RuntimeException x) {
                    thrown = x; throw x;
                } catch (Error x) {
                    thrown = x; throw x;
                } catch (Throwable x) {
                    thrown = x; throw new Error(x);
                } finally {
                    afterExecute(task, thrown);
                }
            } finally {
                task = null;
                w.completedTasks++;
                w.unlock();
            }
        }
        completedAbruptly = false;
    } finally {
        processWorkerExit(w, completedAbruptly);
    }
}

// Logic for fetching a task from the blocking queue
private Runnable getTask() {
    boolean timedOut = false; // Did the last poll() time out?

    for (;;) {
        int c = ctl.get();
        int rs = runStateOf(c);

        // Check if queue empty only if necessary.
        if (rs >= SHUTDOWN && (rs >= STOP || workQueue.isEmpty())) {
            decrementWorkerCount();
            return null;
        }

        int wc = workerCountOf(c);

        // Are workers subject to culling?
        boolean timed = allowCoreThreadTimeOut || wc > corePoolSize;

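        // The worker tries to exit when both conditions hold:
        // 1. the thread count exceeds maximumPoolSize, or this worker may time out
        //    (timed) and its last poll waited keepAliveTime without getting a task
        // 2. the thread count is greater than 1, or the blocking queue is empty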
        if ((wc > maximumPoolSize || (timed && timedOut))
            && (wc > 1 || workQueue.isEmpty())) {
            // Classic CAS retry loop
            if (compareAndDecrementWorkerCount(c))
                return null;
            continue;
        }

        // An interrupt wakes up a thread blocked in poll/take
        try {
            Runnable r = timed ?
                workQueue.poll(keepAliveTime, TimeUnit.NANOSECONDS) :
                workQueue.take();
            if (r != null)
                return r;
            // Mark as timed out
            timedOut = true;
        } catch (InterruptedException retry) {
            // Being interrupted does not count as a timeout
            timedOut = false;
        }
    }
}

Figure: worker execution flow (from the Meituan tech blog).

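I don't think this needs much elaboration; I added some comments, and one read-through should be enough. To summarize: a worker thread keeps asking the task queue for tasks. If it gets one, it runs it; if not, it uses several conditions to decide whether to exit or to go for another round. It takes its lock when it starts running a task and releases it when the task finishes, so tryLock can tell whether a thread is active, i.e. currently running a task.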
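The exit test inside getTask can be pulled out into a pure function, which makes its consequences easier to see. This is a simplified model rather than JDK code: it assumes the pool is RUNNING (the SHUTDOWN/STOP checks are dropped) and ignores the CAS retry, and `ExitDecision.shouldExit` is a hypothetical name.

```java
class ExitDecision {
    /**
     * Mirrors the culling condition of JDK 8 getTask() for a RUNNING pool:
     * exit when over the maximum, or when allowed to time out and the last
     * poll did time out, unless this is the last worker and tasks are queued.
     */
    static boolean shouldExit(int workerCount, int corePoolSize, int maximumPoolSize,
                              boolean allowCoreThreadTimeOut, boolean lastPollTimedOut,
                              boolean queueEmpty) {
        boolean timed = allowCoreThreadTimeOut || workerCount > corePoolSize;
        return (workerCount > maximumPoolSize || (timed && lastPollTimedOut))
                && (workerCount > 1 || queueEmpty);
    }

    public static void main(String[] args) {
        // 300 threads, core 50: a thread whose poll keeps returning tasks never exits...
        System.out.println(shouldExit(300, 50, 300, false, false, true)); // false
        // ...only one whose poll waited a full keepAliveTime in vain does.
        System.out.println(shouldExit(300, 50, 300, false, true, true));  // true
        // Lowering max below the thread count forces an exit on the next check.
        System.out.println(shouldExit(300, 50, 50, false, false, true));  // true
        // Lowering core alone changes nothing while polls keep succeeding.
        System.out.println(shouldExit(300, 10, 300, false, false, true)); // false
    }
}
```

The last two cases show why lowering maximumPoolSize takes effect quickly, while lowering corePoolSize alone does not as long as polls keep returning tasks.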

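With these two parts covered, the effect of dynamic configuration should be fairly clear: an interrupt is sent to every inactive thread, which breaks it out of the poll/take call on the blocking queue inside getTask's loop, and the next iteration then decides whether to exit or keep waiting for a task.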

Easily overlooked issues

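Now back to the question from the beginning. In my experience, quite a few people would answer yes, or simply aren't sure. Of course, answering it accurately requires looking at specific scenarios, and a flat yes or no isn't quite right; but very often the thread count does not drop back to corePoolSize as you might expect. The reason is that in common usage: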

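In most cases the average task duration in the pool is short, especially in real-time response systems, where it rarely exceeds 1 second
keepAliveTime is usually set to 60 seconds, which is also the default in Spring's thread pool wrapper and the value Executors.newCachedThreadPool uses
The system is stable and sees a steady call volume, so the number of active threads never drops to zero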

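Under these conditions, a thread only exits if it goes a full 60 seconds without getting a single task, which is very unlikely in an active system unless the pool has an extremely large number of threads. To make this more concrete, I ran a simple constant-rate experiment: an HTTP endpoint lets me change the pool parameters and the production/consumption rates at runtime, and the pool state is printed continuously. The data: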

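corePoolSize=50, maximumPoolSize=300. I first increased the production rate to push the thread count above corePoolSize, then slowed it down and observed. The thread count drops quite slowly; it takes more than ten minutes to stabilize.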

keepalive=60s, producer=20ms/task, consumer=500ms/task, active threads=25, pool size=300
keepalive=5s, producer=20ms/task, consumer=500ms/task, active threads=25, pool size=145
keepalive=3s, producer=20ms/task, consumer=500ms/task, active threads=25, pool size=86
keepalive=2s, producer=20ms/task, consumer=500ms/task, active threads=25, pool size=64
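The active-thread figure in these runs is what Little's law predicts for the configured rates. A quick check, assuming tasks arrive at a steady one per 20 ms and each takes 500 ms; the second line adds a rough estimate under an extra assumption not measured here, namely that tasks are spread evenly over the idle threads:

```latex
\lambda = \frac{1\ \text{task}}{20\ \text{ms}} = 50\ \text{tasks/s}, \qquad
W = 500\ \text{ms} = 0.5\ \text{s}, \qquad
L = \lambda W = 50 \times 0.5 = 25

N_{\text{idle}} = 300 - 25 = 275, \qquad
\frac{N_{\text{idle}}}{\lambda} = \frac{275}{50\ \text{tasks/s}} = 5.5\ \text{s} \ll 60\ \text{s}
```

Under that evenness assumption, no thread ever waits the full 60 s for a task, so none of the 300 threads qualifies for exit.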

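The biggest consequence is that the pool's behavior changes during its lifetime. When a freshly created pool goes through its first traffic peak, it may go through these stages: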

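active threads reach corePoolSize => blocking queue fills up => active threads exceed corePoolSize => active threads reach maximumPoolSize => rejection policy is triggered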

Once that peak passes and traffic returns to normal, the pool behaves differently when the next peak arrives. A possible sequence is:

active threads go straight to maximumPoolSize => blocking queue fills up => rejection policy is triggered
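Both sequences follow from the order in which ThreadPoolExecutor.execute tries its options: add a worker while below corePoolSize, otherwise offer the task to the queue, otherwise add a worker while below maximumPoolSize, otherwise reject. The sketch below models only that order as a pure function; the names are hypothetical, and the JDK's run-state checks and post-enqueue re-check are omitted.

```java
class SubmitRouting {
    enum Route { ADD_WORKER_BELOW_CORE, ENQUEUE, ADD_WORKER_BELOW_MAX, REJECT }

    /** Simplified model of the decision order in ThreadPoolExecutor.execute() for a RUNNING pool. */
    static Route route(int workerCount, int corePoolSize, int maximumPoolSize, boolean queueHasRoom) {
        if (workerCount < corePoolSize) return Route.ADD_WORKER_BELOW_CORE; // even if other threads are idle
        if (queueHasRoom) return Route.ENQUEUE;                              // an idle thread picks it up
        if (workerCount < maximumPoolSize) return Route.ADD_WORKER_BELOW_MAX;
        return Route.REJECT;
    }

    public static void main(String[] args) {
        int core = 50, max = 300;
        // First peak on a fresh pool: extra threads appear only after the queue is full.
        System.out.println(route(10, core, max, true));   // ADD_WORKER_BELOW_CORE
        System.out.println(route(50, core, max, true));   // ENQUEUE
        System.out.println(route(50, core, max, false));  // ADD_WORKER_BELOW_MAX
        System.out.println(route(300, core, max, false)); // REJECT
        // After the peak, 300 threads linger: new tasks are queued and taken at once
        // by idle threads, so all 300 are in play before the queue even starts to fill.
        System.out.println(route(300, core, max, true));  // ENQUEUE
    }
}
```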

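In this case, after the first traffic peak the pool effectively turns into a fixedThreadPool (core = max), so the corePoolSize and maximumPoolSize settings may well need rethinking. Fundamentally this is an effect of keepAliveTime, yet most articles pay no attention to how keepAliveTime should be configured and usually discuss only corePoolSize and maximumPoolSize.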

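So in scenarios where both production and consumption are fast, keepAliveTime also deserves careful thought. As for exactly what value to use, I don't yet have precise guidance; long and short values both have trade-offs. Too long, and after a peak the corePoolSize setting may lose its original meaning; too short, and traffic jitter can cause threads to be created and destroyed frequently. Purely in theory, if you want to strictly enforce the intended logic of creating extra threads only when the core threads are all busy and the queue overflows, keepAliveTime can be set very short, even to 0, so that excess threads are destroyed quickly once the peak passes.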
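A minimal sketch of the keepAliveTime = 0 extreme, assuming a burst large enough to fill the queue and push the pool toward maximumPoolSize; `ZeroKeepAliveDemo` and its parameters are illustrative only. Once the burst drains, each extra thread's zero-timeout poll comes back empty and it exits, so the pool settles back at corePoolSize almost immediately. CallerRunsPolicy is used only so the demo can never throw on rejection.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ZeroKeepAliveDemo {
    /** Runs one burst that pushes the pool past core, then returns the pool size once it settles. */
    static int poolSizeAfterBurst(int core, int max, int queueCapacity, long keepAliveMillis)
            throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(core, max,
                keepAliveMillis, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.CallerRunsPolicy()); // demo never throws on rejection
        int tasks = max + queueCapacity; // fills the queue and grows the pool to max
        CountDownLatch done = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    Thread.sleep(100); // simulated work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            });
        }
        int peak = pool.getPoolSize();
        done.await();
        // Tasks are done; give the extra workers a moment to fall out of getTask().
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        while (pool.getPoolSize() > core && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        int settled = pool.getPoolSize();
        pool.shutdown();
        System.out.println("peak=" + peak + ", settled=" + settled);
        return settled;
    }

    public static void main(String[] args) throws InterruptedException {
        poolSizeAfterBurst(2, 6, 4, 0); // keepAliveTime = 0
    }
}
```

With a 60-second keepAliveTime, the same burst would keep the extra threads alive for about a minute afterwards, and for as long as steady load keeps feeding them.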

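If the pool mainly runs IO-bound tasks and the downstream has limited capacity, I think a fixedThreadPool (core = max) is entirely reasonable: size it according to the maximum load the downstream can bear, and make the queue just long enough to absorb peak traffic within a controllable range (if latency matters, use a small queue or reject outright). keepAliveTime then has no effect at all. The reasoning is that on today's machines, a few hundred extra threads that use little CPU don't matter much, apart from somewhat heavier lock contention; and for IO-bound work, threads mostly send a request and wait for the response, so as long as you stay within what the downstream can handle, you may as well use as many threads as the load needs, and adding threads only once the queue is full buys little. Of course, a reactive approach may be the more advanced solution, but that is outside the scope of this article.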
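A sketch of that setup, assuming the downstream can take `downstreamMaxConcurrency` concurrent calls and you are willing to buffer `peakBuffer` waiting tasks (both hypothetical parameters, as are the class and method names). Note that Executors.newFixedThreadPool itself uses an unbounded LinkedBlockingQueue, so a bounded queue means calling the ThreadPoolExecutor constructor directly.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class IoBoundPool {
    /** Hypothetical factory: fixed size (core == max), bounded queue, fail fast when both are full. */
    static ThreadPoolExecutor newIoPool(int downstreamMaxConcurrency, int peakBuffer) {
        return new ThreadPoolExecutor(
                downstreamMaxConcurrency, downstreamMaxConcurrency,
                0L, TimeUnit.MILLISECONDS,              // keepAliveTime is irrelevant when core == max
                new ArrayBlockingQueue<>(peakBuffer),   // bounded, unlike Executors.newFixedThreadPool
                new ThreadPoolExecutor.AbortPolicy());  // reject rather than pile up latency
    }

    /** Submits threads + buffer + 1 blocking calls; returns true if the extra one was rejected. */
    static boolean overflowIsRejected(int threads, int buffer) throws InterruptedException {
        ThreadPoolExecutor pool = newIoPool(threads, buffer);
        CountDownLatch response = new CountDownLatch(1);
        Runnable downstreamCall = () -> {
            try {
                response.await(); // stands in for waiting on a downstream response
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        boolean rejected = false;
        try {
            for (int i = 0; i < threads + buffer + 1; i++) {
                pool.execute(downstreamCall);
            }
        } catch (RejectedExecutionException e) {
            rejected = true;
        } finally {
            response.countDown(); // let the in-flight calls finish
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
        return rejected;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("overflow rejected: " + overflowIsRejected(2, 1)); // true
    }
}
```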

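If you want to keep a relatively long keepAliveTime but still want the pool to stay at corePoolSize under normal load, you can use dynamic configuration to reset the pool manually. It can't be done in one step, though: first lower maximumPoolSize to corePoolSize, wait for the thread count to come down, and then raise maximumPoolSize back.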
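The reset procedure can be wrapped in a helper. This is a sketch under stated assumptions: `PoolShrinker.shrinkToCore` is a hypothetical name, the 10 ms polling interval is arbitrary, and threads busy on a task only exit after finishing it. Clamping the maximum works because getTask culls a worker as soon as `wc > maximumPoolSize`, and setMaximumPoolSize interrupts idle workers so they re-check right away.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PoolShrinker {
    /**
     * Hypothetical helper: clamp maximumPoolSize to corePoolSize, wait until the
     * extra threads have exited, then restore the original maximum.
     * Returns true if the pool got down to the clamp before the timeout.
     */
    static boolean shrinkToCore(ThreadPoolExecutor pool, long timeout, TimeUnit unit)
            throws InterruptedException {
        int originalMax = pool.getMaximumPoolSize();
        int target = Math.max(pool.getCorePoolSize(), 1); // setMaximumPoolSize rejects values <= 0
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        pool.setMaximumPoolSize(target); // interrupts idle workers; they see wc > max and exit
        try {
            while (pool.getPoolSize() > target) {
                if (System.nanoTime() >= deadline) {
                    return false; // e.g. some threads are still busy on long tasks
                }
                Thread.sleep(10);
            }
            return true;
        } finally {
            pool.setMaximumPoolSize(originalMax); // the next real peak may grow past core again
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(2, 8, 60, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(2), new ThreadPoolExecutor.CallerRunsPolicy());
        CountDownLatch done = new CountDownLatch(10);
        for (int i = 0; i < 10; i++) { // 2 core + 2 queued + 6 extra threads
            pool.execute(() -> {
                try {
                    Thread.sleep(100);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            });
        }
        done.await();
        System.out.println("before: " + pool.getPoolSize()); // idle, but kept alive for 60 s
        boolean ok = shrinkToCore(pool, 5, TimeUnit.SECONDS);
        System.out.println("shrunk=" + ok + ", after: " + pool.getPoolSize()
                + ", max restored to " + pool.getMaximumPoolSize());
        pool.shutdown();
    }
}
```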

Summary

To wrap up, here are the key points to keep in mind:

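All threads in a pool are the same; there is no particular thread that is a "core thread". The pool simply behaves differently depending on the current thread count.
When the thread count exceeds corePoolSize after a submission peak, a thread exits only if it fails to obtain any task for longer than keepAliveTime.
If keepAliveTime is large and tasks arrive quickly, corePoolSize may lose its meaning after the pool has gone through one peak.
corePoolSize, maximumPoolSize, and keepAliveTime can all be adjusted at runtime. These adjustments essentially assign the field and then interrupt idle threads that are waiting to take a task from the queue, so that they enter the next loop iteration and re-evaluate the exit conditions.
If the thread count stays above corePoolSize, lowering corePoolSize to any value below the current thread count is basically useless, because the threads won't exit: they can still get tasks from the queue within keepAliveTime. Only raising it above the current thread count has an effect, by increasing the number of threads.
maximumPoolSize is a much stronger constraint: lowering it below the current thread count usually reclaims the excess threads within a short time, unless all threads are busy with long-running tasks.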