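This post is an introduction to JMH, a practical benchmarking tool. I first looked into it a while ago, kept putting off the write-up, and have since forgotten most of the details, so consider this a refresher. Learning JMH also naturally exposes you to some JVM internals, which makes it worth the effort.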

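Introduction

JMH stands for Java Microbenchmark Harness. As the name suggests, it is a benchmarking tool for Java, and "micro" indicates the level at which it applies: small pieces of code rather than whole systems. Chasing performance is something programmers do all the time, and how well a piece of code performs cannot be settled by talk; it needs quantified metrics. Many open-source projects publish JMH comparison results to show how well they perform. On today's hardware a block of code may well execute in a few nanoseconds, so to reach a visible conclusion the code usually has to be run repeatedly. Before coming across JMH, I suspect most people have wrapped a method in a for loop, run it n times, and recorded the start and end timestamps to estimate its cost. That may be acceptable for ahead-of-time compiled languages, but for Java and other JVM languages it does not give an accurate result: the JVM does a lot behind the scenes that we cannot see, so several runs of the same test can produce quite different numbers. JMH exists to solve this problem. It is written by JVM developers, and writing such a harness is not easy, because it requires deep familiarity with how the JVM works.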
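For reference, the hand-rolled approach described above usually looks something like the following sketch. The class and method names are invented for illustration, and the figures it prints should not be trusted, for exactly the reasons just given.

```java
import java.util.function.IntSupplier;

// Hypothetical illustration of the naive "loop and subtract timestamps" approach.
// This is NOT how JMH measures; it only shows the pattern JMH is meant to replace.
class NaiveTiming {

    // Runs the task `reps` times and returns the average nanoseconds per call.
    static double averageNanos(IntSupplier task, int reps) {
        int sink = 0;                       // accumulate results so the loop body is used
        long start = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            sink += task.getAsInt();
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) {                   // touch the sink so it is not obviously unused
            System.out.print("");
        }
        return (double) elapsed / reps;
    }

    public static void main(String[] args) {
        int x = 1, y = 2;
        // Repeat the same measurement: early rounds tend to run in the interpreter,
        // later ones in JIT-compiled code, so the reported figures usually drift.
        for (int round = 0; round < 5; round++) {
            System.out.printf("round %d: %.2f ns/op%n",
                    round, averageNanos(() -> x + y, 1_000_000));
        }
    }
}
```

Running it a few times on the same machine is a quick way to see the run-to-run differences mentioned above.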

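Usage

JMH is distributed as a separate library rather than as part of the JDK binaries, so a project has to declare it as a dependency (since JDK 12 the OpenJDK source tree does contain a JMH-based microbenchmark suite, but that is meant for benchmarking the JDK itself). The JMH home page is very simple: it is basically just a link to the GitHub project, and the GitHub README only gives brief usage instructions and otherwise tells you to read the samples. I actually think that works well, so this post mainly walks through the samples one by one.

The official README recommends running benchmarks from the command line: first generate the project skeleton with Maven, then write the benchmark code and package it, and finally run the resulting jar from the command line. It also recommends setting up the benchmarks as a standalone project that depends on the application project, which helps ensure that the benchmarks are initialized correctly and produce reliable results. You can also run them directly from an IDE; popular IDEs such as IntelliJ IDEA have plugins for this, which is a convenient option while learning.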

Samples

01 HelloWorld

public class JMHSample_01_HelloWorld {

@Benchmark
public void wellHelloThere() {
// this method was intentionally left blank.
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_01_HelloWorld.class.getSimpleName())
.forks(1)
.build();

new Runner(opt).run();
}

}

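JMH works as follows: you annotate a method with @Benchmark, and JMH generates code around it that runs the method as reliably as possible. Read the @Benchmark javadoc for the full semantics and restrictions. The method name does not matter; any method annotated with @Benchmark is treated as a benchmark method, and one class can contain several of them. Note that if a benchmark method never returns, the JMH run never finishes either. If the method body throws an exception, JMH ends that benchmark immediately and moves on to the next benchmark in the list. Although this benchmark does nothing, it shows well how much overhead the infrastructure itself adds to a measurement. No infrastructure is free, and it is important to know how much overhead you are dealing with. Later samples build on this idea by comparing results against a "baseline" measurement.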

02 BenchmarkModes

public class JMHSample_02_BenchmarkModes {

@Benchmark
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public void measureThroughput() throws InterruptedException {
TimeUnit.MILLISECONDS.sleep(100);
}

@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public void measureAvgTime() throws InterruptedException {
TimeUnit.MILLISECONDS.sleep(100);
}

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public void measureSamples() throws InterruptedException {
TimeUnit.MILLISECONDS.sleep(100);
}

@Benchmark
@BenchmarkMode(Mode.SingleShotTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public void measureSingleShot() throws InterruptedException {
TimeUnit.MILLISECONDS.sleep(100);
}

@Benchmark
@BenchmarkMode({Mode.Throughput, Mode.AverageTime, Mode.SampleTime, Mode.SingleShotTime})
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public void measureMultiple() throws InterruptedException {
TimeUnit.MILLISECONDS.sleep(100);
}

@Benchmark
@BenchmarkMode(Mode.All)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public void measureAll() throws InterruptedException {
TimeUnit.MILLISECONDS.sleep(100);
}

}

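This sample introduces the @BenchmarkMode annotation and its companion @OutputTimeUnit. The enum values passed to @BenchmarkMode select the kind of measurement; the annotation takes an array, so several modes can be run at once.

  • Throughput: operations per unit of time. Raw throughput is measured by calling the benchmark method continuously within a time-bounded iteration and counting how many times it ran.
  • AverageTime: average time per operation. It is similar to Throughput; sometimes measuring time is simply more convenient.
  • SampleTime: samples the time of individual executions. The method still runs in time-bounded iterations, but instead of measuring the total time, JMH measures the time spent in some of the calls. This is mainly useful for inferring the time distribution and percentiles. JMH tries to adjust the sampling frequency automatically; if the method is slow enough, every invocation ends up being sampled.
  • SingleShotTime: measures the time of a single invocation. The iteration time is meaningless in this mode, because the iteration is over as soon as the method returns. This mode is useful for measuring cold-start behavior.
  • All: all of the modes above.

The sample's javadoc also notes that if some behavior puzzles you, it helps to look at the generated code; you may find that the code is not doing what you expect.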
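To make the relationship between the modes a little more concrete, here is a small sketch that derives an average time, a throughput, and a percentile from a handful of invented per-call timings. The class `ModeMath` and its methods are hypothetical helpers, not part of JMH, and JMH's own statistics are more elaborate than this.

```java
import java.util.Arrays;

// Hypothetical helper: derives the kind of figures the modes report
// from made-up per-invocation timings. Not part of JMH.
class ModeMath {

    // AverageTime view: total time divided by the number of operations.
    static double averageTime(long[] nanosPerCall) {
        return (double) Arrays.stream(nanosPerCall).sum() / nanosPerCall.length;
    }

    // Throughput view: operations per second, the reciprocal of the same data.
    static double throughputPerSecond(long[] nanosPerCall) {
        return nanosPerCall.length / (Arrays.stream(nanosPerCall).sum() / 1e9);
    }

    // SampleTime view: the distribution matters, here via a nearest-rank percentile.
    static long percentile(long[] nanosPerCall, double p) {
        long[] sorted = nanosPerCall.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        long[] samples = {100, 100, 100, 100, 600}; // nanoseconds, invented values
        System.out.println("avg ns/op : " + averageTime(samples));
        System.out.println("ops/s     : " + throughputPerSecond(samples));
        System.out.println("p50 ns    : " + percentile(samples, 50));
        System.out.println("p100 ns   : " + percentile(samples, 100));
    }
}
```

With these values the average is 200 ns while the median is only 100 ns, which is the kind of skew that SampleTime is meant to reveal and that AverageTime alone hides.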

03 States

public class JMHSample_03_States {

@State(Scope.Benchmark)
public static class BenchmarkState {
volatile double x = Math.PI;
}

@State(Scope.Thread)
public static class ThreadState {
volatile double x = Math.PI;
}

@Benchmark
public void measureUnshared(ThreadState state) {
// All benchmark threads will call in this method.
//
// However, since ThreadState is the Scope.Thread, each thread
// will have it's own copy of the state, and this benchmark
// will measure unshared case.
state.x++;
}

@Benchmark
public void measureShared(BenchmarkState state) {
// All benchmark threads will call in this method.
//
// Since BenchmarkState is the Scope.Benchmark, all threads
// will share the state instance, and we will end up measuring
// shared case.
state.x++;
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_03_States.class.getSimpleName())
.threads(4)
.forks(1)
.build();

new Runner(opt).run();
}

}

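Benchmarks often need to maintain some state, and JMH is frequently used to build concurrent benchmarks, so it provides the @State annotation to mark state objects. Instances of annotated classes are created on demand and reused throughout the run according to the given scope: in the sample, Scope.Thread gives each thread its own instance, while Scope.Benchmark shares one instance among all threads. Note that a State object is always instantiated by one of the threads that will access it, which means you can initialize its fields as if you were doing so in a worker thread. Benchmark methods reference State objects simply by declaring them as parameters, and JMH injects them automatically.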

04 Default State

@State(Scope.Thread)
public class JMHSample_04_DefaultState {

double x = Math.PI;

@Benchmark
public void measure() {
x++;
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_04_DefaultState.class.getSimpleName())
.forks(1)
.build();

new Runner(opt).run();
}

}

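Often you only need a single state object. In that case you can annotate the benchmark class itself with @State, which makes it easy to reference its own fields.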

05 State Fixtures

@State(Scope.Thread)
public class JMHSample_05_StateFixtures {

double x;

@Setup
public void prepare() {
x = Math.PI;
}

@TearDown
public void check() {
assert x > Math.PI : "Nothing changed?";
}

@Benchmark
public void measureRight() {
x++;
}

@Benchmark
public void measureWrong() {
double x = 0;
x++;
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_05_StateFixtures.class.getSimpleName())
.forks(1)
.jvmArgs("-ea")
.build();

new Runner(opt).run();
}

}

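Because State objects live for the duration of the benchmark, it helps to have methods that manage their state. JMH provides the usual fixture methods, which will look familiar if you have used JUnit or TestNG. Fixture methods are only valid on State objects; otherwise JMH fails at compile time. They are also only called by one of the threads that uses the State object, which means the code inside a fixture method runs in a thread-local context.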

06 Fixture Level

@State(Scope.Thread)
public class JMHSample_06_FixtureLevel {

double x;

@TearDown(Level.Iteration)
public void check() {
assert x > Math.PI : "Nothing changed?";
}

@Benchmark
public void measureRight() {
x++;
}

@Benchmark
public void measureWrong() {
double x = 0;
x++;
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_06_FixtureLevel.class.getSimpleName())
.forks(1)
.jvmArgs("-ea")
.shouldFailOnError(false) // switch to "true" to fail the complete run
.build();

new Runner(opt).run();
}

}

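Fixture methods can run at different levels. There are three:

  1. Level.Trial: called before and after the entire benchmark run
  2. Level.Iteration: called before and after each iteration
  3. Level.Invocation: called before and after each benchmark method call. Read the javadoc carefully before using this level so that you understand its restrictions

Time spent in fixture methods is not counted in the results, so they can do relatively heavy work.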
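The nesting of the three levels can be pictured with a small simulation. This is only a sketch of the ordering described above, with an invented class name and a fixed number of invocations per iteration; in a real run JMH's iterations are time-bounded and the code it generates is quite different.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of when fixture methods fire at each Level.
// It mirrors the description in the text, not JMH's generated code.
class FixtureOrder {

    static List<String> simulate(int iterations, int invocationsPerIteration) {
        List<String> events = new ArrayList<>();
        events.add("setup(Trial)");                      // once per run
        for (int it = 0; it < iterations; it++) {
            events.add("setup(Iteration)");              // once per iteration
            for (int inv = 0; inv < invocationsPerIteration; inv++) {
                events.add("setup(Invocation)");         // around every single call
                events.add("benchmark");
                events.add("teardown(Invocation)");
            }
            events.add("teardown(Iteration)");
        }
        events.add("teardown(Trial)");
        return events;
    }

    public static void main(String[] args) {
        simulate(2, 2).forEach(System.out::println);
    }
}
```

The output makes it easy to see why Level.Invocation is the expensive one: its hooks run as often as the benchmark method itself.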

07 Fixture Level Invocation

@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class JMHSample_07_FixtureLevelInvocation {

/*
* Fixtures have different Levels to control when they are about to run.
* Level.Invocation is useful sometimes to do some per-invocation work,
* which should not count as payload. PLEASE NOTE the timestamping and
* synchronization for Level.Invocation helpers might significantly offset
* the measurement, use with care. See Level.Invocation javadoc for further
* discussion.
*
* Consider this sample:
*/

/*
* This state handles the executor.
* Note we create and shutdown executor with Level.Trial, so
* it is kept around the same across all iterations.
*/

@State(Scope.Benchmark)
public static class NormalState {
ExecutorService service;

@Setup(Level.Trial)
public void up() {
service = Executors.newCachedThreadPool();
}

@TearDown(Level.Trial)
public void down() {
service.shutdown();
}

}

/*
* This is the *extension* of the basic state, which also
* has the Level.Invocation fixture method, sleeping for some time.
*/

public static class LaggingState extends NormalState {
public static final int SLEEP_TIME = Integer.getInteger("sleepTime", 10);

@Setup(Level.Invocation)
public void lag() throws InterruptedException {
TimeUnit.MILLISECONDS.sleep(SLEEP_TIME);
}
}

/*
* This allows us to formulate the task: measure the task turnaround in
* "hot" mode when we are not sleeping between the submits, and "cold" mode,
* when we are sleeping.
*/

@Benchmark
@BenchmarkMode(Mode.AverageTime)
public double measureHot(NormalState e, final Scratch s) throws ExecutionException, InterruptedException {
return e.service.submit(new Task(s)).get();
}

@Benchmark
@BenchmarkMode(Mode.AverageTime)
public double measureCold(LaggingState e, final Scratch s) throws ExecutionException, InterruptedException {
return e.service.submit(new Task(s)).get();
}

/*
* This is our scratch state which will handle the work.
*/

@State(Scope.Thread)
public static class Scratch {
private double p;
public double doWork() {
p = Math.log(p);
return p;
}
}

public static class Task implements Callable<Double> {
private Scratch s;

public Task(Scratch s) {
this.s = s;
}

@Override
public Double call() {
return s.doWork();
}
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_07_FixtureLevelInvocation.class.getSimpleName())
.forks(1)
.build();

new Runner(opt).run();
}


}

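This is a usage example for Level.Invocation. It defines three State classes, the first two related by inheritance, and their fixture methods use different levels. The two benchmark methods have identical bodies, but because they take different State objects, measureCold sleeps for 10 ms (the default of the sleepTime property) before every invocation. This lets the sample compare how the thread pool behaves when tasks are submitted back to back ("hot") and when there is a pause between submissions ("cold").

Level.Invocation is convenient when some work has to happen before or after every invocation, but read its javadoc carefully before using it. The javadoc says it is mainly suited to benchmarks that take more than 1 ms per invocation and gives four warnings:

  1. Because the time spent in Setup and TearDown must be excluded from the results, this level has to time every invocation separately. If the benchmarked method is short, the calls that fetch the system timestamp distort the result and can even become the bottleneck
  2. Also because of per-invocation timing: the individual timings are summed up, which can lose precision and produce a total that is shorter than the real time
  3. To keep the same sharing behavior as the other levels, JMH sometimes has to synchronize access to the state object, which can skew the measurement
  4. In the current implementation, helper methods and benchmark methods overlap in time, which can matter in multi-threaded benchmarks: a thread executing the benchmark method may observe that another thread has already called TearDown, which leads to exceptions.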

08 Dead Code

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class JMHSample_08_DeadCode {

private double x = Math.PI;

@Benchmark
public void baseline() {
// do nothing, this is a baseline
}

@Benchmark
public void measureWrong() {
// This is wrong: result is not used and the entire computation is optimized away.
Math.log(x);
}

@Benchmark
public double measureRight() {
// This is correct: the result is being used.
return Math.log(x);
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_08_DeadCode.class.getSimpleName())
.forks(1)
.build();

new Runner(opt).run();
}

}

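This sample illustrates the dead-code trap. Many benchmarks go wrong because they do not account for dead-code elimination (DCE). Compilers are smart enough to deduce that some computations are redundant and eliminate them completely; if the eliminated part is the code being benchmarked, the result is meaningless. Fortunately JMH provides the infrastructure to deal with this: give the method a return type and return the computed result, and JMH takes care of protecting it from DCE.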

09 Blackholes

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class JMHSample_09_Blackholes {

double x1 = Math.PI;
double x2 = Math.PI * 2;

@Benchmark
public double baseline() {
return Math.log(x1);
}

/*
* While the Math.log(x2) computation is intact, Math.log(x1)
* is redundant and optimized out.
*/

@Benchmark
public double measureWrong() {
Math.log(x1);
return Math.log(x2);
}

/*
* This demonstrates Option A:
*
* Merge multiple results into one and return it.
* This is OK when is computation is relatively heavyweight, and merging
* the results does not offset the results much.
*/

@Benchmark
public double measureRight_1() {
return Math.log(x1) + Math.log(x2);
}

/*
* This demonstrates Option B:
*
* Use explicit Blackhole objects, and sink the values there.
* (Background: Blackhole is just another @State object, bundled with JMH).
*/

@Benchmark
public void measureRight_2(Blackhole bh) {
bh.consume(Math.log(x1));
bh.consume(Math.log(x2));
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_09_Blackholes.class.getSimpleName())
.forks(1)
.build();

new Runner(opt).run();
}

}

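This sample introduces Blackhole, the object that ultimately defeats DCE. If the benchmark method produces a single result, simply return it, and JMH implicitly passes the return value to a Blackhole. If the method produces several results, either merge them into one return value or take a Blackhole parameter and consume each value explicitly.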
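The idea behind "sinking" a value can be sketched in plain Java. The class below is a deliberately crude, hypothetical stand-in: JMH's real Blackhole is implemented differently and far more carefully, and the volatile write used here has a cost of its own that would distort a real measurement.

```java
// Hypothetical, deliberately crude stand-in for the idea behind Blackhole.
// JMH's real implementation differs; do not use this for actual measurements.
class CrudeSink {
    // Writing to a volatile field of a shared object is a visible side effect,
    // which makes it much harder for the JIT to treat the computation that
    // produced the value as dead code.
    private volatile double sink;

    void consume(double value) {
        sink = value;
    }

    double last() {
        return sink;
    }

    public static void main(String[] args) {
        CrudeSink bh = new CrudeSink();
        double x1 = Math.PI, x2 = Math.PI * 2;
        bh.consume(Math.log(x1)); // both results are now "used"
        bh.consume(Math.log(x2));
        System.out.println("last consumed value: " + bh.last());
    }
}
```

In a real benchmark the Blackhole parameter injected by JMH should be used instead, as measureRight_2 does.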

10 Constant Fold

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class JMHSample_10_ConstantFold {

// IDEs will say "Oh, you can convert this field to local variable". Don't. Trust. Them.
// (While this is normally fine advice, it does not work in the context of measuring correctly.)
private double x = Math.PI;

// IDEs will probably also say "Look, it could be final". Don't. Trust. Them. Either.
// (While this is normally fine advice, it does not work in the context of measuring correctly.)
private final double wrongX = Math.PI;

@Benchmark
public double baseline() {
// simply return the value, this is a baseline
return Math.PI;
}

@Benchmark
public double measureWrong_1() {
// This is wrong: the source is predictable, and computation is foldable.
return Math.log(Math.PI);
}

@Benchmark
public double measureWrong_2() {
// This is wrong: the source is predictable, and computation is foldable.
return Math.log(wrongX);
}

@Benchmark
public double measureRight() {
// This is correct: the source is not predictable.
return Math.log(x);
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_10_ConstantFold.class.getSimpleName())
.forks(1)
.build();

new Runner(opt).run();
}

}

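This sample is about another JVM optimization, constant folding. If the JVM can tell that a computation always yields the same result, in other words that it is a constant, it can cleverly optimize it away. In this sample that means the computation can be moved out of JMH's internal loop. The usual way to avoid this is to read the inputs from non-final fields of a State object. Note that IDEs sometimes suggest making a field final; that is sound advice for ordinary code, but it needs careful thought in a benchmark.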
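Folding by the JIT is hard to observe directly, but the compile-time half of the story can be demonstrated. The sketch below uses String identity to show that javac treats a final field initialized with a constant expression as a compile-time constant, while a non-final field is read at run time. The names are invented, and this only illustrates the javac side; the sample's Math.log(...) folding happens later, in the JIT.

```java
// Shows javac's compile-time folding of constant expressions, using String
// identity as the observable effect. Class and field names are hypothetical.
class FoldingDemo {
    static final String FINAL_PART = "a"; // a constant variable: javac inlines it
    static String nonFinalPart = "a";     // not a constant: read at run time

    static boolean foldedAtCompileTime() {
        // FINAL_PART + "b" is a constant expression, so javac emits the literal
        // "ab", which is interned and therefore identical to the other "ab".
        return (FINAL_PART + "b") == "ab";
    }

    static boolean computedAtRunTime() {
        // This concatenation happens at run time and creates a new String object.
        return (nonFinalPart + "b") == "ab";
    }

    public static void main(String[] args) {
        System.out.println(foldedAtCompileTime()); // true
        System.out.println(computedAtRunTime());   // false
    }
}
```

As far as I can tell the same rule applies to wrongX in the sample: a final field initialized with Math.PI is a constant as far as the compiler is concerned, which is why only the non-final x gives a trustworthy measurement.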

11 Loops

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class JMHSample_11_Loops {

/*
* Suppose we want to measure how much it takes to sum two integers:
*/

int x = 1;
int y = 2;

/*
* This is what you do with JMH.
*/

@Benchmark
public int measureRight() {
return (x + y);
}

/*
* The following tests emulate the naive looping.
* This is the Caliper-style benchmark.
*/
private int reps(int reps) {
int s = 0;
for (int i = 0; i < reps; i++) {
s += (x + y);
}
return s;
}

/*
* We would like to measure this with different repetitions count.
* Special annotation is used to get the individual operation cost.
*/

@Benchmark
@OperationsPerInvocation(1)
public int measureWrong_1() {
return reps(1);
}

@Benchmark
@OperationsPerInvocation(10)
public int measureWrong_10() {
return reps(10);
}

@Benchmark
@OperationsPerInvocation(100)
public int measureWrong_100() {
return reps(100);
}

@Benchmark
@OperationsPerInvocation(1_000)
public int measureWrong_1000() {
return reps(1_000);
}

@Benchmark
@OperationsPerInvocation(10_000)
public int measureWrong_10000() {
return reps(10_000);
}

@Benchmark
@OperationsPerInvocation(100_000)
public int measureWrong_100000() {
return reps(100_000);
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_11_Loops.class.getSimpleName())
.forks(1)
.build();

new Runner(opt).run();
}

}

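This sample shows that you should not add your own loop inside a benchmark method in order to reduce the number of method calls. It is tempting to think that looping inside the method, rather than having the harness call the method repeatedly, minimizes the overhead of the call. That reasoning is wrong: once the optimizer is allowed to merge loop iterations, you will see some unexpected results.

Running the code above shows that once the JVM has optimized the inner loop, the apparent cost per operation improves by about 10x (the exact figure depends on the machine).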

12 Forking

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class JMHSample_12_Forking {

/*
* Suppose we have this simple counter interface, and two implementations.
* Even though those are semantically the same, from the JVM standpoint,
* those are distinct classes.
*/

public interface Counter {
int inc();
}

public static class Counter1 implements Counter {
private int x;

@Override
public int inc() {
return x++;
}
}

public static class Counter2 implements Counter {
private int x;

@Override
public int inc() {
return x++;
}
}

/*
* And this is how we measure it.
* Note this is susceptible for same issue with loops we mention in previous examples.
*/

public int measure(Counter c) {
int s = 0;
for (int i = 0; i < 10; i++) {
s += c.inc();
}
return s;
}

/*
* These are two counters.
*/
Counter c1 = new Counter1();
Counter c2 = new Counter2();

/*
* We first measure the Counter1 alone...
* Fork(0) helps to run in the same JVM.
*/

@Benchmark
@Fork(0)
public int measure_1_c1() {
return measure(c1);
}

/*
* Then Counter2...
*/

@Benchmark
@Fork(0)
public int measure_2_c2() {
return measure(c2);
}

/*
* Then Counter1 again...
*/

@Benchmark
@Fork(0)
public int measure_3_c1_again() {
return measure(c1);
}

/*
* These two tests have explicit @Fork annotation.
* JMH takes this annotation as the request to run the test in the forked JVM.
* It's even simpler to force this behavior for all the tests via the command
* line option "-f". The forking is default, but we still use the annotation
* for the consistency.
*
* This is the test for Counter1.
*/

@Benchmark
@Fork(1)
public int measure_4_forked_c1() {
return measure(c1);
}

/*
* ...and this is the test for Counter2.
*/

@Benchmark
@Fork(1)
public int measure_5_forked_c2() {
return measure(c2);
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_12_Forking.class.getSimpleName())
.build();

new Runner(opt).run();
}

}

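The JVM is good at profile-guided optimization. That is bad for benchmarks, because different tests can mix their profiles together, and the JVM then produces "uniformly bad" code for every one of them. Forking each test (running it in a separate process) avoids this problem. JMH forks by default; you can look at the process list while a benchmark is running to confirm it.

In the sample above, Counter1 and Counter2 are logically equivalent, but to the JVM they are still distinct classes. Running the two counters alternately in the same process therefore makes performance worse: measure_3_c1_again is noticeably slower than measure_1_c1.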

13 Run To Run

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class JMHSample_13_RunToRun {

/*
* In order to introduce readily measurable run-to-run variance, we build
* the workload which performance differs from run to run. Note that many workloads
* will have the similar behavior, but we do that artificially to make a point.
*/

@State(Scope.Thread)
public static class SleepyState {
public long sleepTime;

@Setup
public void setup() {
sleepTime = (long) (Math.random() * 1000);
}
}

/*
* Now, we will run this different number of times.
*/

@Benchmark
@Fork(1)
public void baseline(SleepyState s) throws InterruptedException {
TimeUnit.MILLISECONDS.sleep(s.sleepTime);
}

@Benchmark
@Fork(5)
public void fork_1(SleepyState s) throws InterruptedException {
TimeUnit.MILLISECONDS.sleep(s.sleepTime);
}

@Benchmark
@Fork(20)
public void fork_2(SleepyState s) throws InterruptedException {
TimeUnit.MILLISECONDS.sleep(s.sleepTime);
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_13_RunToRun.class.getSimpleName())
.warmupIterations(0)
.measurementIterations(3)
.build();

new Runner(opt).run();
}

}

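The JVM is a complex system, and that brings a lot of non-determinism. Sometimes the variance between runs has to be taken into account. Besides keeping profiles from mixing, forking helps here as well: JMH automatically folds the results of all forked processes into the statistics, which is convenient. In the sample, sleepTime is derived from a random number to simulate the difference between runs.

14 N/A

Has this sample been removed?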

15 Asymmetric

@State(Scope.Group)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class JMHSample_15_Asymmetric {

private AtomicInteger counter;

@Setup
public void up() {
counter = new AtomicInteger();
}

@Benchmark
@Group("g")
@GroupThreads(3)
public int inc() {
return counter.incrementAndGet();
}

@Benchmark
@Group("g")
@GroupThreads(1)
public int get() {
return counter.get();
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_15_Asymmetric.class.getSimpleName())
.forks(1)
.build();

new Runner(opt).run();
}

}

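This sample introduces groups. Up to this point all benchmarks were symmetric: every thread executed the same code. Groups make asymmetric benchmarks possible by binding several methods together and specifying how threads are distributed among them. In the code above, inc and get both belong to group g but are given different thread counts, so a run uses 3 threads for inc and 1 thread for get. Running with 4 threads creates a single execution group, and running with 4*N threads creates N execution groups.

Note that the State scopes also include Scope.Group, which shares one State instance among the threads of each group.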
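The thread arithmetic can be written down as a small sketch. `GroupMath` is a hypothetical helper that only reproduces the 4*N rule stated above for thread counts that are exact multiples of the group size; it says nothing about how JMH itself schedules threads.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical arithmetic helper mirroring the thread distribution described
// in the text; it is not how JMH is implemented.
class GroupMath {

    // Threads needed for one complete group: the sum of all @GroupThreads values.
    static int groupSize(Map<String, Integer> threadsPerMethod) {
        return threadsPerMethod.values().stream().mapToInt(Integer::intValue).sum();
    }

    // Number of complete groups formed by `totalThreads`
    // (only meaningful when totalThreads is a multiple of the group size).
    static int groupCount(Map<String, Integer> threadsPerMethod, int totalThreads) {
        return totalThreads / groupSize(threadsPerMethod);
    }

    public static void main(String[] args) {
        Map<String, Integer> g = new LinkedHashMap<>();
        g.put("inc", 3); // corresponds to @GroupThreads(3)
        g.put("get", 1); // corresponds to @GroupThreads(1)
        for (int threads : new int[] {4, 8, 16}) {
            int groups = groupCount(g, threads);
            System.out.println(threads + " threads: " + groups + " group(s), "
                    + groups * g.get("inc") + " on inc, "
                    + groups * g.get("get") + " on get");
        }
    }
}
```

With Scope.Group, each of those groups would get its own counter instance, shared by its 3 inc threads and 1 get thread.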

16 Compiler Control

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class JMHSample_16_CompilerControl {

/**
* These are our targets:
* - first method is prohibited from inlining
* - second method is forced to inline
* - third method is prohibited from compiling
*
* We might even place the annotations directly to the benchmarked
* methods, but this expresses the intent more clearly.
*/

public void target_blank() {
// this method was intentionally left blank
}

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public void target_dontInline() {
// this method was intentionally left blank
}

@CompilerControl(CompilerControl.Mode.INLINE)
public void target_inline() {
// this method was intentionally left blank
}

@CompilerControl(CompilerControl.Mode.EXCLUDE)
public void target_exclude() {
// this method was intentionally left blank
}

/*
* These method measures the calls performance.
*/

@Benchmark
public void baseline() {
// this method was intentionally left blank
}

@Benchmark
public void blank() {
target_blank();
}

@Benchmark
public void dontinline() {
target_dontInline();
}

@Benchmark
public void inline() {
target_inline();
}

@Benchmark
public void exclude() {
target_exclude();
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_16_CompilerControl.class.getSimpleName())
.warmupIterations(0)
.measurementIterations(3)
.forks(1)
.build();

new Runner(opt).run();
}

}

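This sample shows that annotations can be used to tell the compiler to do specific things, such as whether to inline a method or to exclude it from compilation altogether. The code above is fairly self-explanatory.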

17 Sync Iterations

@State(Scope.Thread)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class JMHSample_17_SyncIterations {

/*
* This is the another thing that is enabled in JMH by default.
*
* Suppose we have this simple benchmark.
*/

private double src;

@Benchmark
public double test() {
double s = src;
for (int i = 0; i < 1000; i++) {
s = Math.sin(s);
}
return s;
}

/*
* It turns out if you run the benchmark with multiple threads,
* the way you start and stop the worker threads seriously affects
* performance.
*
* The natural way would be to park all the threads on some sort
* of barrier, and the let them go "at once". However, that does
* not work: there are no guarantees the worker threads will start
* at the same time, meaning other worker threads are working
* in better conditions, skewing the result.
*
* The better solution would be to introduce bogus iterations,
* ramp up the threads executing the iterations, and then atomically
* shift the system to measuring stuff. The same thing can be done
* during the rampdown. This sounds complicated, but JMH already
* handles that for you.
*
*/

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_17_SyncIterations.class.getSimpleName())
.warmupTime(TimeValue.seconds(1))
.measurementTime(TimeValue.seconds(1))
.threads(Runtime.getRuntime().availableProcessors()*16)
.forks(1)
.syncIterations(true) // try to switch to "false"
.build();

new Runner(opt).run();
}

}

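The point of this sample is mostly in its comments. In practice, when a benchmark runs with multiple threads, the way the worker threads are started and stopped seriously affects the results. The natural approach is to park all threads at some kind of barrier and then release them together, but that does not work well, because it cannot guarantee that the worker threads start working at the same time. A better solution is to introduce bogus iterations, ramp up the threads executing them, and then atomically switch the system over to measuring; the same can be done while ramping down. It sounds complicated, but JMH already takes care of it through the syncIterations option.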
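The "bogus iterations, then switch atomically" idea can be sketched in plain Java as follows. This is only an illustration under my own assumptions (invented class name, a single volatile flag as the switch); it is not JMH's actual mechanism.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of "ramp up with bogus work, then atomically switch to
// measuring". It illustrates the idea only, not what JMH does internally.
class RampUpSketch {
    static volatile boolean measuring; // flipped once, observed by all workers
    static volatile boolean running;

    static long run(int threads, long warmMillis, long measureMillis) {
        measuring = false;
        running = true;
        AtomicLong measuredOps = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                double s = 0.5;
                long counted = 0;
                while (running) {
                    s = Math.sin(s);      // the work; it also runs during ramp-up
                    if (measuring) {
                        counted++;        // only counted inside the window
                    }
                }
                if (s == 42) {
                    System.out.print(""); // keep s in use
                }
                measuredOps.addAndGet(counted);
            });
            workers[t].start();
        }
        try {
            Thread.sleep(warmMillis);   // workers should all be busy by now
            measuring = true;           // one write opens the window for everyone
            Thread.sleep(measureMillis);
            measuring = false;          // close the window before stopping anyone
            running = false;
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            running = false;
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while measuring", e);
        }
        return measuredOps.get();
    }

    public static void main(String[] args) {
        System.out.println("ops inside window: " + run(4, 200, 200));
    }
}
```

The key property is that no thread is counted while others are still starting or already stopping, which is the unfairness the barrier approach cannot rule out.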

18 Control

@State(Scope.Group)
public class JMHSample_18_Control {

/*
* In this example, we want to estimate the ping-pong speed for the simple
* AtomicBoolean. Unfortunately, doing that in naive manner will livelock
* one of the threads, because the executions of ping/pong are not paired
* perfectly. We need the escape hatch to terminate the loop if threads
* are about to leave the measurement.
*/

public final AtomicBoolean flag = new AtomicBoolean();

@Benchmark
@Group("pingpong")
public void ping(Control cnt) {
while (!cnt.stopMeasurement && !flag.compareAndSet(false, true)) {
// this body is intentionally left blank
}
}

@Benchmark
@Group("pingpong")
public void pong(Control cnt) {
while (!cnt.stopMeasurement && !flag.compareAndSet(true, false)) {
// this body is intentionally left blank
}
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_18_Control.class.getSimpleName())
.threads(2)
.forks(1)
.build();

new Runner(opt).run();
}

}

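This sample introduces Control, an experimental utility class. Its main purpose is to let a benchmark method stop when its termination depends on a condition; if a benchmark method never returns, the whole run never ends. In the sample above, the two methods in the same group each perform a CAS on the shared flag. Without Control, one of the methods would be stuck in its loop forever once the measurement stops, because its partner is no longer there to flip the flag back.

19 N/A

Has this sample been removed?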
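A plain-Java analogue may make the escape hatch easier to see. In the sketch below the `stop` flag plays the role that Control.stopMeasurement plays in the sample; everything else (class name, method names, timings) is invented for illustration and none of it is JMH API.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical plain-Java analogue of the ping/pong sample. The `stop` flag
// stands in for Control.stopMeasurement; nothing here is JMH API.
class PingPongSketch {
    static final AtomicBoolean flag = new AtomicBoolean();
    static volatile boolean stop;

    // Spins until the flag flips from `expect` to `!expect`, or until stop is set.
    static boolean flip(boolean expect) {
        while (!stop) {
            if (flag.compareAndSet(expect, !expect)) {
                return true;
            }
        }
        return false; // escape hatch: give up because the run is ending
    }

    // Returns {successful pings, successful pongs} after roughly `millis` ms.
    static long[] run(long millis) {
        stop = false;
        flag.set(false);
        long[] counts = new long[2];
        Thread ping = new Thread(() -> { while (flip(false)) counts[0]++; });
        Thread pong = new Thread(() -> { while (flip(true)) counts[1]++; });
        ping.start();
        pong.start();
        try {
            Thread.sleep(millis);
            stop = true; // without this, the side waiting for its partner never returns
            ping.join();
            pong.join();
        } catch (InterruptedException e) {
            stop = true;
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] c = run(100);
        System.out.println("pings=" + c[0] + " pongs=" + c[1]);
    }
}
```

Because pings and pongs strictly alternate, the two counts differ by at most one; removing the `stop` check from `flip` would leave one thread spinning after its partner has finished.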

20 Annotations

public class JMHSample_20_Annotations {

double x1 = Math.PI;

/*
* In addition to all the command line options usable at run time,
* we have the annotations which can provide the reasonable defaults
* for the some of the benchmarks. This is very useful when you are
* dealing with lots of benchmarks, and some of them require
* special treatment.
*
* Annotation can also be placed on class, to have the effect over
* all the benchmark methods in the same class. The rule is, the
* annotation in the closest scope takes the precedence: i.e.
* the method-based annotation overrides class-based annotation,
* etc.
*/

@Benchmark
@Warmup(iterations = 5, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 100, timeUnit = TimeUnit.MILLISECONDS)
public double measure() {
return Math.log(x1);
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_20_Annotations.class.getSimpleName())
.build();

new Runner(opt).run();
}

}

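JMH can be configured not only at run time through the Options object but also through annotations such as @Measurement and @Warmup; most configuration parameters have a corresponding annotation. An annotation placed on the class applies to all benchmark methods in it, and the annotation in the closest scope takes precedence, so a method-level annotation overrides a class-level one.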

21 Consume CPU

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class JMHSample_21_ConsumeCPU {

/*
* At times you require the test to burn some of the cycles doing nothing.
* In many cases, you *do* want to burn the cycles instead of waiting.
*
* For these occasions, we have the infrastructure support. Blackholes
* can not only consume the values, but also the time! Run this test
* to get familiar with this part of JMH.
*
* (Note we use static method because most of the use cases are deep
* within the testing code, and propagating blackholes is tedious).
*/

@Benchmark
public void consume_0000() {
Blackhole.consumeCPU(0);
}

@Benchmark
public void consume_0001() {
Blackhole.consumeCPU(1);
}

@Benchmark
public void consume_0002() {
Blackhole.consumeCPU(2);
}

@Benchmark
public void consume_0004() {
Blackhole.consumeCPU(4);
}

@Benchmark
public void consume_0008() {
Blackhole.consumeCPU(8);
}

@Benchmark
public void consume_0016() {
Blackhole.consumeCPU(16);
}

@Benchmark
public void consume_0032() {
Blackhole.consumeCPU(32);
}

@Benchmark
public void consume_0064() {
Blackhole.consumeCPU(64);
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_21_ConsumeCPU.class.getSimpleName())
.forks(1)
.build();

new Runner(opt).run();
}

}

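This sample introduces a way to "burn" CPU time. Sometimes you simply need to consume some cycles, and the static method Blackhole.consumeCPU is a quick way to do that.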

22 False Sharing

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(5)
public class JMHSample_22_FalseSharing {

/*
* Suppose we have two threads:
* a) innocuous reader which blindly reads its own field
* b) furious writer which updates its own field
*/

/*
* BASELINE EXPERIMENT:
* Because of the false sharing, both reader and writer will experience
* penalties.
*/

@State(Scope.Group)
public static class StateBaseline {
int readOnly;
int writeOnly;
}

@Benchmark
@Group("baseline")
public int reader(StateBaseline s) {
return s.readOnly;
}

@Benchmark
@Group("baseline")
public void writer(StateBaseline s) {
s.writeOnly++;
}

/*
* APPROACH 1: PADDING
*
* We can try to alleviate some of the effects with padding.
* This is not versatile because JVMs can freely rearrange the
* field order, even of the same type.
*/

@State(Scope.Group)
public static class StatePadded {
int readOnly;
int p01, p02, p03, p04, p05, p06, p07, p08;
int p11, p12, p13, p14, p15, p16, p17, p18;
int writeOnly;
int q01, q02, q03, q04, q05, q06, q07, q08;
int q11, q12, q13, q14, q15, q16, q17, q18;
}

@Benchmark
@Group("padded")
public int reader(StatePadded s) {
return s.readOnly;
}

@Benchmark
@Group("padded")
public void writer(StatePadded s) {
s.writeOnly++;
}

/*
* APPROACH 2: CLASS HIERARCHY TRICK
*
* We can alleviate false sharing with this convoluted hierarchy trick,
* using the fact that superclass fields are usually laid out first.
* In this construction, the protected field will be squashed between
* paddings.
* It is important to use the smallest data type, so that layouter would
* not generate any gaps that can be taken by later protected subclasses
* fields. Depending on the actual field layout of classes that bear the
* protected fields, we might need more padding to account for "lost"
* padding fields pulled into in their superclass gaps.
*/

public static class StateHierarchy_1 {
int readOnly;
}

public static class StateHierarchy_2 extends StateHierarchy_1 {
byte p01, p02, p03, p04, p05, p06, p07, p08;
byte p11, p12, p13, p14, p15, p16, p17, p18;
byte p21, p22, p23, p24, p25, p26, p27, p28;
byte p31, p32, p33, p34, p35, p36, p37, p38;
byte p41, p42, p43, p44, p45, p46, p47, p48;
byte p51, p52, p53, p54, p55, p56, p57, p58;
byte p61, p62, p63, p64, p65, p66, p67, p68;
byte p71, p72, p73, p74, p75, p76, p77, p78;
}

public static class StateHierarchy_3 extends StateHierarchy_2 {
int writeOnly;
}

public static class StateHierarchy_4 extends StateHierarchy_3 {
byte q01, q02, q03, q04, q05, q06, q07, q08;
byte q11, q12, q13, q14, q15, q16, q17, q18;
byte q21, q22, q23, q24, q25, q26, q27, q28;
byte q31, q32, q33, q34, q35, q36, q37, q38;
byte q41, q42, q43, q44, q45, q46, q47, q48;
byte q51, q52, q53, q54, q55, q56, q57, q58;
byte q61, q62, q63, q64, q65, q66, q67, q68;
byte q71, q72, q73, q74, q75, q76, q77, q78;
}

@State(Scope.Group)
public static class StateHierarchy extends StateHierarchy_4 {
}

@Benchmark
@Group("hierarchy")
public int reader(StateHierarchy s) {
return s.readOnly;
}

@Benchmark
@Group("hierarchy")
public void writer(StateHierarchy s) {
s.writeOnly++;
}

/*
* APPROACH 3: ARRAY TRICK
*
* This trick relies on the contiguous allocation of an array.
* Instead of placing the fields in the class, we mangle them
* into the array at very sparse offsets.
*/

@State(Scope.Group)
public static class StateArray {
int[] arr = new int[128];
}

@Benchmark
@Group("sparse")
public int reader(StateArray s) {
return s.arr[0];
}

@Benchmark
@Group("sparse")
public void writer(StateArray s) {
s.arr[64]++;
}

/*
* APPROACH 4:
*
* @Contended (since JDK 8):
* Uncomment the annotation if building with JDK 8.
* Remember to flip -XX:-RestrictContended to enable.
*/

@State(Scope.Group)
public static class StateContended {
int readOnly;

// @sun.misc.Contended
int writeOnly;
}

@Benchmark
@Group("contended")
public int reader(StateContended s) {
return s.readOnly;
}

@Benchmark
@Group("contended")
public void writer(StateContended s) {
s.writeOnly++;
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_22_FalseSharing.class.getSimpleName())
.threads(Runtime.getRuntime().availableProcessors())
.build();

new Runner(opt).run();
}

}

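False sharing is a common problem in concurrent programming. CPU caches store data in units of cache lines. When several threads modify variables that are logically independent but sit on the same cache line, each write invalidates that line in the other cores' caches and performance drops. Microbenchmarks cannot ignore this effect either. The sample shows several ways to work around it:

  1. Field padding: declare extra fields to fill out the cache line
  2. Class hierarchy: another form of padding, with the padding fields placed in separate classes of the inheritance chain
  3. Array padding: allocate a longer array and keep the useful elements further apart than one cache line
  4. Annotation: JDK 8 provides the @Contended annotation, which asks the JVM (not the compiler) to pad the annotated field; as the comment in the sample notes, -XX:-RestrictContended has to be set for it to take effect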
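To make the array trick a bit more concrete, here is a small plain-Java sketch (no JMH needed) that checks whether two int[] elements are far enough apart that they cannot share a cache line. The 64-byte line size and the names CacheLineSketch, byteDistance and guaranteedSeparateLines are my own assumptions for illustration; the real line size depends on the CPU.

```java
final class CacheLineSketch {
    // Assumed cache-line size in bytes; 64 is common but not universal.
    static final int ASSUMED_LINE_BYTES = 64;

    /** Distance in bytes between two elements of an int[]. */
    static int byteDistance(int indexA, int indexB) {
        return Math.abs(indexA - indexB) * Integer.BYTES;
    }

    /**
     * Two addresses that are at least one full line apart can never share a line,
     * whatever the alignment of the array base happens to be.
     */
    static boolean guaranteedSeparateLines(int indexA, int indexB) {
        return byteDistance(indexA, indexB) >= ASSUMED_LINE_BYTES;
    }

    public static void main(String[] args) {
        // Indices used by the "sparse" group in the sample.
        System.out.println(guaranteedSeparateLines(0, 64));
        // Adjacent ints, which may well land on the same line.
        System.out.println(guaranteedSeparateLines(0, 1));
    }
}
```

Under these assumptions, arr[0] and arr[64] from the sample are 256 bytes apart, comfortably more than one line, whereas two adjacent fields or elements give no such guarantee.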

23 Aux Counters

@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
public class JMHSample_23_AuxCounters {

/*
* In some weird cases you need to get the separate throughput/time
* metrics for the benchmarked code depending on the outcome of the
* current code. Trying to accommodate the cases like this, JMH optionally
* provides the special annotation which treats @State objects
* as the object bearing user counters. See @AuxCounters javadoc for
* the limitations.
*/

@State(Scope.Thread)
@AuxCounters(AuxCounters.Type.OPERATIONS)
public static class OpCounters {
// These fields would be counted as metrics
public int case1;
public int case2;

// This accessor will also produce a metric
public int total() {
return case1 + case2;
}
}

@State(Scope.Thread)
@AuxCounters(AuxCounters.Type.EVENTS)
public static class EventCounters {
// This field would be counted as metric
public int wows;
}

/*
* This code measures the "throughput" in two parts of the branch.
* The @AuxCounters state above holds the counters which we increment
* ourselves, and then let JMH to use their values in the performance
* calculations.
*/

@Benchmark
public void splitBranch(OpCounters counters) {
if (Math.random() < 0.1) {
counters.case1++;
} else {
counters.case2++;
}
}

@Benchmark
public void runSETI(EventCounters counters) {
float random = (float) Math.random();
float wowSignal = (float) Math.PI / 4;
if (random == wowSignal) {
// WOW, that's unusual.
counters.wows++;
}
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_23_AuxCounters.class.getSimpleName())
.build();

new Runner(opt).run();
}

}

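Auxiliary counters are not very common, so I will simply translate the comment from the sample: in some unusual cases you need separate throughput/time metrics depending on the outcome of the code being benchmarked. To accommodate such cases, JMH provides a special annotation that treats @State objects as objects bearing user counters. See the @AuxCounters javadoc for the limitations.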
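As a rough illustration of what the counter object in splitBranch is doing, here is a plain-Java sketch without JMH. The class BranchCountersSketch and its methods are hypothetical, and perSecond is only my reading of how an operations-style counter relates to elapsed time; check the @AuxCounters javadoc for the actual rules.

```java
final class BranchCountersSketch {
    int case1;
    int case2;

    /** Derived metric, like the total() accessor in the sample. */
    int total() {
        return case1 + case2;
    }

    /** Mirrors the branch in splitBranch(): samples below 0.1 count as case1. */
    void record(double sample) {
        if (sample < 0.1) {
            case1++;
        } else {
            case2++;
        }
    }

    /** Turns a raw count into a rate over a measured interval (an assumption of this sketch). */
    static double perSecond(int count, long elapsedNanos) {
        return count / (elapsedNanos / 1_000_000_000.0);
    }

    public static void main(String[] args) {
        BranchCountersSketch counters = new BranchCountersSketch();
        long start = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            counters.record(Math.random());
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("case1/s: " + perSecond(counters.case1, elapsed));
        System.out.println("case2/s: " + perSecond(counters.case2, elapsed));
    }
}
```

The point is only that each branch gets its own number instead of a single combined score.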

24 Inheritance

public class JMHSample_24_Inheritance {

/*
* In very special circumstances, you might want to provide the benchmark
* body in the (abstract) superclass, and specialize it with the concrete
* pieces in the subclasses.
*
* The rule of thumb is: if some class has @Benchmark method, then all the subclasses
* are also having the "synthetic" @Benchmark method. The caveat is, because we only
* know the type hierarchy during the compilation, it is only possible during
* the same compilation session. That is, mixing in the subclass extending your
* benchmark class *after* the JMH compilation would have no effect.
*
* Note how annotations now have two possible places. The closest annotation
* in the hierarchy wins.
*/

@BenchmarkMode(Mode.AverageTime)
@Fork(1)
@State(Scope.Thread)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public static abstract class AbstractBenchmark {
int x;

@Setup
public void setup() {
x = 42;
}

@Benchmark
@Warmup(iterations = 5, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 100, timeUnit = TimeUnit.MILLISECONDS)
public double bench() {
return doWork() * doWork();
}

protected abstract double doWork();
}

public static class BenchmarkLog extends AbstractBenchmark {
@Override
protected double doWork() {
return Math.log(x);
}
}

public static class BenchmarkSin extends AbstractBenchmark {
@Override
protected double doWork() {
return Math.sin(x);
}
}

public static class BenchmarkCos extends AbstractBenchmark {
@Override
protected double doWork() {
return Math.cos(x);
}
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_24_Inheritance.class.getSimpleName())
.build();

new Runner(opt).run();
}

}

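JMH supports inheritance: you can configure the benchmark with annotations in an abstract superclass and leave some abstract methods for the subclasses to implement. The @Benchmark annotation is effectively inherited, so every subclass gets the superclass's benchmark method. Note that the type hierarchy is only known at compile time, so this works only for subclasses compiled in the same JMH compilation session; a subclass added after JMH compilation has no effect. As for annotations that appear in more than one place, the one closest in the hierarchy wins.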
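The "closest annotation wins" rule can be pictured with a little reflection code. This is only a model: JMH resolves its annotations during annotation processing at compile time, not with runtime reflection, and the @Forks annotation and class names below are hypothetical stand-ins of mine.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

final class NearestAnnotationSketch {
    // Hypothetical stand-in for a JMH configuration annotation.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Forks {
        int value();
    }

    @Forks(1)
    static abstract class Base {}

    // No annotation of its own: falls back to the one on Base.
    static class Inherits extends Base {}

    // Its own annotation is closer than the one on Base.
    @Forks(3)
    static class Overrides extends Base {}

    /** Walks up the hierarchy and returns the first value found, or -1 if there is none. */
    static int resolveForks(Class<?> type) {
        for (Class<?> c = type; c != null; c = c.getSuperclass()) {
            Forks forks = c.getDeclaredAnnotation(Forks.class);
            if (forks != null) {
                return forks.value();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(resolveForks(Inherits.class));
        System.out.println(resolveForks(Overrides.class));
    }
}
```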

25 API GA

This sample is fairly involved and I do not fully understand it yet, so I will skip it for now.

26 Batch Size

@State(Scope.Thread)
public class JMHSample_26_BatchSize {

/*
* Suppose we want to measure insertion in the middle of the list.
*/

List<String> list = new LinkedList<>();

@Benchmark
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@BenchmarkMode(Mode.AverageTime)
public List<String> measureWrong_1() {
list.add(list.size() / 2, "something");
return list;
}

@Benchmark
@Warmup(iterations = 5, time = 5)
@Measurement(iterations = 5, time = 5)
@BenchmarkMode(Mode.AverageTime)
public List<String> measureWrong_5() {
list.add(list.size() / 2, "something");
return list;
}

/*
* This is what you do with JMH.
*/
@Benchmark
@Warmup(iterations = 5, batchSize = 5000)
@Measurement(iterations = 5, batchSize = 5000)
@BenchmarkMode(Mode.SingleShotTime)
public List<String> measureRight() {
list.add(list.size() / 2, "something");
return list;
}

@Setup(Level.Iteration)
public void setup(){
list.clear();
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_26_BatchSize.class.getSimpleName())
.forks(1)
.build();

new Runner(opt).run();
}

}

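If the benchmark method has no steady state, that is, its cost changes substantially from one invocation to the next, measuring over a fixed time window does not work and Mode.SingleShotTime has to be used instead. A single invocation, however, is too short to give a trustworthy result for this operation, and that is where the batchSize parameter comes in.

The example above measures insertion into the middle of a linked list. The cost depends on the length of the list, so it is not stable. To make every measurement run under equivalent conditions, a fixed number of invocations is needed. The behaviour of the correct benchmark, measureRight, can therefore be described as: 5 iterations, one shot per iteration, and 5000 invocations of the benchmark method per shot.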
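To show what "one shot of 5000 invocations on a freshly cleared list" means, here is a hand-rolled sketch in plain Java. It has none of JMH's protections (no forking, no warmup handling), so the timings it prints are not to be trusted; BatchSketch and singleShotNanos are hypothetical names used only for this illustration.

```java
import java.util.LinkedList;
import java.util.List;

final class BatchSketch {
    /** Clears the list, then times exactly batchSize mid-list insertions as a single shot. */
    static long singleShotNanos(List<String> list, int batchSize) {
        list.clear(); // same role as @Setup(Level.Iteration) in the sample
        long start = System.nanoTime();
        for (int i = 0; i < batchSize; i++) {
            list.add(list.size() / 2, "something");
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        List<String> list = new LinkedList<>();
        for (int shot = 0; shot < 5; shot++) {
            System.out.println(singleShotNanos(list, 5000) + " ns for 5000 insertions");
        }
    }
}
```

Every shot starts from an empty list and ends at the same size, so the shots are comparable. In the time-based variants, by contrast, the list keeps growing for as long as the iteration lasts.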

27 Params

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@State(Scope.Benchmark)
public class JMHSample_27_Params {

/**
* In many cases, the experiments require walking the configuration space
* for a benchmark. This is needed for additional control, or investigating
* how the workload performance changes with different settings.
*/

@Param({"1", "31", "65", "101", "103"})
public int arg;

@Param({"0", "1", "2", "4", "8", "16", "32"})
public int certainty;

@Benchmark
public boolean bench() {
return BigInteger.valueOf(arg).isProbablePrime(certainty);
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_27_Params.class.getSimpleName())
// .param("arg", "41", "42") // Use this to selectively constrain/override parameters
.build();

new Runner(opt).run();
}

}

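This example is easy to follow and very practical. You often need to compare results under different configuration values, and JMH can run a benchmark across multiple parameters: candidate values are supplied through the @Param annotation and the param option. Note that when there are several parameters, each with several candidate values, JMH runs every combination of them (the Cartesian product).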
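To get a feel for how quickly the combinations add up, the sketch below enumerates the parameter grid of this sample in plain Java. ParamGridSketch is a hypothetical helper, and the iteration order is my own choice, not necessarily the order in which JMH runs the combinations.

```java
import java.util.ArrayList;
import java.util.List;

final class ParamGridSketch {
    // The candidate values from the two @Param fields in the sample.
    static final int[] ARG = {1, 31, 65, 101, 103};
    static final int[] CERTAINTY = {0, 1, 2, 4, 8, 16, 32};

    /** Builds every (arg, certainty) pair. */
    static List<int[]> combinations() {
        List<int[]> out = new ArrayList<>();
        for (int arg : ARG) {
            for (int certainty : CERTAINTY) {
                out.add(new int[] {arg, certainty});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(combinations().size() + " combinations");
    }
}
```

Five values times seven values gives 35 runs of the benchmark, each with its own warmup and measurement iterations, so it pays to keep the candidate lists short.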

28 Blackhole Helpers

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@State(Scope.Thread)
public class JMHSample_28_BlackholeHelpers {

/**
* Sometimes you need the black hole not in @Benchmark method, but in
* helper methods, because you want to pass it through to the concrete
* implementation which is instantiated in helper methods. In this case,
* you can request the black hole straight in the helper method signature.
* This applies to both @Setup and @TearDown methods, and also to other
* JMH infrastructure objects, like Control.
*
* Below is the variant of {@link org.openjdk.jmh.samples.JMHSample_08_DeadCode}
* test, but wrapped in the anonymous classes.
*/

public interface Worker {
void work();
}

private Worker workerBaseline;
private Worker workerRight;
private Worker workerWrong;

@Setup
public void setup(final Blackhole bh) {
workerBaseline = new Worker() {
double x;

@Override
public void work() {
// do nothing
}
};

workerWrong = new Worker() {
double x;

@Override
public void work() {
Math.log(x);
}
};

workerRight = new Worker() {
double x;

@Override
public void work() {
bh.consume(Math.log(x));
}
};

}

@Benchmark
public void baseline() {
workerBaseline.work();
}

@Benchmark
public void measureWrong() {
workerWrong.work();
}

@Benchmark
public void measureRight() {
workerRight.work();
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_28_BlackholeHelpers.class.getSimpleName())
.build();

new Runner(opt).run();
}

}

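This sample shows that a Blackhole can be used in, and stored by, helper methods. Here the @Setup method takes a Blackhole as a parameter and uses it in different implementations of the Worker interface. The same injection works for other JMH infrastructure objects, such as Control.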
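The capture pattern itself is ordinary Java. In the sketch below a counting DoubleConsumer plays the part of the Blackhole; CountingSink and workerUsing are hypothetical names, and a real Blackhole does considerably more than count.

```java
import java.util.function.DoubleConsumer;

final class CapturedSinkSketch {
    interface Worker {
        void work();
    }

    /** Hypothetical stand-in for Blackhole: it only counts what it is given. */
    static final class CountingSink implements DoubleConsumer {
        long consumed;

        @Override
        public void accept(double value) {
            consumed++;
        }
    }

    /** Plays the role of @Setup: the sink arrives as a parameter and is captured by the worker. */
    static Worker workerUsing(DoubleConsumer sink, double x) {
        return new Worker() {
            @Override
            public void work() {
                sink.accept(Math.log(x)); // the result is handed to the sink, not dropped
            }
        };
    }

    public static void main(String[] args) {
        CountingSink sink = new CountingSink();
        Worker worker = workerUsing(sink, 2.0);
        worker.work();
        System.out.println("values consumed: " + sink.consumed);
    }
}
```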

29 States DAG

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@State(Scope.Thread)
public class JMHSample_29_StatesDAG {

/**
* WARNING:
* THIS IS AN EXPERIMENTAL FEATURE, BE READY FOR IT TO BE REMOVED WITHOUT NOTICE!
*/

/*
* This is a model case, and it might not be a good benchmark.
* // TODO: Replace it with the benchmark which does something useful.
*/

public static class Counter {
int x;

public int inc() {
return x++;
}

public void dispose() {
// pretend this is something really useful
}
}

/*
* Shared state maintains the set of Counters, and worker threads should
* poll their own instances of Counter to work with. However, it should only
* be done once, and therefore, Local state caches it after requesting the
* counter from Shared state.
*/

@State(Scope.Benchmark)
public static class Shared {
List<Counter> all;
Queue<Counter> available;

@Setup
public synchronized void setup() {
all = new ArrayList<>();
for (int c = 0; c < 10; c++) {
all.add(new Counter());
}

available = new LinkedList<>();
available.addAll(all);
}

@TearDown
public synchronized void tearDown() {
for (Counter c : all) {
c.dispose();
}
}

public synchronized Counter getMine() {
return available.poll();
}
}

@State(Scope.Thread)
public static class Local {
Counter cnt;

@Setup
public void setup(Shared shared) {
cnt = shared.getMine();
}
}

@Benchmark
public int test(Local local) {
return local.cnt.inc();
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_29_StatesDAG.class.getSimpleName())
.build();

new Runner(opt).run();
}


}

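This sample covers State objects that depend on each other. JMH allows dependencies between State objects as long as they form a DAG (directed acyclic graph). In the example the Thread-scoped Local object depends on the Benchmark-scoped Shared object, and each Local takes its own Counter from the queue held by Shared. This is an experimental feature and not commonly used, so a brief look is enough.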
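Stripped of the annotations, the dependency between the two states looks like the plain-Java sketch below. It only models the hand-out of counters; StateDagSketch is a hypothetical name, and in the real sample it is JMH that constructs the states and calls the @Setup methods in dependency order.

```java
import java.util.ArrayDeque;
import java.util.Queue;

final class StateDagSketch {
    static final class Counter {
        int x;

        int inc() {
            return x++;
        }
    }

    /** Plays the role of the Benchmark-scoped state: one pool shared by everybody. */
    static final class Shared {
        private final Queue<Counter> available = new ArrayDeque<>();

        Shared(int size) {
            for (int i = 0; i < size; i++) {
                available.add(new Counter());
            }
        }

        /** Hands out each counter at most once; returns null when the pool is empty. */
        synchronized Counter getMine() {
            return available.poll();
        }
    }

    /** Plays the role of the Thread-scoped state: asks Shared once, then caches the result. */
    static final class Local {
        final Counter cnt;

        Local(Shared shared) {
            cnt = shared.getMine();
        }
    }
}
```

Local can only be built after Shared exists, which is exactly the edge in the dependency graph.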

30 Interrupts

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Group)
public class JMHSample_30_Interrupts {

/*
* In this example, we want to measure the simple performance characteristics
* of the ArrayBlockingQueue. Unfortunately, doing that without a harness
* support will deadlock one of the threads, because the executions of
* take/put are not paired perfectly. Fortunately for us, both methods react
* to interrupts well, and therefore we can rely on JMH to terminate the
* measurement for us. JMH will notify users about the interrupt actions
* nevertheless, so users can see if those interrupts affected the measurement.
* JMH will start issuing interrupts after the default or user-specified timeout
* had been reached.
*
* This is a variant of org.openjdk.jmh.samples.JMHSample_18_Control, but without
* the explicit control objects. This example is suitable for the methods which
* react to interrupts gracefully.
*/

private BlockingQueue<Integer> q;

@Setup
public void setup() {
q = new ArrayBlockingQueue<>(1);
}

@Group("Q")
@Benchmark
public Integer take() throws InterruptedException {
return q.take();
}

@Group("Q")
@Benchmark
public void put() throws InterruptedException {
q.put(42);
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_30_Interrupts.class.getSimpleName())
.threads(2)
.forks(5)
.timeout(TimeValue.seconds(10))
.build();

new Runner(opt).run();
}

}

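JMH can set a timeout for benchmark methods and interrupts the invocation once it expires. The example above resembles sample 18 but has no Control object, so one of the two methods will block when the measurement is about to stop. JMH interrupts it when the default or configured timeout is reached and reports that it did so, which lets the user judge whether the interrupt affected the results.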
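The reason this works is that take() and put() respond to interrupts. The sketch below shows that behaviour in isolation, with a sleep standing in for the harness timeout; InterruptSketch and takeWasInterrupted are hypothetical names, and nothing here reflects how JMH schedules its own interrupts.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

final class InterruptSketch {
    /** Blocks a thread in take() on an empty queue and interrupts it after timeoutMillis. */
    static boolean takeWasInterrupted(long timeoutMillis) {
        BlockingQueue<Integer> q = new ArrayBlockingQueue<>(1);
        AtomicBoolean interrupted = new AtomicBoolean(false);

        Thread taker = new Thread(() -> {
            try {
                q.take(); // nothing is ever put, so this blocks
            } catch (InterruptedException e) {
                interrupted.set(true); // take() reacted to the interrupt
            }
        });
        taker.setDaemon(true); // never keep the JVM alive because of this thread
        taker.start();

        try {
            Thread.sleep(timeoutMillis); // stand-in for the harness timeout
            taker.interrupt();
            taker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("caller was interrupted", e);
        }
        return interrupted.get();
    }

    public static void main(String[] args) {
        System.out.println("interrupted: " + takeWasInterrupted(100));
    }
}
```

A method that swallows or ignores interrupts would not be released this way, which is why the sample stresses that it suits methods that react to interrupts gracefully.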

31 Infra Params

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Benchmark)
public class JMHSample_31_InfraParams {

/*
* There is a way to query JMH about the current running mode. This is
* possible with three infrastructure objects we can request to be injected:
* - BenchmarkParams: covers the benchmark-global configuration
* - IterationParams: covers the current iteration configuration
* - ThreadParams: covers the specifics about threading
*
* Suppose we want to check how the ConcurrentHashMap scales under different
* parallelism levels. We can put concurrencyLevel in @Param, but it is sometimes
* inconvenient if, say, we want it to follow the @Threads count. Here is
* how we can query JMH about how many threads were requested for the current run,
* and put that into concurrencyLevel argument for CHM constructor.
*/

static final int THREAD_SLICE = 1000;

private ConcurrentHashMap<String, String> mapSingle;
private ConcurrentHashMap<String, String> mapFollowThreads;

@Setup
public void setup(BenchmarkParams params) {
int capacity = 16 * THREAD_SLICE * params.getThreads();
mapSingle = new ConcurrentHashMap<>(capacity, 0.75f, 1);
mapFollowThreads = new ConcurrentHashMap<>(capacity, 0.75f, params.getThreads());
}

/*
* Here is another neat trick. Generate the distinct set of keys for all threads:
*/

@State(Scope.Thread)
public static class Ids {
private List<String> ids;

@Setup
public void setup(ThreadParams threads) {
ids = new ArrayList<>();
for (int c = 0; c < THREAD_SLICE; c++) {
ids.add("ID" + (THREAD_SLICE * threads.getThreadIndex() + c));
}
}
}

@Benchmark
public void measureDefault(Ids ids) {
for (String s : ids.ids) {
mapSingle.remove(s);
mapSingle.put(s, s);
}
}

@Benchmark
public void measureFollowThreads(Ids ids) {
for (String s : ids.ids) {
mapFollowThreads.remove(s);
mapFollowThreads.put(s, s);
}
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_31_InfraParams.class.getSimpleName())
.threads(4)
.forks(5)
.build();

new Runner(opt).run();
}

}

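JMH provides helper classes for querying the current configuration at run time, so benchmark code can adapt to it. There are three parameter objects: BenchmarkParams, IterationParams and ThreadParams. As the names suggest, they cover the benchmark-wide configuration, the configuration of the current iteration and the threading details respectively.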
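The key-generation trick in the Ids state is worth isolating: every thread derives its keys from its own index, so the key sets never overlap. In the plain-Java sketch below the threadIndex argument stands in for threads.getThreadIndex(); ThreadKeysSketch, idsFor and capacityFor are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class ThreadKeysSketch {
    static final int THREAD_SLICE = 1000;

    /** Keys for one thread; each thread index gets its own slice of the id space. */
    static List<String> idsFor(int threadIndex) {
        List<String> ids = new ArrayList<>();
        for (int c = 0; c < THREAD_SLICE; c++) {
            ids.add("ID" + (THREAD_SLICE * threadIndex + c));
        }
        return ids;
    }

    /** The initial capacity computed in the sample's setup for a given thread count. */
    static int capacityFor(int threads) {
        return 16 * THREAD_SLICE * threads;
    }

    public static void main(String[] args) {
        System.out.println(idsFor(1).get(0));
        System.out.println(capacityFor(4));
    }
}
```

Because the slices are disjoint, threads never touch each other's keys, so what the benchmark compares is how the two maps behave under concurrent access to different entries.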

32 Bulk Warmup

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class JMHSample_32_BulkWarmup {

/*
* This is an addendum to JMHSample_12_Forking test.
*
* Sometimes you want an opposite configuration: instead of separating the profiles
* for different benchmarks, you want to mix them together to test the worst-case
* scenario.
*
* JMH has a bulk warmup feature for that: it does the warmups for all the tests
* first, and then measures them. JMH still forks the JVM for each test, but once the
* new JVM has started, all the warmups are being run there, before running the
* measurement. This helps to dodge the type profile skews, as each test is still
* executed in a different JVM, and we only "mix" the warmup code we want.
*/

/*
* These test classes are borrowed verbatim from JMHSample_12_Forking.
*/

public interface Counter {
int inc();
}

public static class Counter1 implements Counter {
private int x;

@Override
public int inc() {
return x++;
}
}

public static class Counter2 implements Counter {
private int x;

@Override
public int inc() {
return x++;
}
}

Counter c1 = new Counter1();
Counter c2 = new Counter2();

/*
* And this is our test payload. Notice we have to break the inlining of the payload,
* so that it could not be inlined in either measure_c1() or measure_c2() below, and
* specialized for that only call.
*/

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public int measure(Counter c) {
int s = 0;
for (int i = 0; i < 10; i++) {
s += c.inc();
}
return s;
}

@Benchmark
public int measure_c1() {
return measure(c1);
}

@Benchmark
public int measure_c2() {
return measure(c2);
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_32_BulkWarmup.class.getSimpleName())
// .includeWarmup(...) <-- this may include other benchmarks into warmup
.warmupMode(WarmupMode.BULK) // see other WarmupMode.* as well
.forks(1)
.build();

new Runner(opt).run();
}

}

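This is an addendum to sample 12. There we learned that, to keep the JVM's profile-guided optimizations for one benchmark from being skewed by another, JMH by default forks a process so that each benchmark method is warmed up and measured in its own JVM. Sometimes, though, the mixed case is exactly what you want to test. Setting warmupMode to WarmupMode.BULK makes JMH run the warmups of all the benchmarks before measuring the target benchmark. Note that JMH still forks a process for each benchmark; only the warmup behaviour at the start of each process changes.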
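The difference in ordering is easier to see written out. The sketch below is a toy model of the sequence described in the sample's comment, not JMH code: the Plan enum and the step strings are hypothetical and exist only to show who gets warmed up inside each forked JVM.

```java
import java.util.ArrayList;
import java.util.List;

final class WarmupPlanSketch {
    // Hypothetical names for this model; they are not JMH's own enum.
    enum Plan { INDIVIDUAL, BULK }

    /** Lists the steps taken for each benchmark under the given plan. */
    static List<String> steps(Plan plan, List<String> benchmarks) {
        List<String> out = new ArrayList<>();
        for (String target : benchmarks) {
            out.add("fork JVM for " + target);
            if (plan == Plan.BULK) {
                // Bulk: warm up every benchmark inside this fork first.
                for (String b : benchmarks) {
                    out.add("warmup " + b);
                }
            } else {
                // Individual: warm up only the benchmark about to be measured.
                out.add("warmup " + target);
            }
            out.add("measure " + target);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> benchmarks = new ArrayList<>();
        benchmarks.add("measure_c1");
        benchmarks.add("measure_c2");
        for (String step : steps(Plan.BULK, benchmarks)) {
            System.out.println(step);
        }
    }
}
```

In the bulk plan the shared measure(Counter) method sees both Counter1 and Counter2 during warmup in every fork, so its type profile is mixed before either benchmark is measured.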

33 Security Manager

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class JMHSample_33_SecurityManager {

/*
* Some targeted tests may care about SecurityManager being installed.
* Since JMH itself needs to do privileged actions, it is not enough
* to blindly install the SecurityManager, as JMH infrastructure will fail.
*/

/*
* In this example, we want to measure the performance of System.getProperty
* with SecurityManager installed or not. To do this, we have two state classes
* with helper methods. One that reads the default JMH security policy (we ship one
* with JMH), and installs the security manager; another one that makes sure
* the SecurityManager is not installed.
*
* If you need a restricted security policy for the tests, you are advised to
* get /jmh-security-minimal.policy, that contains the minimal permissions
* required for JMH benchmark to run, merge the new permissions there, produce new
* policy file in a temporary location, and load that policy file instead.
* There is also /jmh-security-minimal-runner.policy, that contains the minimal
* permissions for the JMH harness to run, if you want to use JVM args to arm
* the SecurityManager.
*/

@State(Scope.Benchmark)
public static class SecurityManagerInstalled {
@Setup
public void setup() throws IOException, NoSuchAlgorithmException, URISyntaxException {
URI policyFile = JMHSample_33_SecurityManager.class.getResource("/jmh-security.policy").toURI();
Policy.setPolicy(Policy.getInstance("JavaPolicy", new URIParameter(policyFile)));
System.setSecurityManager(new SecurityManager());
}

@TearDown
public void tearDown() {
System.setSecurityManager(null);
}
}

@State(Scope.Benchmark)
public static class SecurityManagerEmpty {
@Setup
public void setup() throws IOException, NoSuchAlgorithmException, URISyntaxException {
System.setSecurityManager(null);
}
}

@Benchmark
public String testWithSM(SecurityManagerInstalled s) throws InterruptedException {
return System.getProperty("java.home");
}

@Benchmark
public String testWithoutSM(SecurityManagerEmpty s) throws InterruptedException {
return System.getProperty("java.home");
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_33_SecurityManager.class.getSimpleName())
.warmupIterations(5)
.measurementIterations(5)
.forks(1)
.build();

new Runner(opt).run();
}

}

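This sample is about security, which in Java relies mainly on the Security Manager. It shows how to compare a benchmark run with a security policy installed against a run with no security manager at all.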

34 Safe Looping

@State(Scope.Thread)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class JMHSample_34_SafeLooping {

/*
* JMHSample_11_Loops warns about the dangers of using loops in @Benchmark methods.
* Sometimes, however, one needs to traverse through several elements in a dataset.
* This is hard to do without loops, and therefore we need to devise a scheme for
* safe looping.
*/

/*
* Suppose we want to measure how much it takes to execute work() with different
* arguments. This mimics a frequent use case when multiple instances with the same
* implementation, but different data, is measured.
*/

static final int BASE = 42;

static int work(int x) {
return BASE + x;
}

/*
* Every benchmark requires control. We do a trivial control for our benchmarks
* by checking the benchmark costs are growing linearly with increased task size.
* If it doesn't, then something wrong is happening.
*/

@Param({"1", "10", "100", "1000"})
int size;

int[] xs;

@Setup
public void setup() {
xs = new int[size];
for (int c = 0; c < size; c++) {
xs[c] = c;
}
}

/*
* First, the obviously wrong way: "saving" the result into a local variable would not
* work. A sufficiently smart compiler will inline work(), and figure out only the last
* work() call needs to be evaluated. Indeed, if you run it with varying $size, the score
* will stay the same!
*/

@Benchmark
public int measureWrong_1() {
int acc = 0;
for (int x : xs) {
acc = work(x);
}
return acc;
}

/*
* Second, another wrong way: "accumulating" the result into a local variable. While
* it would force the computation of each work() method, there are software pipelining
* effects in action, that can merge the operations between two otherwise distinct work()
* bodies. This will obliterate the benchmark setup.
*
* In this example, HotSpot does the unrolled loop, merges the $BASE operands into a single
* addition to $acc, and then does a bunch of very tight stores of $x-s. The final performance
* depends on how much of the loop unrolling happened *and* how much data is available to make
* the large strides.
*/

@Benchmark
public int measureWrong_2() {
int acc = 0;
for (int x : xs) {
acc += work(x);
}
return acc;
}

/*
* Now, let's see how to measure these things properly. A very straight-forward way to
* break the merging is to sink each result to Blackhole. This will force runtime to compute
* every work() call in full. (We would normally like to care about several concurrent work()
* computations at once, but the memory effects from Blackhole.consume() prevent those optimization
* on most runtimes).
*/

@Benchmark
public void measureRight_1(Blackhole bh) {
for (int x : xs) {
bh.consume(work(x));
}
}

/*
* DANGEROUS AREA, PLEASE READ THE DESCRIPTION BELOW.
*
* Sometimes, the cost of sinking the value into a Blackhole is dominating the nano-benchmark score.
* In these cases, one may try to do a make-shift "sinker" with non-inlineable method. This trick is
* *very* VM-specific, and can only be used if you are verifying the generated code (that's a good
* strategy when dealing with nano-benchmarks anyway).
*
* You SHOULD NOT use this trick in most cases. Apply only where needed.
*/

@Benchmark
public void measureRight_2() {
for (int x : xs) {
sink(work(x));
}
}

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public static void sink(int v) {
// IT IS VERY IMPORTANT TO MATCH THE SIGNATURE TO AVOID AUTOBOXING.
// The method intentionally does nothing.
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_34_SafeLooping.class.getSimpleName())
.forks(3)
.build();

new Runner(opt).run();
}

}

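This sample is an addendum to sample 11. From sample 11 we know that a benchmark method should not loop by hand and that repetition should be left to JMH at the method-invocation level. Sometimes a loop is unavoidable, for example when the benchmark iterates over a list of rows returned by a database query; the loop is then an inseparable part of the benchmark. The code above shows wrong and right ways to handle this. A plain loop is wrong: the JVM inlines, infers and simplifies until the code no longer does the work you meant to measure. The most convenient fix is to use a Blackhole inside the loop. If the Blackhole call dominates the benchmark time, the result is still not right, and you can then consider an empty non-inlined method in place of the Blackhole. That trick is very VM-specific and should be used only when necessary; please read the comments in the code above carefully.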
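Why the two "wrong" loops are fragile becomes clearer once you write down what they are mathematically equivalent to. The sketch below does this by hand in plain Java; LoopCollapseSketch and its methods are hypothetical, and the rewritten forms are only my illustration of the kind of simplification an optimizer is allowed to make, not a dump of what HotSpot actually emits.

```java
final class LoopCollapseSketch {
    static final int BASE = 42;

    static int work(int x) {
        return BASE + x;
    }

    /** What measureWrong_2 computes as written: one work() call per element. */
    static int accumulate(int[] xs) {
        int acc = 0;
        for (int x : xs) {
            acc += work(x);
        }
        return acc;
    }

    /** An equivalent form in which BASE is no longer added once per element. */
    static int accumulateMerged(int[] xs) {
        int sum = 0;
        for (int x : xs) {
            sum += x;
        }
        return BASE * xs.length + sum;
    }

    /** What measureWrong_1 boils down to: only the last element matters. */
    static int lastOnly(int[] xs) {
        return xs.length == 0 ? 0 : work(xs[xs.length - 1]);
    }
}
```

Since the results are identical, nothing forces the runtime to execute work() once per element in isolation, and that is what sinking each result into a Blackhole is meant to enforce.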

35 Profilers

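JMH ships with some very handy profilers that help you understand the details of a benchmark. They are no substitute for full-fledged external profilers, but in many cases they let you dig into benchmark behaviour quickly. Fast turnaround matters a lot when you are repeatedly tuning the benchmark code itself. The sample explains the output of many profilers and is rather long, so please read it directly on Github.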

36 Branch Prediction

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(5)
@State(Scope.Benchmark)
public class JMHSample_36_BranchPrediction {

/*
* This sample serves as a warning against regular data sets.
*
* It is very tempting to present a regular data set to benchmark, either due to
* naive generation strategy, or just from feeling better about regular data sets.
* Unfortunately, it frequently backfires: the regular datasets are known to be
* optimized well by software and hardware. This example exploits one of these
* optimizations: branch prediction.
*
* Imagine our benchmark selects the branch based on the array contents, as
* we are streaming through it:
*/

private static final int COUNT = 1024 * 1024;

private byte[] sorted;
private byte[] unsorted;

@Setup
public void setup() {
sorted = new byte[COUNT];
unsorted = new byte[COUNT];
Random random = new Random(1234);
random.nextBytes(sorted);
random.nextBytes(unsorted);
Arrays.sort(sorted);
}

@Benchmark
@OperationsPerInvocation(COUNT)
public void sorted(Blackhole bh1, Blackhole bh2) {
for (byte v : sorted) {
if (v > 0) {
bh1.consume(v);
} else {
bh2.consume(v);
}
}
}

@Benchmark
@OperationsPerInvocation(COUNT)
public void unsorted(Blackhole bh1, Blackhole bh2) {
for (byte v : unsorted) {
if (v > 0) {
bh1.consume(v);
} else {
bh2.consume(v);
}
}
}

/*
There is a substantial difference in performance for these benchmarks!
It is explained by good branch prediction in "sorted" case, and branch mispredicts in "unsorted"
case. -prof perfnorm conveniently highlights that, with larger "branch-misses", and larger "CPI"
for "unsorted" case:
Benchmark                                              Mode  Cnt   Score   Error  Units
JMHSample_36_BranchPrediction.sorted                   avgt   25   2.160 ± 0.049  ns/op
JMHSample_36_BranchPrediction.sorted:·CPI              avgt    5   0.286 ± 0.025   #/op
JMHSample_36_BranchPrediction.sorted:·branch-misses    avgt    5  ≈ 10⁻⁴           #/op
JMHSample_36_BranchPrediction.sorted:·branches         avgt    5   7.606 ± 1.742   #/op
JMHSample_36_BranchPrediction.sorted:·cycles           avgt    5   8.998 ± 1.081   #/op
JMHSample_36_BranchPrediction.sorted:·instructions     avgt    5  31.442 ± 4.899   #/op
JMHSample_36_BranchPrediction.unsorted                 avgt   25   5.943 ± 0.018  ns/op
JMHSample_36_BranchPrediction.unsorted:·CPI            avgt    5   0.775 ± 0.052   #/op
JMHSample_36_BranchPrediction.unsorted:·branch-misses  avgt    5   0.529 ± 0.026   #/op  <--- OOPS
JMHSample_36_BranchPrediction.unsorted:·branches       avgt    5   7.841 ± 0.046   #/op
JMHSample_36_BranchPrediction.unsorted:·cycles         avgt    5  24.793 ± 0.434   #/op
JMHSample_36_BranchPrediction.unsorted:·instructions   avgt    5  31.994 ± 2.342   #/op
It is an open question if you want to measure only one of these tests. In many cases, you have to measure
both to get the proper best-case and worst-case estimate!
*/

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(".*" + JMHSample_36_BranchPrediction.class.getSimpleName() + ".*")
.build();

new Runner(opt).run();
}

}

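This sample is about branch prediction, an optimization performed by the CPU rather than by the JVM, and the problem it illustrates is caused mainly by regular data sets. It is tempting to feed a benchmark very regular data, either because the generation strategy is simple or because regular data just feels nicer, but this often backfires. Regular data sets are known to be optimized well by both software and hardware, and branch prediction is one such optimization.

The example builds two byte arrays, one unsorted and one sorted, and runs logically identical benchmark methods on them: loop over the array and execute one of two code blocks depending on whether the element is greater than 0. In the sorted array the same block is executed for every element on either side of the boundary between non-positive and positive values, so the branch is predicted almost perfectly; in the unsorted array it is mispredicted frequently.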
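A crude way to see how different the two data sets look to the branch is to count how often the outcome of v > 0 changes from one element to the next. The sketch below does that in plain Java. BranchFlipSketch and directionChanges are hypothetical names, and the count is only a rough proxy: real predictors can learn patterns that are more complex than "same as last time".

```java
import java.util.Arrays;
import java.util.Random;

final class BranchFlipSketch {
    /** Counts how often the outcome of (v > 0) differs from the previous element's outcome. */
    static int directionChanges(byte[] data) {
        int changes = 0;
        for (int i = 1; i < data.length; i++) {
            if ((data[i] > 0) != (data[i - 1] > 0)) {
                changes++;
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        // Same generation scheme as the sample, with a smaller array.
        byte[] sorted = new byte[1 << 16];
        byte[] unsorted = new byte[1 << 16];
        Random random = new Random(1234);
        random.nextBytes(sorted);
        random.nextBytes(unsorted);
        Arrays.sort(sorted);

        System.out.println("sorted:   " + directionChanges(sorted));
        System.out.println("unsorted: " + directionChanges(unsorted));
    }
}
```

A sorted array containing both signs changes direction exactly once, while random data changes direction on a large fraction of the elements, which is consistent with the branch-misses figures quoted in the sample.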

37 Cache Access

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.Random;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(5)
@State(Scope.Benchmark)
public class JMHSample_37_CacheAccess {

    /*
     * This sample serves as a warning against subtle differences in cache access patterns.
     *
     * Many performance differences may be explained by the way tests are accessing memory.
     * In the example below, we walk the matrix either row-first, or col-first:
     */

    private final static int COUNT = 4096;
    private final static int MATRIX_SIZE = COUNT * COUNT;

    private int[][] matrix;

    @Setup
    public void setup() {
        matrix = new int[COUNT][COUNT];
        Random random = new Random(1234);
        for (int i = 0; i < COUNT; i++) {
            for (int j = 0; j < COUNT; j++) {
                matrix[i][j] = random.nextInt();
            }
        }
    }

    // Column-first walk: each inner-loop step reads from a different row array.
    @Benchmark
    @OperationsPerInvocation(MATRIX_SIZE)
    public void colFirst(Blackhole bh) {
        for (int c = 0; c < COUNT; c++) {
            for (int r = 0; r < COUNT; r++) {
                bh.consume(matrix[r][c]);
            }
        }
    }

    // Row-first walk: the inner loop stays within a single row array.
    @Benchmark
    @OperationsPerInvocation(MATRIX_SIZE)
    public void rowFirst(Blackhole bh) {
        for (int r = 0; r < COUNT; r++) {
            for (int c = 0; c < COUNT; c++) {
                bh.consume(matrix[r][c]);
            }
        }
    }

    /*
        Notably, colFirst accesses are much slower, and that's not a surprise: Java's multidimensional
        arrays are actually ragged, being one-dimensional arrays of one-dimensional arrays. Therefore,
        pulling n-th element from each of the inner array induces more cache misses, when matrix is large.
        -prof perfnorm conveniently highlights that, with >2 cache misses per one benchmark op:

        Benchmark                                                Mode  Cnt    Score   Error  Units
        JMHSample_37_MatrixCopy.colFirst                         avgt   25    5.306 ± 0.020  ns/op
        JMHSample_37_MatrixCopy.colFirst:·CPI                    avgt    5    0.621 ± 0.011  #/op
        JMHSample_37_MatrixCopy.colFirst:·L1-dcache-load-misses  avgt    5    2.177 ± 0.044  #/op  <-- OOPS
        JMHSample_37_MatrixCopy.colFirst:·L1-dcache-loads        avgt    5   14.804 ± 0.261  #/op
        JMHSample_37_MatrixCopy.colFirst:·LLC-loads              avgt    5    2.165 ± 0.091  #/op
        JMHSample_37_MatrixCopy.colFirst:·cycles                 avgt    5   22.272 ± 0.372  #/op
        JMHSample_37_MatrixCopy.colFirst:·instructions           avgt    5   35.888 ± 1.215  #/op
        JMHSample_37_MatrixCopy.rowFirst                         avgt   25    2.662 ± 0.003  ns/op
        JMHSample_37_MatrixCopy.rowFirst:·CPI                    avgt    5    0.312 ± 0.003  #/op
        JMHSample_37_MatrixCopy.rowFirst:·L1-dcache-load-misses  avgt    5    0.066 ± 0.001  #/op
        JMHSample_37_MatrixCopy.rowFirst:·L1-dcache-loads        avgt    5   14.570 ± 0.400  #/op
        JMHSample_37_MatrixCopy.rowFirst:·LLC-loads              avgt    5    0.002 ± 0.001  #/op
        JMHSample_37_MatrixCopy.rowFirst:·cycles                 avgt    5   11.046 ± 0.343  #/op
        JMHSample_37_MatrixCopy.rowFirst:·instructions           avgt    5   35.416 ± 1.248  #/op

        So, when comparing two different benchmarks, you have to follow up if the difference is caused
        by the memory locality issues.
     */

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(".*" + JMHSample_37_CacheAccess.class.getSimpleName() + ".*")
                .build();

        new Runner(opt).run();
    }

}

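This sample is not about JMH itself. It shows the effect of cache access patterns: many performance differences can be explained by the way the code accesses memory. The sample makes the point by walking a matrix. The two benchmark methods traverse it row-first and column-first, and the column-first traversal is clearly slower (5.306 vs 2.662 ns/op in the results above). The reason is that a Java two-dimensional array is really an array of separate one-dimensional arrays, so reading the n-th element of each row in turn jumps from one row array to the next and causes many more cache misses once the matrix is large.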
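The two access patterns can also be looked at outside JMH. The following is a minimal sketch, not part of the JMH samples; `MatrixWalkSketch` and its method names are hypothetical. It deliberately measures nothing, since hand-rolled timing is exactly what JMH exists to avoid. It only shows that both walks visit the same elements and that each row is a separate `int[]` object.

```java
import java.util.Random;

// Hypothetical stand-alone sketch; not part of the JMH samples.
class MatrixWalkSketch {

    // Builds a square matrix of pseudo-random ints from a fixed seed.
    static int[][] randomMatrix(int size, long seed) {
        int[][] m = new int[size][size];
        Random random = new Random(seed);
        for (int r = 0; r < size; r++) {
            for (int c = 0; c < size; c++) {
                m[r][c] = random.nextInt();
            }
        }
        return m;
    }

    // Row-first: the inner loop stays inside one int[] row.
    static long sumRowFirst(int[][] m) {
        long sum = 0;
        for (int r = 0; r < m.length; r++) {
            for (int c = 0; c < m[r].length; c++) {
                sum += m[r][c];
            }
        }
        return sum;
    }

    // Column-first: every inner-loop step reads from a different int[] row.
    // Assumes a square matrix, as in the sample.
    static long sumColFirst(int[][] m) {
        long sum = 0;
        for (int c = 0; c < m.length; c++) {
            for (int r = 0; r < m.length; r++) {
                sum += m[r][c];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[][] m = randomMatrix(1024, 1234L);
        // Each row is its own array object, so rows need not be adjacent in memory.
        System.out.println("rows are separate objects: " + (m[0] != m[1]));
        System.out.println("same sum either way: " + (sumRowFirst(m) == sumColFirst(m)));
    }
}
```

Both walks do the same amount of arithmetic, so any difference between them in a real measurement comes from the order in which memory is touched.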

38 Per Invoke Setup

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(5)
public class JMHSample_38_PerInvokeSetup {

    /*
     * This example highlights the usual mistake in non-steady-state benchmarks.
     *
     * Suppose we want to test how long it takes to bubble sort an array. Naively,
     * we could make the test that populates an array with random (unsorted) values,
     * and calls sort on it over and over again:
     */

    private void bubbleSort(byte[] b) {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int c = 0; c < b.length - 1; c++) {
                if (b[c] > b[c + 1]) {
                    byte t = b[c];
                    b[c] = b[c + 1];
                    b[c + 1] = t;
                    changed = true;
                }
            }
        }
    }

    // Could be an implicit State instead, but we are going to use it
    // as the dependency in one of the tests below
    @State(Scope.Benchmark)
    public static class Data {

        @Param({"1", "16", "256"})
        int count;

        byte[] arr;

        @Setup
        public void setup() {
            arr = new byte[count];
            Random random = new Random(1234);
            random.nextBytes(arr);
        }
    }

    @Benchmark
    public byte[] measureWrong(Data d) {
        bubbleSort(d.arr);
        return d.arr;
    }

    /*
     * The method above is subtly wrong: it sorts the random array on the first invocation
     * only. Every subsequent call will "sort" the already sorted array. With bubble sort,
     * that operation would be significantly faster!
     *
     * This is how we might *try* to measure it right by making a copy in Level.Invocation
     * setup. However, this is susceptible to the problems described in Level.Invocation
     * Javadocs, READ AND UNDERSTAND THOSE DOCS BEFORE USING THIS APPROACH.
     */

    @State(Scope.Thread)
    public static class DataCopy {
        byte[] copy;

        @Setup(Level.Invocation)
        public void setup2(Data d) {
            copy = Arrays.copyOf(d.arr, d.arr.length);
        }
    }

    @Benchmark
    public byte[] measureNeutral(DataCopy d) {
        bubbleSort(d.copy);
        return d.copy;
    }

    /*
     * In an overwhelming majority of cases, the only sensible thing to do is to suck up
     * the per-invocation setup costs into a benchmark itself. This works well in practice,
     * especially when the payload costs dominate the setup costs.
     */

    @Benchmark
    public byte[] measureRight(Data d) {
        byte[] c = Arrays.copyOf(d.arr, d.arr.length);
        bubbleSort(c);
        return c;
    }

    /*
        Benchmark                                   (count)  Mode  Cnt      Score     Error  Units
        JMHSample_38_PerInvokeSetup.measureWrong          1  avgt   25      2.408 ±   0.011  ns/op
        JMHSample_38_PerInvokeSetup.measureWrong         16  avgt   25      8.286 ±   0.023  ns/op
        JMHSample_38_PerInvokeSetup.measureWrong        256  avgt   25     73.405 ±   0.018  ns/op
        JMHSample_38_PerInvokeSetup.measureNeutral        1  avgt   25     15.835 ±   0.470  ns/op
        JMHSample_38_PerInvokeSetup.measureNeutral       16  avgt   25    112.552 ±   0.787  ns/op
        JMHSample_38_PerInvokeSetup.measureNeutral      256  avgt   25  58343.848 ± 991.202  ns/op
        JMHSample_38_PerInvokeSetup.measureRight          1  avgt   25      6.075 ±   0.018  ns/op
        JMHSample_38_PerInvokeSetup.measureRight         16  avgt   25    102.390 ±   0.676  ns/op
        JMHSample_38_PerInvokeSetup.measureRight        256  avgt   25  58812.411 ± 997.951  ns/op

        We can clearly see that "measureWrong" provides a very weird result: it "sorts" way too fast.
        "measureNeutral" is neither good nor bad: while it prepares the data for each invocation correctly,
        the timing overheads are clearly visible. These overheads can be overwhelming, depending on
        the thread count and/or OS flavor.
     */


    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(".*" + JMHSample_38_PerInvokeSetup.class.getSimpleName() + ".*")
                .build();

        new Runner(opt).run();
    }

}

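The last sample uses bubble sort as its example. Sorting modifies the array in place, and the cost of bubble sort depends on the initial order of the array, so sorting the same array repeatedly is not a steady-state operation. measureWrong, which simply sorts the same array over and over, is clearly wrong: only the first invocation sorts random data, and every later one "sorts" an array that is already sorted. The obvious next idea is a Level.Invocation @Setup method that copies the array before every call (measureNeutral). This is logically correct, but Level.Invocation has to be used with care. Read its Javadoc before using it, as earlier samples also point out; JMH does not recommend this form. Finally, measureRight copies the array inside the benchmark method itself. It may not look as clean, but when the payload cost dominates the copy cost it is the approach that works best in practice. The sample closes with the measured results for all three methods.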
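To make the flaw in measureWrong concrete without running a benchmark, the sketch below counts the passes bubble sort makes over the array. It is a hypothetical stand-alone class, not part of the JMH samples: `BubbleSortPassesSketch` and `bubbleSortPasses` are names invented here, and the sorting loop is the sample's bubbleSort with a pass counter added.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical stand-alone sketch; not part of the JMH samples.
class BubbleSortPassesSketch {

    // Same algorithm as bubbleSort in the sample, but returns how many
    // passes over the array were needed before no swap occurred.
    static int bubbleSortPasses(byte[] b) {
        int passes = 0;
        boolean changed = true;
        while (changed) {
            changed = false;
            passes++;
            for (int c = 0; c < b.length - 1; c++) {
                if (b[c] > b[c + 1]) {
                    byte t = b[c];
                    b[c] = b[c + 1];
                    b[c + 1] = t;
                    changed = true;
                }
            }
        }
        return passes;
    }

    public static void main(String[] args) {
        byte[] arr = new byte[256];
        new Random(1234L).nextBytes(arr);
        byte[] original = Arrays.copyOf(arr, arr.length);

        // Like measureWrong: the second call sees an already sorted array.
        int first = bubbleSortPasses(arr);
        int second = bubbleSortPasses(arr); // always 1: a single pass with no swaps

        // Like measureRight: sort a fresh copy, so the work is the same every time.
        int onCopy = bubbleSortPasses(Arrays.copyOf(original, original.length));

        System.out.println("first call on shared array:  " + first + " passes");
        System.out.println("second call on shared array: " + second + " pass");
        System.out.println("call on a fresh copy:        " + onCopy + " passes");
    }
}
```

Sorting a fresh copy repeats the work of the first call every time, whereas re-sorting the shared array collapses to a single pass. That is why measureWrong reports such small numbers.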