TCP/IP Device Layer MP Support
In addition to the TCP/IP performance enhancements described in the section TCP/IP Stack Performance Improvements, support was added to TCP/IP 440 to allow individual device drivers to be associated with particular virtual processors. Prior to this release, TCP/IP VM had no virtual MP support and, as a result, a given TCP/IP stack virtual machine could run on only one real processor at a time. With TCP/IP 440, device-specific processing can be done on virtual processors other than the base processor. This offloads some processing from the base processor, which is used by the remaining stack functions, and increases the amount of work the stack virtual machine can handle before the base processor becomes fully utilized. A new option, CPU, on the DEVICE configuration statement designates the CPU where the driver for a particular device is dispatched. If no CPU is specified, or if the designated CPU is not in the configuration, the base processor, which must be CPU 0, is used.
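For illustration, the following is a minimal sketch of how the CPU operand might be coded on DEVICE statements in PROFILE TCPIP. The device names, link names, device numbers, and operand placement shown here are assumptions chosen for the example and are not the configurations used in the measurements; refer to the TCP/IP level 440 planning and customization documentation for the exact DEVICE and LINK syntax.

   ; QDIO (OSD) device whose driver is to be dispatched on virtual CPU 1
   DEVICE QDIODEV OSD 7100 NONROUTER CPU 1
   LINK   QDIOLNK QDIOETHERNET QDIODEV

   ; HiperSockets device left on the base processor (CPU 0 is the default)
   DEVICE HIPDEV HIPERS 7300
   LINK   HIPLNK QDIOIP HIPDEV

With a definition like the first one, the device-specific processing for QDIODEV would run on virtual CPU 1, while the remaining stack functions continue to run on the base processor, CPU 0.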
This section summarizes the results of a performance evaluation comparing TCP/IP 440 with and without the device layer MP support active.
An internal tool was used to drive connect-request-response (CRR) and streaming workloads. In the CRR workload, the client connected, sent 64 bytes to the server, the server responded with 8K, and the client then disconnected. In the streaming workload, the client sent 20 bytes to the server and the server responded with 20MB.
The measurements were done on a 2064-109 using 2 LPARs. Each LPAR had 3 dedicated processors, 1GB of central storage, and 2GB of expanded storage. Each LPAR had an equal number of client and server virtual machines defined, and the clients on one LPAR communicated with the servers on the other LPAR.
Both Gigabit Ethernet (QDIO) and HiperSockets were used for communication between the TCP/IP stacks running on the two LPARs. For the QDIO measurements, maximum transmission unit (MTU) sizes of 1492 and 8992 were used. For the HiperSockets measurements, MTU sizes of 8K, 16K, 32K, and 56K were used. Performance runs were made using 1, 10, 20, and 50 client-server pairs for each workload.
Each QDIO and HiperSockets scenario was run twice: once with CPU 0 and once with CPU 1 specified for the device on the DEVICE configuration statement of the TCP/IP stack on each LPAR. A complete set of runs, consisting of 3 trials for each case, was done. CP monitor data was captured for one of the LPARs during the measurement and reduced using VMPRF. In addition, Performance Toolkit for VM data was captured for the same LPAR and used to report the CPU utilization of each virtual CPU.
Results:
The following tables compare results on TCP/IP 440 with (CPU 1) and without (CPU 0) the device layer MP support active for a set of the measurements taken. The MB/sec (megabytes per second) and trans/sec (transactions per second) values were supplied by the workload driver and show the throughput rate. All other values are from CP monitor data, derived from CP monitor data, or from Performance Toolkit for VM data. The %diff rows show the percentage change of each CPU 1 result relative to the corresponding CPU 0 result. The utilization values are percentages of a single processor; because the stack virtual machine uses two virtual CPUs when the device layer MP support is active, tot_cpu_util can exceed 100 in those cases.

Table 1. QDIO - Streaming 1492

Client/Server pairs 1          10         20         50
-------------------------------------------------------
CPU 0 - runid       qf0sn013   qf0sn101   qf0sn201   qf0sn502
MB/sec              61.31      74.83      77.05      77.06
tot_cpu_msec/MB     12.77      17.08      18.92      20.71
emul_msec/MB        8.66       11.43      12.77      14.05
cp_msec/MB          4.11       5.65       6.15       6.66
tot_cpu_util        48.97      68.97      74.62      79.52
vcpu0_util          48.70      69.00      74.58      79.47
vcpu1_util          0.00       0.00       0.00       0.00
-------------------------------------------------------
CPU 1 - runid       qf1sn013   qf1sn103   qf1sn202   qf1sn502
MB/sec              69.32      77.44      80.41      83.87
tot_cpu_msec/MB     14.98      19.21      20.37      21.35
emul_msec/MB        10.13      13.09      13.92      14.70
cp_msec/MB          4.85       6.12       6.45       6.65
tot_cpu_util        67.50      85.90      92.31      98.81
vcpu0_util          NA         NA         68.95      75.87
vcpu1_util          NA         NA         22.85      23.33
-------------------------------------------------------
%diff MB/sec        13.06      3.49       4.36       8.84
%diff tot_cpu_msec  17.25      12.51      7.65       3.11
%diff emul_msec     16.93      14.60      8.97       4.61
%diff cp_msec       17.93      8.28       4.92       -0.06

Table 2. QDIO - Streaming 8992

Client/Server pairs 1          10         20         50
-------------------------------------------------------
CPU 0 - runid       qf0sj011   qf0sj102   qf0sj202   qf0sj501
MB/sec              59.79      98.04      98.14      96.06
tot_cpu_msec/MB     10.23      12.21      12.41      12.87
emul_msec/MB        6.37       7.53       7.61       7.90
cp_msec/MB          3.86       4.68       4.80       4.97
tot_cpu_util        36.36      66.92      68.97      70.51
vcpu0_util          36.04      66.78      68.74      70.56
vcpu1_util          0.00       0.00       0.00       0.00
-------------------------------------------------------
CPU 1 - runid       qf1sj012   qf1sj102   qf1sj202   qf1sj502
MB/sec              67.06      105.90     108.18     106.90
tot_cpu_msec/MB     12.98      14.59      12.87      13.69
emul_msec/MB        8.10       9.21       8.13       8.67
cp_msec/MB          4.88       5.38       4.74       5.02
tot_cpu_util        53.59      90.26      81.90      85.78
vcpu0_util          38.53      65.93      62.48      70.63
vcpu1_util          15.70      24.82      19.88      20.43
-------------------------------------------------------
%diff MB/sec        12.16      8.02       10.23      11.28
%diff tot_cpu_msec  26.75      19.49      3.68       6.44
%diff emul_msec     27.07      22.31      6.75       9.75
%diff cp_msec       26.21      14.97      -1.19      1.16

Table 3. QDIO - CRR 1492

Client/Server pairs 1          10         20         50
-------------------------------------------------------
CPU 0 - runid       qf0cn013   qf0cn102   qf0cn203   qf0cn503
trans/sec           148.02     443.49     535.09     706.13
tot_cpu_msec/trans  1.71       1.67       1.70       1.82
emul_msec/trans     1.18       1.16       1.18       1.30
cp_msec/trans       0.53       0.51       0.52       0.52
tot_cpu_util        10.00      23.33      26.94      36.15
vcpu0_util          9.84       23.62      26.92      36.02
vcpu1_util          0.00       0.00       0.00       0.00
-------------------------------------------------------
CPU 1 - runid       qf1cn013   qf1cn102   qf1cn202   qf1cn502
trans/sec           153.33     436.44     556.68     711.28
tot_cpu_msec/trans  2.02       1.85       1.86       2.02
emul_msec/trans     1.35       1.27       1.28       1.43
cp_msec/trans       0.67       0.58       0.58       0.59
tot_cpu_util        13.85      30.77      38.06      50.79
vcpu0_util          9.05       22.25      28.20      38.28
vcpu1_util          4.48       7.73       9.22       11.27
-------------------------------------------------------
%diff trans/sec     3.59       -1.59      4.03       0.73
%diff tot_cpu_msec  18.37      11.53      9.45       10.85
%diff emul_msec     14.85      9.93       8.42       9.66
%diff cp_msec       26.24      15.16      11.79      13.80

Table 4. QDIO - CRR 8992

Client/Server pairs 1          10         20         50
-------------------------------------------------------
CPU 0 - runid       qf0cj013   qf0cj101   qf0cj201   qf0cj502
trans/sec           146.65     453.78     522.41     716.41
tot_cpu_msec/trans  1.27       1.70       1.86       1.68
emul_msec/trans     0.88       1.31       1.46       1.27
cp_msec/trans       0.39       0.39       0.40       0.41
tot_cpu_util        10.00      23.33      26.94      36.15
vcpu0_util          9.84       23.62      26.92      36.02
vcpu1_util          0.00       0.00       0.00       0.00
-------------------------------------------------------
CPU 1 - runid       qf1cj012   qf1cj102   qf1cj201   qf1cj501
trans/sec           155.80     465.32     587.90     781.14
tot_cpu_msec/trans  1.40       1.95       2.03       1.85
emul_msec/trans     0.94       1.50       1.57       1.37
cp_msec/trans       0.46       0.45       0.46       0.48
tot_cpu_util        10.00      49.74      64.10      64.87
vcpu0_util          7.09       47.16      50.56      53.02
vcpu1_util          3.07       6.04       7.98       10.54
-------------------------------------------------------
%diff trans/sec     6.24       2.54       12.54      9.04
%diff tot_cpu_msec  10.83      14.60      9.43       10.47
%diff emul_msec     7.26       14.27      7.33       7.70
%diff cp_msec       18.90      15.70      17.19      19.13

Table 5. HiperSocket - Streaming 8K

Client/Server pairs 1          10         20         50
-------------------------------------------------------
CPU 0 - runid       hf0sj011   hf0sj103   hf0sj201   hf0sj501
MB/sec              139.12     129.58     118.43     104.12
tot_cpu_msec/MB     7.57       10.17      10.77      11.90
emul_msec/MB        4.51       6.16       6.54       7.38
cp_msec/MB          3.06       4.01       4.23       4.52
tot_cpu_util        73.30      78.61      75.56      72.86
vcpu0_util          73.45      78.48      75.40      72.66
vcpu1_util          0.00       0.00       0.00       0.00
-------------------------------------------------------
CPU 1 - runid       hf1sj012   hf1sj103   hf1sj201   hf1sj501
MB/sec              163.03     160.08     138.46     127.96
tot_cpu_msec/MB     8.56       9.65       10.77      11.67
emul_msec/MB        5.23       5.62       6.33       6.96
cp_msec/MB          3.33       4.03       4.44       4.71
tot_cpu_util        104.44     115.13     109.49     106.15
vcpu0_util          75.74      89.86      87.14      86.04
vcpu1_util          28.26      25.04      21.90      20.20
-------------------------------------------------------
%diff MB/sec        17.19      23.54      16.91      22.90
%diff tot_cpu_msec  13.05      -5.04      0.02       -1.88
%diff emul_msec     15.96      -8.71      -3.19      -5.60
%diff cp_msec       8.77       0.60       5.00       4.17

Table 6. HiperSocket - Streaming 56K

Client/Server pairs 1          10         20         50
-------------------------------------------------------
CPU 0 - runid       hf0s5012   hf0s5102   hf0s5201   hf0s5501
MB/sec              135.67     139.91     131.64     112.84
tot_cpu_msec/MB     7.47       8.51       8.61       8.88
emul_msec/MB        4.40       4.93       4.99       5.21
cp_msec/MB          3.07       3.58       3.62       3.67
tot_cpu_util        68.06      75.90      73.08      66.15
vcpu0_util          68.02      75.88      73.08      66.08
vcpu1_util          0.00       0.00       0.00       0.00
-------------------------------------------------------
CPU 1 - runid       hf1s5013   hf1s5101   hf1s5203   hf1s5503
MB/sec              168.47     160.55     144.95     130.51
tot_cpu_msec/MB     7.90       9.21       10.03      11.27
emul_msec/MB        4.61       5.19       5.75       6.51
cp_msec/MB          3.29       4.02       4.28       4.76
tot_cpu_util        96.67      108.61     105.00     103.08
vcpu0_util          72.52      87.86      86.40      85.63
vcpu1_util          23.57      20.04      18.22      17.10
-------------------------------------------------------
%diff MB/sec        24.18      14.75      10.11      15.66
%diff tot_cpu_msec  5.79       8.22       16.53      26.84
%diff emul_msec     4.81       5.33       15.28      24.84
%diff cp_msec       7.18       12.19      18.23      29.69

Table 7. HiperSocket - CRR 8K

Client/Server pairs 1          10         20         50
-------------------------------------------------------
CPU 0 - runid       hf0cj013   hf0cj103   hf0cj201   hf0cj501
trans/sec           175.93     457.63     546.08     704.48
tot_cpu_msec/trans  1.33       1.56       1.47       1.74
emul_msec/trans     0.94       1.17       1.08       1.32
cp_msec/trans       0.39       0.39       0.39       0.42
tot_cpu_util        8.97       31.94      30.83      48.72
vcpu0_util          8.62       31.60      30.62      44.30
vcpu1_util          0.00       0.00       0.00       0.00
-------------------------------------------------------
CPU 1 - runid       hf1cj012   hf1cj101   hf1cj202   hf1cj501
trans/sec           185.14     486.05     601.41     789.53
tot_cpu_msec/trans  1.61       2.02       1.95       1.76
emul_msec/trans     1.12       1.56       1.51       1.30
cp_msec/trans       0.49       0.46       0.44       0.46
tot_cpu_util        14.44      53.85      62.22      61.11
vcpu0_util          10.78      47.28      53.28      47.14
vcpu1_util          3.26       5.74       7.26       9.96
-------------------------------------------------------
%diff trans/sec     5.24       6.21       10.13      12.07
%diff tot_cpu_msec  20.61      29.76      32.81      1.50
%diff emul_msec     19.21      33.08      39.66      -1.53
%diff cp_msec       23.95      19.69      13.82      11.08

Table 8. HiperSocket - CRR 56K

Client/Server pairs 1          10         20         50
-------------------------------------------------------
CPU 0 - runid       hf0c5013   hf0c5101   hf0c5201   hf0c5502
trans/sec           174.87     460.10     544.76     706.31
tot_cpu_msec/trans  1.32       1.61       1.45       1.65
emul_msec/trans     0.94       1.22       1.05       1.24
cp_msec/trans       0.38       0.39       0.40       0.41
tot_cpu_util        9.17       34.17      28.80      44.62
vcpu0_util          8.62       32.16      28.12      43.97
vcpu1_util          0.00       0.00       0.00       0.00
-------------------------------------------------------
CPU 1 - runid       hf1c5012   hf1c5101   hf1c5201   hf1c5503
trans/sec           185.71     485.49     594.50     787.43
tot_cpu_msec/trans  1.66       1.96       2.00       1.80
emul_msec/trans     1.18       1.52       1.54       1.33
cp_msec/trans       0.48       0.44       0.46       0.47
tot_cpu_util        15.28      52.78      62.56      63.59
vcpu0_util          11.38      47.88      46.60      52.46
vcpu1_util          3.56       5.89       7.80       10.20
-------------------------------------------------------
%diff trans/sec     6.20       5.52       9.13       11.49
%diff tot_cpu_msec  25.96      21.63      38.50      10.94
%diff emul_msec     24.98      24.67      47.10      7.51
%diff cp_msec       28.40      12.14      15.81      13.74

In general, the cost per MB or per transaction is higher with the device layer MP support active because of the overhead of the virtual MP support. However, the throughput, as reported by MB/sec or trans/sec, is greater in almost all of the cases measured because the stack virtual machine can now use more than one processor.

In addition, overall between 10% and 30% of the workload is moved from CPU 0 (the base processor) to CPU 1. The work moved from CPU 0 to CPU 1 represents the device-specific processing, which can now be done in parallel with the stack functions that must remain on the base processor. The best case above is HiperSocket - Streaming with an 8K MTU size (Table 5). In that case the percentage of the workload moved from CPU 0 to CPU 1 ranged from 19% for 50 client-server pairs to 27% for one client-server pair. In addition, throughput increased by more than 16% in all cases, while the percent change in CPU consumption per MB ranged from an increase of just over 13% with one client-server pair to a decrease of over 5% with 10 client-server pairs.