TCP/IP Device Layer MP Support
In addition to the TCP/IP performance enhancements described in the section TCP/IP Stack Performance Improvements, support was added to TCP/IP 440 to allow individual device drivers to be associated with particular virtual processors. Prior to this release, TCP/IP VM had no virtual MP support and, as a result, a given TCP/IP stack virtual machine could run on only one real processor at a time. With TCP/IP 440, device-specific processing can be done on virtual processors other than the base processor. This offloads some processing from the base processor, which is used by the remaining stack functions, and increases the amount of work the stack virtual machine can handle before the base processor becomes fully utilized. A new option, CPU, on the DEVICE configuration statement designates the CPU where the driver for a particular device is dispatched. If no CPU is specified, or if the designated CPU is not in the configuration, the base processor, which must be CPU 0, is used.
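For illustration, the following is a minimal sketch of how the CPU operand might be coded on DEVICE statements in PROFILE TCPIP. The device names, link names, device numbers, and operand placement shown here are assumptions chosen for the example and are not the configurations used in the measurements; refer to the TCP/IP level 440 planning and customization documentation for the exact DEVICE and LINK syntax.

   ; QDIO (OSD) device whose driver is to be dispatched on virtual CPU 1
   DEVICE QDIODEV OSD 7100 NONROUTER CPU 1
   LINK   QDIOLNK QDIOETHERNET QDIODEV

   ; HiperSockets device left on the base processor (CPU 0 is the default)
   DEVICE HIPDEV HIPERS 7300
   LINK   HIPLNK QDIOIP HIPDEV

With a definition like the first one, the device-specific processing for QDIODEV would run on virtual CPU 1, while the remaining stack functions continue to run on the base processor, CPU 0.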
This section summarizes the results of a performance evaluation comparing TCP/IP 440 with and without the device layer MP support active.
An internal tool was used to drive connect-request-response (CRR) and streaming workloads. In the CRR workload, the client connected, sent 64 bytes to the server, the server responded with 8K, and the client then disconnected. In the streaming workload, the client sent 20 bytes to the server and the server responded with 20MB.
The measurements were done on a 2064-109 using 2 LPARs. Each LPAR had 3 dedicated processors, 1GB of central storage, and 2GB of expanded storage. Each LPAR had an equal number of client and server virtual machines defined, and the clients on one LPAR communicated with the servers on the other LPAR.
Both Gigabit Ethernet (QDIO) and HiperSockets were used for communication between the TCP/IP stacks running on the two LPARs. For the QDIO measurements, maximum transmission unit (MTU) sizes of 1492 and 8992 were used. For the HiperSockets measurements, MTU sizes of 8K, 16K, 32K, and 56K were used. Performance runs were made using 1, 10, 20, and 50 client-server pairs for each workload.
Each QDIO and HiperSockets scenario was run twice: once with CPU 0 and once with CPU 1 specified for the device on the DEVICE configuration statement of the TCP/IP stack on each LPAR. A complete set of runs, consisting of 3 trials for each case, was done. CP monitor data was captured for one of the LPARs during the measurement and reduced using VMPRF. In addition, Performance Toolkit for VM data was captured for the same LPAR and used to report the CPU utilization of each virtual CPU.
Results:
The following tables compare results on TCP/IP 440 with (CPU 1) and without (CPU 0) the device layer MP support active for a set of the measurements taken. The MB/sec (megabytes per second) and trans/sec (transactions per second) values were supplied by the workload driver and show the throughput rate. All other values are from CP monitor data, derived from CP monitor data, or from Performance Toolkit for VM data. The %diff rows show the percentage change of each CPU 1 result relative to the corresponding CPU 0 result. The utilization values are percentages of a single processor; because the stack virtual machine uses two virtual CPUs when the device layer MP support is active, tot_cpu_util can exceed 100 in those cases.

Table 1. QDIO - Streaming 1492

Client/Server pairs 1          10         20         50
-------------------------------------------------------
CPU 0 - runid       qf0sn013   qf0sn101   qf0sn201   qf0sn502
MB/sec              61.31      74.83      77.05      77.06
tot_cpu_msec/MB     12.77      17.08      18.92      20.71
emul_msec/MB        8.66       11.43      12.77      14.05
cp_msec/MB          4.11       5.65       6.15       6.66
tot_cpu_util        48.97      68.97      74.62      79.52
vcpu0_util          48.70      69.00      74.58      79.47
vcpu1_util          0.00       0.00       0.00       0.00
-------------------------------------------------------
CPU 1 - runid       qf1sn013   qf1sn103   qf1sn202   qf1sn502
MB/sec              69.32      77.44      80.41      83.87
tot_cpu_msec/MB     14.98      19.21      20.37      21.35
emul_msec/MB        10.13      13.09      13.92      14.70
cp_msec/MB          4.85       6.12       6.45       6.65
tot_cpu_util        67.50      85.90      92.31      98.81
vcpu0_util          NA         NA         68.95      75.87
vcpu1_util          NA         NA         22.85      23.33
-------------------------------------------------------
%diff MB/sec        13.06      3.49       4.36       8.84
%diff tot_cpu_msec  17.25      12.51      7.65       3.11
%diff emul_msec     16.93      14.60      8.97       4.61
%diff cp_msec       17.93      8.28       4.92       -0.06

Table 2. QDIO - Streaming 8992

Client/Server pairs 1          10         20         50
-------------------------------------------------------
CPU 0 - runid       qf0sj011   qf0sj102   qf0sj202   qf0sj501
MB/sec              59.79      98.04      98.14      96.06
tot_cpu_msec/MB     10.23      12.21      12.41      12.87
emul_msec/MB        6.37       7.53       7.61       7.90
cp_msec/MB          3.86       4.68       4.80       4.97
tot_cpu_util        36.36      66.92      68.97      70.51
vcpu0_util          36.04      66.78      68.74      70.56
vcpu1_util          0.00       0.00       0.00       0.00
-------------------------------------------------------
CPU 1 - runid       qf1sj012   qf1sj102   qf1sj202   qf1sj502
MB/sec              67.06      105.90     108.18     106.90
tot_cpu_msec/MB     12.98      14.59      12.87      13.69
emul_msec/MB        8.10       9.21       8.13       8.67
cp_msec/MB          4.88       5.38       4.74       5.02
tot_cpu_util        53.59      90.26      81.90      85.78
vcpu0_util          38.53      65.93      62.48      70.63
vcpu1_util          15.70      24.82      19.88      20.43
-------------------------------------------------------
%diff MB/sec        12.16      8.02       10.23      11.28
%diff tot_cpu_msec  26.75      19.49      3.68       6.44
%diff emul_msec     27.07      22.31      6.75       9.75
%diff cp_msec       26.21      14.97      -1.19      1.16

Table 3. QDIO - CRR 1492

Client/Server pairs 1          10         20         50
-------------------------------------------------------
CPU 0 - runid       qf0cn013   qf0cn102   qf0cn203   qf0cn503
trans/sec           148.02     443.49     535.09     706.13
tot_cpu_msec/trans  1.71       1.67       1.70       1.82
emul_msec/trans     1.18       1.16       1.18       1.30
cp_msec/trans       0.53       0.51       0.52       0.52
tot_cpu_util        10.00      23.33      26.94      36.15
vcpu0_util          9.84       23.62      26.92      36.02
vcpu1_util          0.00       0.00       0.00       0.00
-------------------------------------------------------
CPU 1 - runid       qf1cn013   qf1cn102   qf1cn202   qf1cn502
trans/sec           153.33     436.44     556.68     711.28
tot_cpu_msec/trans  2.02       1.85       1.86       2.02
emul_msec/trans     1.35       1.27       1.28       1.43
cp_msec/trans       0.67       0.58       0.58       0.59
tot_cpu_util        13.85      30.77      38.06      50.79
vcpu0_util          9.05       22.25      28.20      38.28
vcpu1_util          4.48       7.73       9.22       11.27
-------------------------------------------------------
%diff trans/sec     3.59       -1.59      4.03       0.73
%diff tot_cpu_msec  18.37      11.53      9.45       10.85
%diff emul_msec     14.85      9.93       8.42       9.66
%diff cp_msec       26.24      15.16      11.79      13.80

Table 4. QDIO - CRR 8992

Client/Server pairs 1          10         20         50
-------------------------------------------------------
CPU 0 - runid       qf0cj013   qf0cj101   qf0cj201   qf0cj502
trans/sec           146.65     453.78     522.41     716.41
tot_cpu_msec/trans  1.27       1.70       1.86       1.68
emul_msec/trans     0.88       1.31       1.46       1.27
cp_msec/trans       0.39       0.39       0.40       0.41
tot_cpu_util        10.00      23.33      26.94      36.15
vcpu0_util          9.84       23.62      26.92      36.02
vcpu1_util          0.00       0.00       0.00       0.00
-------------------------------------------------------
CPU 1 - runid       qf1cj012   qf1cj102   qf1cj201   qf1cj501
trans/sec           155.80     465.32     587.90     781.14
tot_cpu_msec/trans  1.40       1.95       2.03       1.85
emul_msec/trans     0.94       1.50       1.57       1.37
cp_msec/trans       0.46       0.45       0.46       0.48
tot_cpu_util        10.00      49.74      64.10      64.87
vcpu0_util          7.09       47.16      50.56      53.02
vcpu1_util          3.07       6.04       7.98       10.54
-------------------------------------------------------
%diff trans/sec     6.24       2.54       12.54      9.04
%diff tot_cpu_msec  10.83      14.60      9.43       10.47
%diff emul_msec     7.26       14.27      7.33       7.70
%diff cp_msec       18.90      15.70      17.19      19.13

Table 5. HiperSocket - Streaming 8K

Client/Server pairs 1          10         20         50
-------------------------------------------------------
CPU 0 - runid       hf0sj011   hf0sj103   hf0sj201   hf0sj501
MB/sec              139.12     129.58     118.43     104.12
tot_cpu_msec/MB     7.57       10.17      10.77      11.90
emul_msec/MB        4.51       6.16       6.54       7.38
cp_msec/MB          3.06       4.01       4.23       4.52
tot_cpu_util        73.30      78.61      75.56      72.86
vcpu0_util          73.45      78.48      75.40      72.66
vcpu1_util          0.00       0.00       0.00       0.00
-------------------------------------------------------
CPU 1 - runid       hf1sj012   hf1sj103   hf1sj201   hf1sj501
MB/sec              163.03     160.08     138.46     127.96
tot_cpu_msec/MB     8.56       9.65       10.77      11.67
emul_msec/MB        5.23       5.62       6.33       6.96
cp_msec/MB          3.33       4.03       4.44       4.71
tot_cpu_util        104.44     115.13     109.49     106.15
vcpu0_util          75.74      89.86      87.14      86.04
vcpu1_util          28.26      25.04      21.90      20.20
-------------------------------------------------------
%diff MB/sec        17.19      23.54      16.91      22.90
%diff tot_cpu_msec  13.05      -5.04      0.02       -1.88
%diff emul_msec     15.96      -8.71      -3.19      -5.60
%diff cp_msec       8.77       0.60       5.00       4.17

Table 6. HiperSocket - Streaming 56K

Client/Server pairs 1          10         20         50
-------------------------------------------------------
CPU 0 - runid       hf0s5012   hf0s5102   hf0s5201   hf0s5501
MB/sec              135.67     139.91     131.64     112.84
tot_cpu_msec/MB     7.47       8.51       8.61       8.88
emul_msec/MB        4.40       4.93       4.99       5.21
cp_msec/MB          3.07       3.58       3.62       3.67
tot_cpu_util        68.06      75.90      73.08      66.15
vcpu0_util          68.02      75.88      73.08      66.08
vcpu1_util          0.00       0.00       0.00       0.00
-------------------------------------------------------
CPU 1 - runid       hf1s5013   hf1s5101   hf1s5203   hf1s5503
MB/sec              168.47     160.55     144.95     130.51
tot_cpu_msec/MB     7.90       9.21       10.03      11.27
emul_msec/MB        4.61       5.19       5.75       6.51
cp_msec/MB          3.29       4.02       4.28       4.76
tot_cpu_util        96.67      108.61     105.00     103.08
vcpu0_util          72.52      87.86      86.40      85.63
vcpu1_util          23.57      20.04      18.22      17.10
-------------------------------------------------------
%diff MB/sec        24.18      14.75      10.11      15.66
%diff tot_cpu_msec  5.79       8.22       16.53      26.84
%diff emul_msec     4.81       5.33       15.28      24.84
%diff cp_msec       7.18       12.19      18.23      29.69

Table 7. HiperSocket - CRR 8K

Client/Server pairs 1          10         20         50
-------------------------------------------------------
CPU 0 - runid       hf0cj013   hf0cj103   hf0cj201   hf0cj501
trans/sec           175.93     457.63     546.08     704.48
tot_cpu_msec/trans  1.33       1.56       1.47       1.74
emul_msec/trans     0.94       1.17       1.08       1.32
cp_msec/trans       0.39       0.39       0.39       0.42
tot_cpu_util        8.97       31.94      30.83      48.72
vcpu0_util          8.62       31.60      30.62      44.30
vcpu1_util          0.00       0.00       0.00       0.00
-------------------------------------------------------
CPU 1 - runid       hf1cj012   hf1cj101   hf1cj202   hf1cj501
trans/sec           185.14     486.05     601.41     789.53
tot_cpu_msec/trans  1.61       2.02       1.95       1.76
emul_msec/trans     1.12       1.56       1.51       1.30
cp_msec/trans       0.49       0.46       0.44       0.46
tot_cpu_util        14.44      53.85      62.22      61.11
vcpu0_util          10.78      47.28      53.28      47.14
vcpu1_util          3.26       5.74       7.26       9.96
-------------------------------------------------------
%diff trans/sec     5.24       6.21       10.13      12.07
%diff tot_cpu_msec  20.61      29.76      32.81      1.50
%diff emul_msec     19.21      33.08      39.66      -1.53
%diff cp_msec       23.95      19.69      13.82      11.08

Table 8. HiperSocket - CRR 56K

Client/Server pairs 1          10         20         50
-------------------------------------------------------
CPU 0 - runid       hf0c5013   hf0c5101   hf0c5201   hf0c5502
trans/sec           174.87     460.10     544.76     706.31
tot_cpu_msec/trans  1.32       1.61       1.45       1.65
emul_msec/trans     0.94       1.22       1.05       1.24
cp_msec/trans       0.38       0.39       0.40       0.41
tot_cpu_util        9.17       34.17      28.80      44.62
vcpu0_util          8.62       32.16      28.12      43.97
vcpu1_util          0.00       0.00       0.00       0.00
-------------------------------------------------------
CPU 1 - runid       hf1c5012   hf1c5101   hf1c5201   hf1c5503
trans/sec           185.71     485.49     594.50     787.43
tot_cpu_msec/trans  1.66       1.96       2.00       1.80
emul_msec/trans     1.18       1.52       1.54       1.33
cp_msec/trans       0.48       0.44       0.46       0.47
tot_cpu_util        15.28      52.78      62.56      63.59
vcpu0_util          11.38      47.88      46.60      52.46
vcpu1_util          3.56       5.89       7.80       10.20
-------------------------------------------------------
%diff trans/sec     6.20       5.52       9.13       11.49
%diff tot_cpu_msec  25.96      21.63      38.50      10.94
%diff emul_msec     24.98      24.67      47.10      7.51
%diff cp_msec       28.40      12.14      15.81      13.74

In general, the cost per MB or per transaction is higher with the device layer MP support active because of the overhead of the virtual MP support. However, the throughput, as reported by MB/sec or trans/sec, is greater in almost all of the cases measured because the stack virtual machine can now use more than one processor.

In addition, overall between 10% and 30% of the workload is moved from CPU 0 (the base processor) to CPU 1. The work moved from CPU 0 to CPU 1 represents the device-specific processing, which can now be done in parallel with the stack functions that must remain on the base processor. The best case above is HiperSocket - Streaming with an 8K MTU size (Table 5). In that case the percentage of the workload moved from CPU 0 to CPU 1 ranged from 19% for 50 client-server pairs to 27% for one client-server pair. In addition, throughput increased by more than 16% in all cases, while the percent change in CPU consumption per MB ranged from an increase of just over 13% with one client-server pair to a decrease of over 5% with 10 client-server pairs.