Connect a VXLAN-EVPN DC to the Public Cloud the right way

In my latest blog post I ranted about how you should not do cloud connectivity, and specifically about how you should stay miles away from whoever suggests using VXLAN to “extend layer 2”.
Today I want to show you instead how you can actually extend your network into the cloud to allow workload mobility. It’s assumed that your application is “cloud ready” and won’t require layer 2 adjacency with other components.

As part of a customer project I was asked to design a cloud connectivity solution that would allow extending several VRFs into AWS. The requirements were very clear, so let’s list them:

  1. It is required to extend around 15 VRFs into AWS to allow application migrations into the cloud.
  2. The solution needs to be ready for other clouds like Azure or IBM Cloud.
  3. The solution needs to be scalable and able to support additional VRFs without a network redesign.

The high level solution

Simply put, what we did was extend the VXLAN-EVPN overlay into AWS, specifically by making the CSR 1000v a VTEP.
In my specific use case, the customer runs a dual-site VXLAN-EVPN DC with EVPN Multi-Site for the DCI, so we went with Cisco CSRs.
To be honest though, the solution is pretty standard and can run with any vendor.

Building the Underlay

The picture below describes at a high level how the underlay connectivity to AWS works. The only real requirement here is jumbo MTU on whatever WAN links you use:

Building the Overlay

Once loopback reachability is achieved, all we need to do is establish the EVPN control plane between our border leaf and the CSR 1000v, and we are basically done.

Please note that in this blog post I am only discussing conceptual connectivity; there is no reference to redundancy.

Proof of concept and configurations

Of course, everything looks awesome on PowerPoint, but does it work? Well, let’s demo this on GNS3:

In this scenario, we will assume that the spine/leaf environment is already working with OSPF in the underlay and iBGP EVPN as control plane. We will then do the following:

  1. Configure the WAN-RTR and extend OSPF underlay routing
  2. Configure an IPsec tunnel between the on-prem WAN-RTR and the AWS-GW in the cloud
  3. Configure eBGP between the WAN-RTR and AWS-GW to exchange underlay VTEP prefixes (and obviously redistribute those routes into OSPF)
  4. Configure eBGP EVPN between the border leaf and the AWS-GW to extend the EVPN control plane
  5. Configure vrf “PROD” in the cloud and extend it via EVPN

OSPF Routing between Border Leaf and WAN-RTR

!
interface Loopback0
description Router-ID
ip address 10.254.180.13 255.255.255.255
ip ospf network point-to-point
!
interface GigabitEthernet1
mtu 9216
ip unnumbered Loopback0
ip ospf authentication message-digest
ip ospf message-digest-key 1 md5 MyPaSsWoRd
ip ospf network point-to-point
!
router ospf 1
router-id 10.254.180.13
log-adjacency-changes detail
area 0.0.0.10 authentication message-digest
network 10.254.180.13 0.0.0.0 area 0.0.0.10
!
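For reference, the border-leaf side is simply the NX-OS mirror of this config. A minimal sketch, assuming the UNDERLAY OSPF process used later in this post and Ethernet1/2 as the interface facing the WAN-RTR:

!
interface Ethernet1/2
no switchport
mtu 9216
medium p2p
ip unnumbered loopback0
ip ospf authentication message-digest
ip ospf message-digest-key 1 md5 MyPaSsWoRd
ip ospf network point-to-point
ip router ospf UNDERLAY area 0.0.0.10
!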

Internet connectivity and IPsec tunnel

The following config is applied on the WAN-RTR:

!
crypto keyring auth-keyring
local-address GigabitEthernet2
pre-shared-key address 1.1.1.2 key MYSECRET
!
crypto isakmp policy 200
encryption aes
authentication pre-share
group 2
lifetime 28800
crypto isakmp profile auth-isakmpprofile
keyring auth-keyring
match identity address 1.1.1.2 255.255.255.255
local-address GigabitEthernet2
!
crypto ipsec transform-set auth-ipsec esp-aes esp-sha-hmac
mode tunnel
!
crypto ipsec profile auth-ipsecprofile
set transform-set auth-ipsec
set pfs group2
!
interface Tunnel10
ip address 2.2.2.1 255.255.255.252
tunnel source GigabitEthernet2
tunnel destination 1.1.1.2
tunnel protection ipsec profile auth-ipsecprofile
ip virtual-reassembly
!
interface GigabitEthernet2
mtu 9216
ip address 1.1.1.1 255.255.255.252
!

I will not paste the full config of the AWS-GW; suffice it to say that it is essentially identical, just with the IP addresses swapped.
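For reference, a sketch of the mirrored tunnel-side config on the AWS-GW (same ISAKMP/IPsec profiles as above, addresses swapped):

!
crypto keyring auth-keyring
local-address GigabitEthernet2
pre-shared-key address 1.1.1.1 key MYSECRET
!
interface Tunnel10
ip address 2.2.2.2 255.255.255.252
tunnel source GigabitEthernet2
tunnel destination 1.1.1.1
tunnel protection ipsec profile auth-ipsecprofile
ip virtual-reassembly
!
interface GigabitEthernet2
mtu 9216
ip address 1.1.1.2 255.255.255.252
!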

Some show commands below just to verify internet connectivity:

WAN-RTR#show ip route ospf
Codes: L - local, C - connected, S - static, R - RIP, M - mobile, B - BGP
D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
E1 - OSPF external type 1, E2 - OSPF external type 2, m - OMP
n - NAT, Ni - NAT inside, No - NAT outside, Nd - NAT DIA
i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
ia - IS-IS inter area, * - candidate default, U - per-user static route
H - NHRP, G - NHRP registered, g - NHRP registration summary
o - ODR, P - periodic downloaded static route, l - LISP
a - application route
+ - replicated route, % - next hop override, p - overrides from PfR

Gateway of last resort is not set

10.0.0.0/32 is subnetted, 7 subnets
O 10.254.180.1 [110/42] via 10.254.180.12, 00:24:40, GigabitEthernet1
O 10.254.180.11 [110/82] via 10.254.180.12, 00:24:40, GigabitEthernet1
O 10.254.180.12 [110/2] via 10.254.180.12, 00:24:45, GigabitEthernet1
O 10.254.181.1 [110/42] via 10.254.180.12, 00:24:40, GigabitEthernet1
O 10.254.181.11 [110/82] via 10.254.180.12, 00:24:40, GigabitEthernet1
O 10.254.181.12 [110/2] via 10.254.180.12, 00:21:45, GigabitEthernet1

WAN-RTR#ping 2.2.2.2
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 2.2.2.2, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 2/3/5 ms


WAN-RTR#show crypto ipsec sa

interface: Tunnel10
Crypto map tag: Tunnel10-head-0, local addr 1.1.1.1

protected vrf: (none)
local ident (addr/mask/prot/port): (1.1.1.1/255.255.255.255/47/0)
remote ident (addr/mask/prot/port): (1.1.1.2/255.255.255.255/47/0)
current_peer 1.1.1.2 port 500
PERMIT, flags={origin_is_acl,}
#pkts encaps: 32742, #pkts encrypt: 32742, #pkts digest: 32742
#pkts decaps: 32720, #pkts decrypt: 32720, #pkts verify: 32720
#pkts compressed: 0, #pkts decompressed: 0
#pkts not compressed: 0, #pkts compr. failed: 0
#pkts not decompressed: 0, #pkts decompress failed: 0
#send errors 0, #recv errors 0

local crypto endpt.: 1.1.1.1, remote crypto endpt.: 1.1.1.2
plaintext mtu 9150, path mtu 9216, ip mtu 9216, ip mtu idb GigabitEthernet2
current outbound spi: 0x4B316738(1261528888)
PFS (Y/N): Y, DH group: group2

inbound esp sas:
spi: 0xAFAF5054(2947502164)
transform: esp-aes esp-sha-hmac ,
in use settings ={Tunnel, }
conn id: 2029, flow_id: CSR:29, sibling_flags FFFFFFFF80004048, crypto map: Tunnel10-head-0
sa timing: remaining key lifetime (k/sec): (4607992/2808)
IV size: 16 bytes
replay detection support: Y
Status: ACTIVE(ACTIVE)

inbound ah sas:

inbound pcp sas:

outbound esp sas:
spi: 0x4B316738(1261528888)
transform: esp-aes esp-sha-hmac ,
in use settings ={Tunnel, }
conn id: 2030, flow_id: CSR:30, sibling_flags FFFFFFFF80004048, crypto map: Tunnel10-head-0
sa timing: remaining key lifetime (k/sec): (4607995/2808)
IV size: 16 bytes
replay detection support: Y
Status: ACTIVE(ACTIVE)

outbound ah sas:

outbound pcp sas:

Configuring IPv4 eBGP between WAN-RTR and AWS-GW

This is the WAN-RTR part:

!
router bgp 65431
bgp router-id 10.254.180.13
bgp log-neighbor-changes
redistribute ospf 1
neighbor 2.2.2.2 remote-as 222
!
router ospf 1
redistribute bgp 65431 subnets
!

While this section is configured on the AWS-GW:

!
interface Loopback0
description RID
ip address 5.5.5.5 255.255.255.255
!
router bgp 222
bgp log-neighbor-changes
neighbor 2.2.2.1 remote-as 65431
!
address-family ipv4
network 5.5.5.5 mask 255.255.255.255
neighbor 2.2.2.1 activate
exit-address-family
!

Configuring a VXLAN-EVPN VTEP on the AWS-GW

10.254.180.12 is the loopback IP of the border leaf:

!
interface Loopback1
description VTEP
ip address 6.6.6.6 255.255.255.255
!
interface nve1
no ip address
source-interface Loopback1
host-reachability protocol bgp
!
router bgp 222
neighbor 10.254.180.12 remote-as 65431
neighbor 10.254.180.12 ebgp-multihop 10
neighbor 10.254.180.12 update-source Loopback0
!
address-family ipv4
network 6.6.6.6 mask 255.255.255.255
no neighbor 10.254.180.12 activate
exit-address-family
!
address-family l2vpn evpn
rewrite-evpn-rt-asn
no neighbor 2.2.2.1 activate
neighbor 10.254.180.12 activate
neighbor 10.254.180.12 send-community both
exit-address-family
!

Configure EVPN Multi-Site and eBGP-EVPN on the Border leaf

!
interface loopback2
description MULTISITE-VIP
ip address 8.8.8.8/32
ip router ospf UNDERLAY area 0.0.0.10
ip pim sparse-mode
!
interface Ethernet1/1
evpn multisite fabric-tracking
!
interface Ethernet1/2
evpn multisite dci-tracking
!
evpn multisite border-gateway 100
!
interface nve1
multisite border-gateway interface loopback2
!
router bgp 65431
router-id 10.254.180.12
log-neighbor-changes
address-family l2vpn evpn
advertise-pip
neighbor 5.5.5.5
remote-as 222
update-source loopback0
ebgp-multihop 10
peer-type fabric-external
address-family l2vpn evpn
send-community
send-community extended
rewrite-evpn-rt-asn
!

Now it would be a good moment to test what we have done so far:

BORDER-LEAF# show bgp l2vpn evpn summary 
BGP summary information for VRF default, address family L2VPN EVPN
BGP router identifier 10.254.180.12, local AS number 65431
BGP table version is 20, L2VPN EVPN config peers 2, capable peers 2
15 network entries and 15 paths using 2880 bytes of memory
BGP attribute entries [8/1312], BGP AS path entries [1/6]
BGP community entries [0/0], BGP clusterlist entries [1/4]

Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
5.5.5.5 4 222 52 46 20 0 0 00:42:22 0
10.254.180.1 4 65431 517 511 20 0 0 00:42:14 3

Neighbor T AS PfxRcd Type-2 Type-3 Type-4 Type-5
5.5.5.5 E 222 3 0 0 0 0
10.254.180.1 I 65431 3 1 0 0 2



AWS-GW#show ip route
Codes: L - local, C - connected, S - static, R - RIP, M - mobile, B - BGP
D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
E1 - OSPF external type 1, E2 - OSPF external type 2, m - OMP
n - NAT, Ni - NAT inside, No - NAT outside, Nd - NAT DIA
i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
ia - IS-IS inter area, * - candidate default, U - per-user static route
H - NHRP, G - NHRP registered, g - NHRP registration summary
o - ODR, P - periodic downloaded static route, l - LISP
a - application route
+ - replicated route, % - next hop override, p - overrides from PfR

Gateway of last resort is not set

1.0.0.0/8 is variably subnetted, 2 subnets, 2 masks
C 1.1.1.0/30 is directly connected, GigabitEthernet2
L 1.1.1.2/32 is directly connected, GigabitEthernet2
2.0.0.0/8 is variably subnetted, 2 subnets, 2 masks
C 2.2.2.0/30 is directly connected, Tunnel10
L 2.2.2.2/32 is directly connected, Tunnel10
5.0.0.0/32 is subnetted, 1 subnets
C 5.5.5.5 is directly connected, Loopback0
6.0.0.0/32 is subnetted, 1 subnets
C 6.6.6.6 is directly connected, Loopback1
8.0.0.0/32 is subnetted, 1 subnets
B 8.8.8.8 [20/2] via 2.2.2.1, 00:39:58
10.0.0.0/32 is subnetted, 7 subnets
B 10.254.180.1 [20/42] via 2.2.2.1, 00:42:27
B 10.254.180.11 [20/82] via 2.2.2.1, 00:42:27
B 10.254.180.12 [20/2] via 2.2.2.1, 00:10:44
B 10.254.180.13 [20/0] via 2.2.2.1, 02:00:50
B 10.254.181.1 [20/42] via 2.2.2.1, 00:42:27
B 10.254.181.11 [20/82] via 2.2.2.1, 00:42:27
B 10.254.181.12 [20/2] via 2.2.2.1, 00:39:58

AWS-GW#show bgp l2vpn evpn summary
BGP router identifier 6.6.6.6, local AS number 222
BGP table version is 42, main routing table version 42
6 network entries using 2064 bytes of memory
6 path entries using 1248 bytes of memory
5/3 BGP path/bestpath attribute entries using 1440 bytes of memory
1 BGP AS-PATH entries using 24 bytes of memory
6 BGP extended community entries using 224 bytes of memory
0 BGP route-map cache entries using 0 bytes of memory
0 BGP filter-list cache entries using 0 bytes of memory
BGP using 5000 total bytes of memory
BGP activity 70/48 prefixes, 84/62 paths, scan interval 60 secs
7 networks peaked at 10:38:02 Jun 21 2020 UTC (09:05:30.876 ago)

Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
10.254.180.12 4 65431 50 53 42 0 0 00:42:57 0

Everything looks in order, so let’s configure the VRF on the CSR 1000V now:

!
vrf definition PROD
rd 5.5.5.5:100
!
address-family ipv4
route-target export 222:1000000
route-target import 222:1000000
route-target export 222:1000000 stitching
route-target import 222:1000000 stitching
exit-address-family
!
interface Loopback100
description TENANT-LOOPBACK
vrf forwarding PROD
ip address 100.100.100.1 255.255.255.255
!
interface GigabitEthernet1
description INTERFACE FACING VMS INSIDE THE CLOUD
vrf forwarding PROD
ip address 10.10.10.1 255.255.255.0
!
interface BDI100
description L3VNI-SVI
vrf forwarding PROD
ip address 33.33.33.1 255.255.255.0
!
interface nve1
member vni 1000000 vrf PROD
no mop enabled
no mop sysid
!
router bgp 222
!
address-family ipv4 vrf PROD
advertise l2vpn evpn
redistribute connected
exit-address-family
!
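For completeness, nothing special is required on the fabric side for the VRF: since “rewrite-evpn-rt-asn” is enabled on both ends of the eBGP EVPN session, the usual auto route-targets (ASN:VNI) keep matching across the AS boundary. A minimal NX-OS sketch of what the leaf-side VRF typically looks like (in this lab it already exists as part of the running fabric):

!
vrf context PROD
vni 1000000
rd auto
address-family ipv4 unicast
route-target both auto
route-target both auto evpn
!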

So, now let’s verify that the control plane looks good before we test the data plane. On the cloud side we see this, which is very promising:

AWS-GW#show bgp l2vpn evpn         
BGP table version is 42, local router ID is 6.6.6.6
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
x best-external, a additional-path, c RIB-compressed,
t secondary path, L long-lived-stale,
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path
Route Distinguisher: 5.5.5.5:100 (default for vrf PROD)
*> [5][5.5.5.5:100][0][24][10.10.10.0]/17
0.0.0.0 0 32768 ?
*> [5][5.5.5.5:100][0][24][33.33.33.0]/17
0.0.0.0 0 32768 ?
*> [5][5.5.5.5:100][0][32][100.100.100.1]/17
0.0.0.0 0 32768 ?
Route Distinguisher: 10.254.180.11:3
*> [5][10.254.180.11:3][0][24][10.10.255.0]/17
8.8.8.8 1 0 65431 ?
*> [5][10.254.180.11:3][0][32][100.100.100.11]/17
8.8.8.8 1 0 65431 ?
Route Distinguisher: 10.254.180.12:3
Network Next Hop Metric LocPrf Weight Path
*> [5][10.254.180.12:3][0][32][100.100.100.12]/17
10.254.181.12 0 0 65431 ?
AWS-GW#show ip route vrf PROD

Routing Table: PROD
Codes: L - local, C - connected, S - static, R - RIP, M - mobile, B - BGP
D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
E1 - OSPF external type 1, E2 - OSPF external type 2, m - OMP
n - NAT, Ni - NAT inside, No - NAT outside, Nd - NAT DIA
i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
ia - IS-IS inter area, * - candidate default, U - per-user static route
H - NHRP, G - NHRP registered, g - NHRP registration summary
o - ODR, P - periodic downloaded static route, l - LISP
a - application route
+ - replicated route, % - next hop override, p - overrides from PfR

Gateway of last resort is not set

10.0.0.0/8 is variably subnetted, 3 subnets, 2 masks
C 10.10.10.0/24 is directly connected, GigabitEthernet1
L 10.10.10.1/32 is directly connected, GigabitEthernet1
B 10.10.255.0/24 [20/1] via 8.8.8.8, 01:07:09, BDI100
B 10.10.255.10/32 [20/2000] via 8.8.8.8, 00:15:16, BDI100
33.0.0.0/8 is variably subnetted, 2 subnets, 2 masks
C 33.33.33.0/24 is directly connected, BDI100
L 33.33.33.1/32 is directly connected, BDI100
100.0.0.0/32 is subnetted, 3 subnets
C 100.100.100.1 is directly connected, Loopback100
B 100.100.100.11 [20/1] via 8.8.8.8, 00:48:38, BDI100
B 100.100.100.12 [20/0] via 10.254.181.12, 00:48:38, BDI100

AWS-GW#show nve peers
Interface VNI Type Peer-IP RMAC/Num_RTs eVNI state flags UP time
nve1 1000000 L3CP 8.8.8.8 0200.0808.0808 1000000 UP A/M/4 00:50:52
nve1 1000000 L3CP 10.254.181.12 0c99.26cd.6407 1000000 UP A/M/4 00:50:52

AWS-GW#show nve vni
Interface VNI Multicast-group VNI state Mode BD cfg vrf
nve1 1000000 N/A Up L3CP 100 CLI PROD

On the DC side (TOR leaf where the end host is connected) we see this:

LEAF# show bgp l2vpn evpn 
BGP routing table information for VRF default, address family L2VPN EVPN
BGP table version is 107, Local Router ID is 10.254.180.11
Status: s-suppressed, x-deleted, S-stale, d-dampened, h-history, *-valid, >-best
Path type: i-internal, e-external, c-confed, l-local, a-aggregate, r-redist, I-injected
Origin codes: i - IGP, e - EGP, ? - incomplete, | - multipath, & - backup, 2 - best2

Network Next Hop Metric LocPrf Weight Path
Route Distinguisher: 5.5.5.5:100
*>i[5]:[0]:[0]:[24]:[10.10.10.0]/224
8.8.8.8 100 0 222 ?
*>i[5]:[0]:[0]:[24]:[33.33.33.0]/224
8.8.8.8 100 0 222 ?
*>i[5]:[0]:[0]:[32]:[100.100.100.1]/224
8.8.8.8 100 0 222 ?

Route Distinguisher: 10.254.180.11:34001 (L2VNI 1001234)
*>l[2]:[0]:[0]:[48]:[0c99.2637.ab00]:[0]:[0.0.0.0]/216
10.254.181.11 100 32768 i
*>l[2]:[0]:[0]:[48]:[0c99.2637.ab00]:[32]:[10.10.255.10]/272
10.254.181.11 100 32768 i

Route Distinguisher: 10.254.180.12:3
*>i[2]:[0]:[0]:[48]:[0c99.26cd.6407]:[0]:[0.0.0.0]/216
10.254.181.12 100 0 i
*>i[5]:[0]:[0]:[32]:[100.100.100.12]/224
10.254.181.12 0 100 0 ?

Route Distinguisher: 10.254.180.11:3 (L3VNI 1000000)
*>l[2]:[0]:[0]:[48]:[0c99.2683.ff07]:[0]:[0.0.0.0]/216
10.254.181.11 100 32768 i
*>i[2]:[0]:[0]:[48]:[0c99.26cd.6407]:[0]:[0.0.0.0]/216
10.254.181.12 100 0 i
*>i[5]:[0]:[0]:[24]:[10.10.10.0]/224
8.8.8.8 100 0 222 ?
*>l[5]:[0]:[0]:[24]:[10.10.255.0]/224
10.254.181.11 0 100 32768 ?
*>i[5]:[0]:[0]:[24]:[33.33.33.0]/224
8.8.8.8 100 0 222 ?
*>i[5]:[0]:[0]:[32]:[100.100.100.1]/224
8.8.8.8 100 0 222 ?
*>l[5]:[0]:[0]:[32]:[100.100.100.11]/224
10.254.181.11 0 100 32768 ?
*>i[5]:[0]:[0]:[32]:[100.100.100.12]/224
10.254.181.12 0 100 0 ?

LEAF# show ip route vrf PROD
IP Route Table for VRF "PROD"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%' in via output denotes VRF

10.10.10.0/24, ubest/mbest: 1/0
*via 8.8.8.8%default, [200/0], 01:09:48, bgp-65431, internal, tag 222 (evpn) segid: 1000000 tunnelid: 0x8080808 encap: VXLAN

10.10.255.0/24, ubest/mbest: 1/0, attached
*via 10.10.255.1, Vlan1234, [0/0], 09:39:35, direct, tag 12345
10.10.255.1/32, ubest/mbest: 1/0, attached
*via 10.10.255.1, Vlan1234, [0/0], 09:39:35, local, tag 12345
10.10.255.10/32, ubest/mbest: 1/0, attached
*via 10.10.255.10, Vlan1234, [190/0], 09:39:07, hmm
33.33.33.0/24, ubest/mbest: 1/0
*via 8.8.8.8%default, [200/0], 01:09:48, bgp-65431, internal, tag 222 (evpn) segid: 1000000 tunnelid: 0x8080808 encap: VXLAN

100.100.100.1/32, ubest/mbest: 1/0
*via 8.8.8.8%default, [200/0], 01:09:48, bgp-65431, internal, tag 222 (evpn) segid: 1000000 tunnelid: 0x8080808 encap: VXLAN

100.100.100.11/32, ubest/mbest: 2/0, attached
*via 100.100.100.11, Lo100, [0/0], 02:26:37, local, tag 12345
*via 100.100.100.11, Lo100, [0/0], 02:26:37, direct, tag 12345
100.100.100.12/32, ubest/mbest: 1/0
*via 10.254.181.12%default, [200/0], 01:09:50, bgp-65431, internal, tag 65431 (evpn) segid: 1000000 tunnelid: 0xafeb50c encap: VXLAN


LEAF# show nve peers
Interface Peer-IP State LearnType Uptime Router-Mac
--------- -------------------------------------- ----- --------- -------- -----------------
nve1 8.8.8.8 Up CP 01:09:56 0200.0808.0808
nve1 10.254.181.12 Up CP 01:09:57 0c99.26cd.6407

LEAF# show nve vni
Codes: CP - Control Plane DP - Data Plane
UC - Unconfigured SA - Suppress ARP
SU - Suppress Unknown Unicast
Xconn - Crossconnect
MS-IR - Multisite Ingress Replication

Interface VNI Multicast-group State Mode Type [BD/VRF] Flags
--------- -------- ----------------- ----- ---- ------------------ -----
nve1 1000000 n/a Up CP L3 [PROD]
nve1 1001234 239.0.0.1 Up CP L2 [1234]

Since everything looks in order, let’s give it a try from our servers:

DC-SRV:~# ping 10.10.10.100
PING 10.10.10.100 (10.10.10.100): 56 data bytes
64 bytes from 10.10.10.100: seq=0 ttl=61 time=63.145 ms
64 bytes from 10.10.10.100: seq=1 ttl=61 time=33.286 ms
64 bytes from 10.10.10.100: seq=2 ttl=61 time=24.156 ms
64 bytes from 10.10.10.100: seq=3 ttl=61 time=152.668 ms
64 bytes from 10.10.10.100: seq=4 ttl=61 time=95.257 ms
^C
--- 10.10.10.100 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 24.156/73.702/152.668 ms

DC-SRV:~# traceroute 10.10.10.100
traceroute to 10.10.10.100 (10.10.10.100), 30 hops max, 46 byte packets
1 10.10.255.1 (10.10.255.1) 7.546 ms 10.992 ms 8.583 ms
2 100.100.100.12 (100.100.100.12) 58.480 ms 31.835 ms 35.748 ms
3 33.33.33.1 (33.33.33.1) 33.018 ms 34.620 ms 37.632 ms
4 10.10.10.100 (10.10.10.100) 42.153 ms 42.390 ms 37.878 ms

As you can see we have connectivity!!!

In summary

  1. Do not extend layer 2 into the cloud, it’s just dumb!
  2. VXLAN-EVPN gives you a very simple way to extend your network into the cloud
  3. The solution works with virtually every vendor, even if we just demoed Cisco
  4. A big advantage of EVPN is that you only need to define the VRFs at the edge and the control plane will take care of the rest; the transit is totally transparent
  5. Do NOT forget jumbo MTU, you really don’t want to fragment 🙂

Public Cloud connectivity done wrong

If your idea for interconnection and migration to the public cloud involves using NSX and L2VPN so that you can “stretch the VLAN” between your private NSX farm and the one in the cloud, you are doing it wrong.

No matter if you are using VXLAN as a transport or any other technology: if your plan involves layer 2 extension, you are doing it wrong.

Not every application should be migrated to the public cloud, and you most definitely should not migrate something that relies on layer 2 adjacency to work.

If layer 2 extension is a way to allow IP mobility, then again, it’s just a lazy design. There are better ways to provide same-subnet IP mobility that don’t require layer 2 (see LISP or BGP EVPN Type 5 routing, for example).

Even if it works on PowerPoint or in a small demo, you really should NOT.

Scaling EVPN Multi-Site Overlays using Route-Servers

Cisco’s EVPN Multi-Site is a great technology that allows us to achieve massive scale in an EVPN network. With the latest release, the official scalability numbers give us something in the realm of over 12,000 VTEPs (512 VTEPs per site x 25 sites).
I’m in no way suggesting that you would need such a big topology, and you definitely should segment way before you reach the limit, but still…

The main configuration requirement for the Multi-Site overlay is to have a full mesh of eBGP peering between all border gateways.

This has the usual scalability drawbacks. Not only does each border gateway end up with an ever-growing number of peers, but, perhaps worse, every time a site is added every other site must be touched too.

To avoid a full mesh in iBGP topologies we would use a Route Reflector, but with eBGP that’s obviously not an option. So, instead of an RR, the way to scale eBGP peerings is to leverage a Route-Server.

A Route-Server provides route reflection capabilities for eBGP, and as such it must ensure that NLRI attributes like the next hop and the route-targets aren’t changed.
In Cisco’s EVPN implementation the auto-derived route-targets are based on ASN:VNI, so in order to keep using this simplified config the RS should also support the “rewrite-evpn-rt-asn” feature; if that’s not the case, hard-coded, consistent route-targets must be defined across all the VTEPs in the network. Finally, the route-server doesn’t have to be in the data plane, since it’s purely a control plane node.

Unfortunately, for EVPN there isn’t a “route-server-client” configuration knob yet 😦, nor can we find a configuration example in the Cisco pages. Fortunately, knowing the requirements, we can figure out what the config should look like.
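On the border-gateway side nothing new is needed: each BGW simply points its multi-site eBGP EVPN peering at the route-server instead of at every other BGW. A minimal NX-OS sketch, assuming a hypothetical route-server loopback at 12.12.12.12 in AS 12345 (the route-server AS used in the configs below), as seen from the BGW in AS 100:

!
router bgp 100
neighbor 12.12.12.12
remote-as 12345
update-source loopback0
ebgp-multihop 10
peer-type fabric-external
address-family l2vpn evpn
send-community
send-community extended
rewrite-evpn-rt-asn
!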

NX-OS EVPN Route-Server Configuration

feature nv overlay
nv overlay evpn
feature bgp
!
route-map RETAIN-ORIGINAL-VTEP-NEXTHOP permit 10
set ip next-hop unchanged
!
router bgp 12345
log-neighbor-changes
address-family l2vpn evpn
retain route-target all
neighbor 1.1.1.0
remote-as 100
address-family l2vpn evpn
send-community
send-community extended
route-map RETAIN-ORIGINAL-VTEP-NEXTHOP out
rewrite-evpn-rt-asn
neighbor 1.1.1.3
remote-as 200
address-family l2vpn evpn
send-community
send-community extended
route-map RETAIN-ORIGINAL-VTEP-NEXTHOP out
rewrite-evpn-rt-asn

IOS-XE EVPN Route-Server Configuration

!
route-map RETAIN-ORIGINAL-VTEP-NEXTHOP permit 10
set ip next-hop unchanged
!
router bgp 12345
bgp log-neighbor-changes
no bgp default route-target filter
neighbor 1.1.1.0 remote-as 100
neighbor 1.1.1.0 disable-connected-check
neighbor 1.1.1.3 remote-as 200
neighbor 1.1.1.3 disable-connected-check
!
address-family l2vpn evpn
rewrite-evpn-rt-asn
neighbor 1.1.1.0 activate
neighbor 1.1.1.0 send-community both
neighbor 1.1.1.0 soft-reconfiguration inbound
neighbor 1.1.1.0 route-map RETAIN-ORIGINAL-VTEP-NEXTHOP out
neighbor 1.1.1.3 activate
neighbor 1.1.1.3 send-community both
neighbor 1.1.1.3 soft-reconfiguration inbound
neighbor 1.1.1.3 route-map RETAIN-ORIGINAL-VTEP-NEXTHOP out
exit-address-family
!

Just a couple of notes on the above config:

  1. The command “disable-connected-check” is required, otherwise the router will reject received prefixes with “DENIED due to: non-connected MP_REACH NEXTHOP”
  2. The command “next-hop-unchanged” has no effect in the address-family L2VPN EVPN (probably a bug). A route-map is necessary in order to achieve the same result.

IOS-XR EVPN Route-Server Configuration

!
route-policy ACCEPT-ALL
pass
end-policy
!
router bgp 12345
nsr
bgp router-id 1.1.1.1
bgp graceful-restart
!
address-family l2vpn evpn
retain route-target all
!
neighbor 1.1.1.0
remote-as 100
ignore-connected-check
!
address-family l2vpn evpn
send-community-ebgp
route-policy ACCEPT-ALL in
route-policy ACCEPT-ALL out
send-extended-community-ebgp
soft-reconfiguration inbound always
next-hop-unchanged
!
!
neighbor 1.1.1.3
remote-as 200
ignore-connected-check
!
address-family l2vpn evpn
send-community-ebgp
route-policy ACCEPT-ALL in
route-policy ACCEPT-ALL out
send-extended-community-ebgp
soft-reconfiguration inbound always
next-hop-unchanged
!
!
!

As for IOS-XE, the command “ignore-connected-check” is required.
Additionally, IOS-XR unfortunately doesn’t support “rewrite-evpn-rt-asn”.
This means that each VTEP will need the appropriate route-targets configured manually, which greatly increases configuration complexity.
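To illustrate, here is a minimal NX-OS sketch with two hypothetical sites in AS 100 and AS 200 sharing L3VNI 1000000: instead of relying on “route-target both auto evpn”, every VTEP must explicitly import the other site’s ASN:VNI route-target:

!
vrf context PROD
vni 1000000
address-family ipv4 unicast
route-target both 100:1000000 evpn
route-target import 200:1000000 evpn
!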

Unless you have some automation backing your EVPN deployment, it probably isn’t a good idea to use IOS-XR as an EVPN Route-Server.

Do you have anything else to add? Then contact me, or leave a message below.

SONiC and White Box switches in the Enterprise DC! – Part 3

After discussing the architecture of our design in part 1, and the underlay configuration in part 2, today I’ll show how the overlay is configured, and hopefully we will be able to answer the question: are SONiC and White Box switches ready to be used in the enterprise DC?

Our two servers will be connected with LACP and trunk interfaces. One VLAN will be bridged (no SVI), and both servers will have an interface in that VLAN so that layer 2 can be tested.
Two more VLANs will each be configured on a different pair of switches, together with an SVI, so that layer 3 symmetric IRB can be tested.

VRF Configuration

First of all, let’s create a VRF. This VRF requires a VLAN and a Layer 3 VNI for symmetric IRB to function. Configuration is really simple, but a small caveat must not be overlooked: every VRF name must start with the prefix Vrf-.

From a configuration point of view, we have to follow the usual steps:

  1. Create a VRF
  2. Create a VLAN and allow it on the peer-link port-channel
  3. Create an SVI interface and assign it to the VRF
  4. Associate the VNI to the VLAN, then map it as the L3 VNI
  5. Configure BGP’s address family (in FRR)

config vrf add Vrf-prod 
config vlan add 3800
config vlan member add 3800 PortChannel1
config interface vrf bind Vlan3800 Vrf-prod
config vxlan map add nve1 3800 1000000
config vrf add_vrf_vni_map Vrf-prod 1000000

vtysh
conf t
router bgp 65000 vrf Vrf-prod
bgp router-id 10.0.0.11
bgp log-neighbor-changes
bgp graceful-restart
bgp graceful-restart preserve-fw-state
!
address-family ipv4 unicast
redistribute connected
exit-address-family
!
address-family l2vpn evpn
advertise ipv4 unicast
exit-address-family
end
exit

Once done, we can easily verify the config:

root@SONIC-Leaf301:/home/admin# show vlan brief
+-----------+---------------+--------------+----------------+-----------------------+
| VLAN ID | IP Address | Ports | Port Tagging | DHCP Helper Address |
+===========+===============+==============+================+=======================+
| 3800 | | PortChannel1 | tagged | |
+-----------+---------------+--------------+----------------+-----------------------+
| 3965 | 10.10.10.0/31 | PortChannel1 | tagged | |
+-----------+---------------+--------------+----------------+-----------------------+
root@SONIC-Leaf301:/home/admin# show vxlan vlanvnimap
+----------+---------+
| VLAN | VNI |
+==========+=========+
| Vlan3800 | 1000000 |
+----------+---------+
Total count : 1

root@SONIC-Leaf301:/home/admin# show vxlan vrfvnimap
+----------+---------+
| VRF | VNI |
+==========+=========+
| Vrf-prod | 1000000 |
+----------+---------+
Total count : 1

root@SONIC-Leaf301:/home/admin# show ip route vrf Vrf-prod
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route, q - queued route, r - rejected route, # - not installed in hardware
VRF Vrf-prod:
C>* 1.1.1.0/31 is directly connected, Vlan1234, 00:00:06
B>* 1.1.1.2/31 [200/0] via 11.11.11.113, Vlan3800 onlink, 00:21:34
C>* 100.100.100.1/32 is directly connected, Loopback100, 00:07:20
B>* 100.100.100.2/32 [200/0] via 1.1.1.1, Vlan1234, 00:00:04
B>* 100.100.100.3/32 [200/0] via 11.11.11.113, Vlan3800 onlink, 00:21:34
B>* 100.100.100.4/32 [200/0] via 11.11.11.113, Vlan3800 onlink, 00:21:34

root@SONIC-Leaf301:/home/admin# show vxlan tunnel
+--------------+--------------+-------------------+--------------+
| SIP | DIP | Creation Source | OperStatus |
+==============+==============+===================+==============+
| 11.11.11.111 | 11.11.11.113 | EVPN | oper_up |
+--------------+--------------+-------------------+--------------+
Total count : 1

root@SONIC-Leaf301:/home/admin# ping 100.100.100.2 -I Vrf-prod
ping: Warning: source address might be selected on device other than Vrf-prod.
PING 100.100.100.2 (100.100.100.2) from 1.1.1.0 Vrf-prod: 56(84) bytes of data.
64 bytes from 100.100.100.2: icmp_seq=1 ttl=64 time=0.255 ms
64 bytes from 100.100.100.2: icmp_seq=2 ttl=64 time=0.239 ms
^C
--- 100.100.100.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1007ms
rtt min/avg/max/mdev = 0.239/0.247/0.255/0.008 ms
root@SONIC-Leaf301:/home/admin# ping 100.100.100.3 -I Vrf-prod
ping: Warning: source address might be selected on device other than Vrf-prod.
PING 100.100.100.3 (100.100.100.3) from 100.100.100.1 Vrf-prod: 56(84) bytes of data.
64 bytes from 100.100.100.3: icmp_seq=1 ttl=64 time=0.452 ms
64 bytes from 100.100.100.3: icmp_seq=2 ttl=64 time=0.301 ms
^C
--- 100.100.100.3 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1005ms
rtt min/avg/max/mdev = 0.301/0.376/0.452/0.077 ms
root@SONIC-Leaf301:/home/admin# ping 100.100.100.4 -I Vrf-prod
ping: Warning: source address might be selected on device other than Vrf-prod.
PING 100.100.100.4 (100.100.100.4) from 100.100.100.1 Vrf-prod: 56(84) bytes of data.
64 bytes from 100.100.100.4: icmp_seq=1 ttl=63 time=0.345 ms
64 bytes from 100.100.100.4: icmp_seq=2 ttl=63 time=0.279 ms
64 bytes from 100.100.100.4: icmp_seq=3 ttl=63 time=0.251 ms
^C
--- 100.100.100.4 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2045ms
rtt min/avg/max/mdev = 0.251/0.291/0.345/0.044 ms

I’ve also included one unique loopback on each leaf, and VRF-lite iBGP between the two MC-LAG peers across the peer-link (the reason why this is necessary is left to the reader to figure out, at least for now 😉 ); a sketch of that peering follows below.
Connectivity between the loopbacks is also verified.
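For reference, a minimal sketch of that VRF-lite iBGP peering, assuming (as the routing table above suggests) that Vlan1234 with 1.1.1.0/31 carries the session across the peer-link:

config vlan add 1234
config vlan member add 1234 PortChannel1
config interface vrf bind Vlan1234 Vrf-prod
config interface ip add Vlan1234 1.1.1.0/31

vtysh
conf t
router bgp 65000 vrf Vrf-prod
neighbor 1.1.1.1 remote-as 65000
end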

VLAN Configuration

It might have been overlooked, but while configuring the VRF we already configured a VLAN (VLAN 3800). But let’s give it another try:

config vlan add 200
config vlan member add 200 PortChannel1
config vxlan map add nve1 200 1000200

The configuration of an SVI (when necessary) is also trivial; we just need to take care of enabling ARP suppression and of specifying that the IP address is a Distributed Anycast Gateway (DAG):

config interface vrf bind Vlan200 Vrf-prod
config interface ip anycast-address add Vlan200 10.10.200.1/24
config neigh_suppress enable 200

In my case, I also want to configure DHCP relay for my server, and I can do that with a single line (can you tell why I need to enable option 82 sub-option link selection?):

config interface ip dhcp-relay add Vlan200 10.10.10.100 10.10.10.101 -src-intf=Loopback100 -link-select=enable

On top of the previously used show commands, a few more can be used to verify the applied config:

root@SONIC-Leaf301:/home/admin# show ip static-anycast-gateway 
Configured Anycast Gateway MAC address: 00:00:22:22:33:33
IPv4 Anycast Gateway MAC address: enable
Total number of gateway: 2
Total number of gateway admin UP: 2
Total number of gateway oper UP: 2
Interfaces Gateway Address Vrf Admin/Oper
------------ ----------------- -------- ------------
Vlan200 10.10.200.1/24 Vrf-prod up/up
Vlan500 10.10.10.1/24 Vrf-prod up/up

root@SONIC-Leaf301:/home/admin# show neigh-suppress all
+----------+----------------+---------------------+
| VLAN | STATUS | ASSOCIATED_NETDEV |
+==========+================+=====================+
| Vlan3800 | Not Configured | nve1-3800 |
+----------+----------------+---------------------+
| Vlan100 | Not Configured | nve1-100 |
+----------+----------------+---------------------+
| Vlan200 | Configured | nve1-200 |
+----------+----------------+---------------------+
| Vlan500 | Configured | nve1-500 |
+----------+----------------+---------------------+
Total count : 4

root@SONIC-Leaf301:/home/admin# show ip dhcp-relay brief
+------------------+-----------------------+
| Interface Name | DHCP Helper Address |
+==================+=======================+
| Vlan200 | 10.10.10.100 |
| | 10.10.10.101 |
+------------------+-----------------------+

SONIC-Leaf301# show bgp l2vpn evpn vni
Advertise Gateway Macip: Disabled
Advertise SVI Macip: Disabled
Advertise All VNI flag: Enabled
BUM flooding: Head-end replication
Number of L2 VNIs: 3
Number of L3 VNIs: 1
Flags: * - Kernel
VNI Type RD Import RT Export RT Tenant VRF
* 1000200 L2 10.0.0.11:200 65000:1000200 65000:1000200 Vrf-prod
* 1000500 L2 10.0.0.11:500 65000:1000500 65000:1000500 Vrf-prod
* 1000100 L2 10.0.0.11:100 65000:1000100 65000:1000100 default
* 1000000 L3 10.0.0.11:5096 65000:1000000 65000:1000000 Vrf-prod

We did quite a lot of configuration just now, but of course we cannot see anything until we configure the ports facing our hosts.

Hosts port configuration

The switches I am working on have a limitation where every group of 12 ports must run at exactly the same speed. This is an issue of this specific switch, not a SONiC problem; nonetheless, we need to be aware of it.

root@SONIC-Leaf301:/home/admin# show portgroup                         
portgroup ports valid speeds
----------- ------------- ----------------
1 Ethernet0-11 25000,10000,1000
2 Ethernet12-23 25000,10000,1000
3 Ethernet24-35 25000,10000,1000
4 Ethernet36-47 25000,10000,1000

root@SONIC-Leaf301:/home/admin# config portgroup speed 1 10000
Config portgroup 1 speed 10000

Now it’s time to configure our MCLAG port-channel:

config portchannel add PortChannel100      
config portchannel member add PortChannel100 Ethernet9
config mclag member add 1 PortChannel100
config interface startup Ethernet9

To verify the config:

root@SONIC-Leaf301:/home/admin# show interfaces portchannel 
Flags: A - active, I - inactive, Up - up, Dw - Down, N/A - not available, S - selected, D - deselected
No. Team Dev Protocol Ports
----- -------------- ----------- ---------------------------
1 PortChannel1 LACP(A)(Up) Ethernet52(S) Ethernet48(S)
100 PortChannel100 LACP(A)(Up) Ethernet9(S)
101 PortChannel101 LACP(A)(Up) Ethernet1(S)

admin@SONIC-Leaf301:~$ sonic-cli
SONIC-Leaf301# show mclag brief

Domain ID : 1
Role : active
Session Status : up
Peer Link Status : up
Source Address : 10.0.0.11
Peer Address : 10.0.0.12
Peer Link : PortChannel1
Keepalive Interval : 1 secs
Session Timeout : 30 secs
System Mac : 80:a2:35:81:dd:f0


Number of MLAG Interfaces:2
-----------------------------------------------------------
MLAG Interface Local/Remote Status
-----------------------------------------------------------
PortChannel101 up/up
PortChannel100 up/up

And to finish, we only need to add the VLANs to the trunks:

config vlan member add 100 PortChannel100 
config vlan member add 200 PortChannel100
root@SONIC-Leaf301:/home/admin# show vlan brief 
+-----------+---------------+----------------+----------------+-----------------------+
| VLAN ID | IP Address | Ports | Port Tagging | DHCP Helper Address |
+===========+===============+================+================+=======================+
| 100 | | PortChannel1 | tagged | |
| | | PortChannel100 | tagged | |
+-----------+---------------+----------------+----------------+-----------------------+
| 200 | | PortChannel1 | tagged | 10.10.10.100 |
| | | PortChannel100 | tagged | 10.10.10.101 |
+-----------+---------------+----------------+----------------+-----------------------+
| 500 | | PortChannel1 | tagged | |
| | | PortChannel101 | tagged | |
+-----------+---------------+----------------+----------------+-----------------------+
| 1234 | 1.1.1.0/31 | PortChannel1 | tagged | |
+-----------+---------------+----------------+----------------+-----------------------+
| 3800 | | PortChannel1 | tagged | |
+-----------+---------------+----------------+----------------+-----------------------+
| 3965 | 10.10.10.0/31 | PortChannel1 | tagged | |
+-----------+---------------+----------------+----------------+-----------------------+

Now, if everything works as expected, I should be able to see remote MAC addresses as well as /32 host routes in the routing table:

root@SONIC-Leaf301:/home/admin# show lldp table 
Capability codes: (R) Router, (B) Bridge, (O) Other
LocalPort RemoteDevice RemotePortID Capability RemotePortDescr
----------- -------------- ----------------- ------------ -------------------------
Ethernet1 csp-srv-02 90:e2:ba:f6:cd:6d O Interface 13 as enp7s0f1
Ethernet9 MKTK-SW01 bond1.200 R
Ethernet48 SONIC-Leaf302 80:a2:35:81:e3:f0 BR Ethernet48
Ethernet52 SONIC-Leaf302 80:a2:35:81:e3:f0 BR Ethernet52
Ethernet72 SONIC-Spine31 80:a2:35:f2:7f:94 BR Ethernet120
Ethernet76 SONIC-Spine32 80:a2:35:f2:80:c0 BR Ethernet120
eth0 c6500-vxlan Gi3/36 BR GigabitEthernet3/36
--------------------------------------------------
Total entries displayed: 7

root@SONIC-Leaf301:/home/admin# show mac
No. Vlan MacAddress Port Type
----- ------ ----------------- ----------------------- -------
1 100 B8:69:F4:99:D1:4A PortChannel100 Dynamic
2 100 B8:69:F4:99:D1:4C VxLAN DIP: 11.11.11.113 Dynamic
3 200 B8:69:F4:99:D1:4A PortChannel100 Dynamic
4 500 02:5C:1F:02:1F:11 VxLAN DIP: 11.11.11.113 Dynamic
5 500 02:5C:1F:02:20:10 PortChannel101 Dynamic
6 1234 80:A2:35:81:E3:F0 PortChannel1 Static
7 3965 80:A2:35:81:E3:F0 PortChannel1 Static
Total number of entries 7

root@SONIC-Leaf301:/home/admin# show vxlan evpn_remote_mac all
+---------+-------------------+--------------+---------+---------+
| VLAN | MAC | RemoteVTEP | VNI | Type |
+=========+===================+==============+=========+=========+
| Vlan100 | b8:69:f4:99:d1:4c | 11.11.11.113 | 1000100 | dynamic |
+---------+-------------------+--------------+---------+---------+
| Vlan500 | 02:5c:1f:02:1f:11 | 11.11.11.113 | 1000500 | dynamic |
+---------+-------------------+--------------+---------+---------+
Total count : 2

root@SONIC-Leaf301:/home/admin# show ip route vrf Vrf-prod
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route, q - queued route, r - rejected route, # - not installed in hardware
VRF Vrf-prod:
C>* 1.1.1.0/31 is directly connected, Vlan1234, 02:05:54
B>* 1.1.1.2/31 [200/0] via 11.11.11.113, Vlan3800 onlink, 02:27:22
C>* 10.10.10.0/24 is directly connected, sag500.256, 01:12:29
B>* 10.10.10.100/32 [200/0] via 11.11.11.113, Vlan3800 onlink, 02:27:22
B>* 10.10.30.0/24 [200/0] via 11.11.11.113, Vlan3800 onlink, 02:27:22
B>* 10.10.30.151/32 [200/0] via 11.11.11.113, Vlan3800 onlink, 02:27:22
C>* 10.10.200.0/24 is directly connected, sag200.256, 01:12:29
C>* 100.100.100.1/32 is directly connected, Loopback100, 02:13:08
B>* 100.100.100.2/32 [200/0] via 1.1.1.1, Vlan1234, 02:05:52
B>* 100.100.100.3/32 [200/0] via 11.11.11.113, Vlan3800 onlink, 02:27:22
B>* 100.100.100.4/32 [200/0] via 11.11.11.113, Vlan3800 onlink, 02:27:22

And of course, let’s not forget the real test: data-plane testing from the servers themselves.

DHCP:

[admin@MKTK-SW01] > /ip dhcp-client print 
Flags: X - disabled, I - invalid, D - dynamic
# INTERFACE USE-PEER-DNS ADD-DEFAULT-ROUTE STATUS ADDRESS
0 bond1.200 yes yes bound 10.10.200.149/24
1 bond2.300 yes yes bound 10.10.30.151/24

Bridging:

[admin@MKTK-SW01] > ping 10.10.100.20 routing-table=HOST1 interface=bond1.100
  SEQ HOST                                     SIZE TTL TIME  STATUS  
    0 10.10.100.20                               56  64 0ms  
    1 10.10.100.20                               56  64 0ms  
    2 10.10.100.20                               56  64 0ms  
    3 10.10.100.20                               56  64 0ms  
    4 10.10.100.20                               56  64 0ms  
    sent=5 received=5 packet-loss=0% min-rtt=0ms avg-rtt=0ms max-rtt=0ms 

[admin@MKTK-SW01] > tool traceroute 10.10.100.20 routing-table=HOST1 interface=bond1.100 
 # ADDRESS                          LOSS SENT    LAST     AVG    BEST   WORST STD-DEV STATUS
 1 10.10.100.20                       0%    5   0.1ms     0.1     0.1     0.1       0

Routing:

[admin@MKTK-SW01] > ping 10.10.30.151 routing-table=HOST1 interface=bond1.200 
  SEQ HOST                                     SIZE TTL TIME  STATUS 
    0 10.10.30.151                               56  64 0ms  
    1 10.10.30.151                               56  64 0ms  
    2 10.10.30.151                               56  64 0ms  
    3 10.10.30.151                               56  64 0ms  
    4 10.10.30.151                               56  64 0ms  
    sent=5 received=5 packet-loss=0% min-rtt=0ms avg-rtt=0ms max-rtt=0ms 

[admin@MKTK-SW01] > tool traceroute 10.10.30.151 routing-table=HOST1 interface=bond1.200 
 # ADDRESS                          LOSS SENT    LAST     AVG    BEST   WORST STD-DEV STATUS
 1 10.10.200.1                        0%    5   0.2ms     0.3     0.2     0.3       0
 2 1.1.1.3                            0%    5   0.2ms     0.2     0.2     0.3       0
 3 10.10.30.151                       0%    5   0.1ms     0.1     0.1     0.1       0

In Conclusion

With what we have seen so far, I really believe that SONiC is mature enough to cover most of the common DC network requirements. Notice that, differently from other vendors’ solutions that believe they can do everything, including making you coffee or taking you to the moon, SONiC is more specialised: it does a few things and does them very well.
As long as what you need to do is supported by SONiC, go ahead: it isn’t going to disappoint you.

An enterprise that is considering running SONiC should also understand the support model.
SONiC itself comes without support; really, we are looking at a typical open-source situation where you can choose to operate the software completely free of charge on your own, or pay a reputable company to provide you with a patched and supported software revision (a bit like Red Hat or SUSE Linux).

From a hardware standpoint, I think that the white boxes are mature enough. For example, Edge-Core’s AS7326-56X is basically identical to Juniper’s QFX5120 (including the port groups).
We are really in the same world as your servers: you can get your hardware from any vendor, or you can pick a trusted one like Dell. It’s up to you really.

In short then, what are the takeaways for “standard” enterprises?

  1. SONiC will work great if what you need to do fits the supported features
  2. White Box switches are comparable or identical to the big vendors’ hardware
  3. You REALLY should be looking at someone to provide you with end-to-end support though; maybe someone like Broadcom or another service provider, to ensure you get a single point of contact for all of your possible problems
  4. The knowledge gap can be scary at first, but it’s no longer a big obstacle. ACI, for example, was a nightmare and took me forever to learn and understand; SONiC on the other hand was a piece of cake.
  5. Try and experiment: open networking is so cheap that it costs almost nothing to bring up a small lab or even a production POC.

SONiC and White Box switches in the Enterprise DC! – Part 2

As discussed in part 1, we are trying to configure a VXLAN-EVPN fabric using SONiC on white box switches, in order to determine if Open Networking is ready to be deployed in most enterprise DCs.

As a small recap, below is the topology we are trying to bring online:

Familiarise with the OS

The most interesting thing about SONiC is its architecture!
I’ll write a blog post just about it, because it’s a fascinating topic, but in short: every single process lives inside a dedicated container.

Linux SONIC-Leaf301 4.9.0-11-2-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64
You are on
  ____   ___  _   _ _  ____
 / ___| / _ \| \ | (_)/ ___|
 \___ \| | | |  \| | | |
  ___) | |_| | |\  | | |___
 |____/ \___/|_| \_|_|\____|

-- Software for Open Networking in the Cloud --

Unauthorized access and/or use are prohibited.
All access and/or use are subject to monitoring.

Help:    http://azure.github.io/SONiC/

Last login: Thu Apr 20 12:52:21 2017 from 192.168.0.31
admin@SONIC-Leaf301:~$ show version 

SONiC Software Version: SONiC-OS-3.0.1-Enterprise_Advanced
Product: Enterprise Advanced SONiC OS - Powered by Broadcom
Distribution: Debian 9.12
Kernel: 4.9.0-11-2-amd64
Build commit: aff85dcf1
Build date: Mon Apr 20 09:48:32 UTC 2020
Built by: sonicbld@sonic-lvn-csg-004

Platform: x86_64-accton_as7326_56x-r0
HwSKU: Accton-AS7326-56X
ASIC: broadcom
Serial Number: 732656X1916020
Uptime: 13:04:04 up 12 days, 29 min,  1 user,  load average: 2.14, 2.46, 2.42

Docker images:
REPOSITORY                       TAG                         IMAGE ID            SIZE
docker-sonic-telemetry           3.0.1-Enterprise_Advanced   15838a6dd8b6        397MB
docker-sonic-telemetry           latest                      15838a6dd8b6        397MB
docker-sonic-mgmt-framework      3.0.1-Enterprise_Advanced   c30542f39b9c        445MB
docker-sonic-mgmt-framework      latest                      c30542f39b9c        445MB
docker-swss-brcm-ent-advanced    3.0.1-Enterprise_Advanced   7db851611618        338MB
docker-swss-brcm-ent-advanced    latest                      7db851611618        338MB
docker-broadview                 3.0.1-Enterprise_Advanced   547641c0c886        330MB
docker-broadview                 latest                      547641c0c886        330MB
docker-nat                       3.0.1-Enterprise_Advanced   b819906b8bb6        320MB
docker-nat                       latest                      b819906b8bb6        320MB
docker-vrrp                      3.0.1-Enterprise_Advanced   2f0615d57ea4        333MB
docker-vrrp                      latest                      2f0615d57ea4        333MB
docker-teamd                     3.0.1-Enterprise_Advanced   3487700dc8d2        318MB
docker-teamd                     latest                      3487700dc8d2        318MB
docker-fpm-frr                   3.0.1-Enterprise_Advanced   bf6e7649147e        367MB
docker-fpm-frr                   latest                      bf6e7649147e        367MB
docker-iccpd                     3.0.1-Enterprise_Advanced   1c24858c993b        320MB
docker-iccpd                     latest                      1c24858c993b        320MB
docker-l2mcd                     3.0.1-Enterprise_Advanced   b0f6db69227b        319MB
docker-l2mcd                     latest                      b0f6db69227b        319MB
docker-stp                       3.0.1-Enterprise_Advanced   c812baaadda5        316MB
docker-stp                       latest                      c812baaadda5        316MB
docker-udld                      3.0.1-Enterprise_Advanced   66c2afbe849a        316MB
docker-udld                      latest                      66c2afbe849a        316MB
docker-sflow                     3.0.1-Enterprise_Advanced   9cf4e8a00ff9        318MB
docker-sflow                     latest                      9cf4e8a00ff9        318MB
docker-dhcp-relay                3.0.1-Enterprise_Advanced   5217cd436c40        326MB
docker-dhcp-relay                latest                      5217cd436c40        326MB
docker-syncd-brcm-ent-advanced   3.0.1-Enterprise_Advanced   800a3fc3af8b        439MB
docker-syncd-brcm-ent-advanced   latest                      800a3fc3af8b        439MB
docker-lldp-sv2                  3.0.1-Enterprise_Advanced   3a2e52d444f9        309MB
docker-lldp-sv2                  latest                      3a2e52d444f9        309MB
docker-snmp-sv2                  3.0.1-Enterprise_Advanced   d5a8e1d0ba7d        342MB
docker-snmp-sv2                  latest                      d5a8e1d0ba7d        342MB
docker-tam                       3.0.1-Enterprise_Advanced   272eabe18352        361MB
docker-tam                       latest                      272eabe18352        361MB
docker-pde                       3.0.1-Enterprise_Advanced   6ff2567c42b8        495MB
docker-pde                       latest                      6ff2567c42b8        495MB
docker-platform-monitor          3.0.1-Enterprise_Advanced   0b22d6abcd9a        367MB
docker-platform-monitor          latest                      0b22d6abcd9a        367MB
docker-router-advertiser         3.0.1-Enterprise_Advanced   9d201b15eae3        288MB
docker-router-advertiser         latest                      9d201b15eae3        288MB
docker-database                  3.0.1-Enterprise_Advanced   fb46e0661772        288MB
docker-database                  latest                      fb46e0661772        288MB

Below is the list of interfaces on my leaf. Notice how the naming of the interfaces can be confusing, specifically for the ones that can be channelised (like 40/100Gbps interfaces, which support breakout). The primary channel is used as the interface number, as with interface Ethernet48. If an interface is then broken out, the other channels will be listed as Ethernet49, 50 and 51, making the next physical interface Ethernet52.
Interface aliases are really interesting; unfortunately they currently act more like a description, and even after switching to “alias” as the default interface naming mode, the alias is used in very few places, making it pretty much useless as of now.

admin@SONIC-Leaf301:~$ show interfaces status 
  Interface            Lanes    Speed    MTU             Alias    Vlan    Oper    Admin             Type    Asym PFC
-----------  ---------------  -------  -----  ----------------  ------  ------  -------  ---------------  ----------
  Ethernet0                3      25G   9100   twentyFiveGigE1  routed    down     down   SFP/SFP+/SFP28         N/A
  Ethernet1                2      25G   9100   twentyFiveGigE2  routed    down     down   SFP/SFP+/SFP28         N/A
  Ethernet2                4      25G   9100   twentyFiveGigE3  routed    down     down              N/A         N/A
  Ethernet3                8      25G   9100   twentyFiveGigE4  routed    down     down              N/A         N/A
  Ethernet4                7      25G   9100   twentyFiveGigE5  routed    down     down              N/A         N/A
  Ethernet5                1      25G   9100   twentyFiveGigE6  routed    down     down              N/A         N/A
  Ethernet6                5      25G   9100   twentyFiveGigE7  routed    down     down              N/A         N/A
  Ethernet7               16      25G   9100   twentyFiveGigE8  routed    down     down              N/A         N/A
  Ethernet8                6      25G   9100   twentyFiveGigE9  routed    down     down              N/A         N/A
  Ethernet9               14      25G   9100  twentyFiveGigE10  routed    down     down   SFP/SFP+/SFP28         N/A
 Ethernet10               13      25G   9100  twentyFiveGigE11  routed    down     down              N/A         N/A
 Ethernet11               15      25G   9100  twentyFiveGigE12  routed    down     down              N/A         N/A
 Ethernet12               23      25G   9100  twentyFiveGigE13  routed    down     down              N/A         N/A
 Ethernet13               22      25G   9100  twentyFiveGigE14  routed    down     down              N/A         N/A
 Ethernet14               24      25G   9100  twentyFiveGigE15  routed    down     down              N/A         N/A
 Ethernet15               32      25G   9100  twentyFiveGigE16  routed    down     down              N/A         N/A
 Ethernet16               31      25G   9100  twentyFiveGigE17  routed    down     down              N/A         N/A
 Ethernet17               21      25G   9100  twentyFiveGigE18  routed    down     down              N/A         N/A
 Ethernet18               29      25G   9100  twentyFiveGigE19  routed    down     down              N/A         N/A
 Ethernet19               36      25G   9100  twentyFiveGigE20  routed    down     down              N/A         N/A
 Ethernet20               30      25G   9100  twentyFiveGigE21  routed    down     down              N/A         N/A
 Ethernet21               34      25G   9100  twentyFiveGigE22  routed    down     down              N/A         N/A
 Ethernet22               33      25G   9100  twentyFiveGigE23  routed    down     down              N/A         N/A
 Ethernet23               35      25G   9100  twentyFiveGigE24  routed    down     down              N/A         N/A
 Ethernet24               43      25G   9100  twentyFiveGigE25  routed    down     down              N/A         N/A
 Ethernet25               42      25G   9100  twentyFiveGigE26  routed    down     down              N/A         N/A
 Ethernet26               44      25G   9100  twentyFiveGigE27  routed    down     down              N/A         N/A
 Ethernet27               52      25G   9100  twentyFiveGigE28  routed    down     down              N/A         N/A
 Ethernet28               51      25G   9100  twentyFiveGigE29  routed    down     down              N/A         N/A
 Ethernet29               41      25G   9100  twentyFiveGigE30  routed    down     down              N/A         N/A
 Ethernet30               49      25G   9100  twentyFiveGigE31  routed    down     down              N/A         N/A
 Ethernet31               60      25G   9100  twentyFiveGigE32  routed    down     down              N/A         N/A
 Ethernet32               50      25G   9100  twentyFiveGigE33  routed    down     down              N/A         N/A
 Ethernet33               58      25G   9100  twentyFiveGigE34  routed    down     down              N/A         N/A
 Ethernet34               57      25G   9100  twentyFiveGigE35  routed    down     down              N/A         N/A
 Ethernet35               59      25G   9100  twentyFiveGigE36  routed    down     down              N/A         N/A
 Ethernet36               62      25G   9100  twentyFiveGigE37  routed    down     down              N/A         N/A
 Ethernet37               63      25G   9100  twentyFiveGigE38  routed    down     down              N/A         N/A
 Ethernet38               64      25G   9100  twentyFiveGigE39  routed    down     down              N/A         N/A
 Ethernet39               65      25G   9100  twentyFiveGigE40  routed    down     down              N/A         N/A
 Ethernet40               66      25G   9100  twentyFiveGigE41  routed    down     down              N/A         N/A
 Ethernet41               61      25G   9100  twentyFiveGigE42  routed    down     down              N/A         N/A
 Ethernet42               68      25G   9100  twentyFiveGigE43  routed    down     down              N/A         N/A
 Ethernet43               69      25G   9100  twentyFiveGigE44  routed    down     down              N/A         N/A
 Ethernet44               67      25G   9100  twentyFiveGigE45  routed    down     down              N/A         N/A
 Ethernet45               71      25G   9100  twentyFiveGigE46  routed    down     down              N/A         N/A
 Ethernet46               72      25G   9100  twentyFiveGigE47  routed    down     down              N/A         N/A
 Ethernet47               70      25G   9100  twentyFiveGigE48  routed    down     down              N/A         N/A
 Ethernet48      77,78,79,80     100G   9100     hundredGigE49  routed    down     down  QSFP28 or later         N/A
 Ethernet52      85,86,87,88     100G   9100     hundredGigE50  routed    down     down  QSFP28 or later         N/A
 Ethernet56      93,94,95,96     100G   9100     hundredGigE51  routed    down     down              N/A         N/A
 Ethernet60     97,98,99,100     100G   9100     hundredGigE52  routed    down     down              N/A         N/A
 Ethernet64  105,106,107,108     100G   9100     hundredGigE53  routed    down     down              N/A         N/A
 Ethernet68  113,114,115,116     100G   9100     hundredGigE54  routed    down     down              N/A         N/A
 Ethernet72  121,122,123,124     100G   9100     hundredGigE55  routed    down     down  QSFP28 or later         N/A
 Ethernet76  125,126,127,128     100G   9100     hundredGigE56  routed    down     down  QSFP28 or later         N/A
 Ethernet80              129      10G   9100     mgmtTenGigE57  routed    down     down              N/A         N/A
 Ethernet81              128      10G   9100     mgmtTenGigE58  routed    down     down              N/A         N/A

Configuring the Underlay Routing

I’m a big fan of automation and configuration simplicity. I strongly believe that if I can automate with “notepad” using blind copy/paste, I have good templates for fancier automation. For this reason, I really think that unnumbered interfaces are a great way to configure spine/leaf links.

The first step, then, is to configure all fabric interfaces with a proper MTU and “ip unnumbered”, as in the example below. Please note that this post isn’t meant to be a full configuration tutorial.

config loopback add Loopback0
config interface ip add Loopback0 10.0.0.1/32
 
config interface ip unnumbered add Ethernet120 Loopback0
config interface mtu Ethernet120 9216
config interface startup Ethernet120
 
... Repeat for all interfaces facing a leaf ...
config save -y

A leaf switch is configured exactly the same way, but I also need to add a second loopback interface to be used as the VTEP source interface. Since this loopback acts as the MC-LAG anycast VTEP IP, both leafs in the MC-LAG pair will have the exact same IP on their Loopback1.

config loopback add Loopback0
config interface ip add Loopback0 10.0.0.11/32
config loopback add Loopback1
config interface ip add Loopback1 11.11.11.111/32

config interface ip unnumbered add Ethernet72 Loopback0
config interface mtu Ethernet72 9216
config interface description Ethernet72 "LINK_TO_SPINE_1"
config interface startup Ethernet72
  
config interface ip unnumbered add Ethernet76 Loopback0
config interface mtu Ethernet76 9216
config interface description Ethernet76 "LINK_TO_SPINE_2"
config interface startup Ethernet76
config save -y
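
For completeness, here is a sketch of the same step on the second leaf of the MC-LAG pair (assuming, consistently with the MC-LAG config shown later, that Leaf302 uses 10.0.0.12 for its Loopback0). Note the unique Loopback0 but the identical anycast IP on Loopback1:

config loopback add Loopback0
config interface ip add Loopback0 10.0.0.12/32
config loopback add Loopback1
config interface ip add Loopback1 11.11.11.111/32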

At this point we need to configure OSPF between leafs and spines.
Unfortunately, advanced routing configs can only be applied inside the FRR container, so we first need to switch to the FRR shell with the command “vtysh”. From there on, there is really almost no difference from the well-known Cisco-like CLI.

The biggest downside of this lack of integration is that the FRR config needs to be saved separately from the rest of SONiC’s config, and we also need to tell SONiC to look for the routing config in a different place. To do that, we apply the “config routing_config_mode split” command and, most importantly, reboot the box as the warning message tells you. Failure to do so will cause the switch to lose the FRR config on reload.

vtysh
 conf t
  !
  bfd
  !
  router ospf
   ospf router-id 10.0.0.11
   log-adjacency-changes
   auto-cost reference-bandwidth 100000
  !
  interface Ethernet72
   ip ospf area 0.0.0.1
   ip ospf bfd
   ip ospf network point-to-point
  !
  interface Ethernet76
   ip ospf area 0.0.0.1
   ip ospf bfd
   ip ospf network point-to-point
  !
  interface Loopback0
   ip ospf area 0.0.0.1
  !
  interface Loopback1
   ip ospf area 0.0.0.1
  end
 write memory
 exit
config routing_config_mode split
config save -y

Once everything is configured, we can check our routing from FRR:

SONIC-Leaf301# show ip ospf neighbor 

Neighbor ID     Pri State           Dead Time Address         Interface            RXmtL RqstL DBsmL
10.0.0.1          1 Full/DROther      33.775s 10.0.0.1        Ethernet72:10.0.0.11     0     0     0
10.0.0.2          1 Full/DROther      33.968s 10.0.0.2        Ethernet76:10.0.0.11     0     0     0

SONIC-Leaf301# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued route, r - rejected route, # - not installed in hardware

O>*  10.0.0.1/32 [110/11] via 10.0.0.1, Ethernet72 onlink, 00:03:08
O>*  10.0.0.2/32 [110/11] via 10.0.0.2, Ethernet76 onlink, 00:03:18
C *  10.0.0.11/32 is directly connected, Ethernet76, 00:05:08
C *  10.0.0.11/32 is directly connected, Ethernet72, 00:05:08
O    10.0.0.11/32 [110/10] via 0.0.0.0, Loopback0 onlink, 00:05:14
C>*  10.0.0.11/32 is directly connected, Loopback0, 00:05:15
O>*  10.0.0.12/32 [110/12] via 10.0.0.1, Ethernet72 onlink, 00:03:08
  *                        via 10.0.0.2, Ethernet76 onlink, 00:03:08
O>*  10.0.0.13/32 [110/12] via 10.0.0.1, Ethernet72 onlink, 00:03:08
  *                        via 10.0.0.2, Ethernet76 onlink, 00:03:08
O>*  10.0.0.14/32 [110/12] via 10.0.0.1, Ethernet72 onlink, 00:03:08
  *                        via 10.0.0.2, Ethernet76 onlink, 00:03:08
O>*  10.10.10.2/31 [110/12] via 10.0.0.1, Ethernet72 onlink, 00:03:08
  *                         via 10.0.0.2, Ethernet76 onlink, 00:03:08
O    11.11.11.111/32 [110/10] via 0.0.0.0, Loopback1 onlink, 00:05:14
C>*  11.11.11.111/32 is directly connected, Loopback1, 00:05:15
O>*  11.11.11.113/32 [110/12] via 10.0.0.1, Ethernet72 onlink, 00:03:08
  *                           via 10.0.0.2, Ethernet76 onlink, 00:03:08

SONIC-Leaf301# ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.243 ms
64 bytes from 10.0.0.1: icmp_seq=2 ttl=64 time=0.186 ms
^C
--- 10.0.0.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1011ms
rtt min/avg/max/mdev = 0.186/0.214/0.243/0.032 ms
SONIC-Leaf301# ping 10.0.0.2
PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.
64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=0.215 ms
^C
--- 10.0.0.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.215/0.215/0.215/0.000 ms

admin@SONIC-Leaf301:~$ traceroute 10.0.0.13
traceroute to 10.0.0.13 (10.0.0.13), 30 hops max, 60 byte packets
1 10.0.0.1 (10.0.0.1) 0.218 ms 10.0.0.2 (10.0.0.2) 0.178 ms 10.0.0.1 (10.0.0.1) 0.126 ms
2 10.0.0.13 (10.0.0.13) 0.439 ms 0.461 ms 0.468 ms

Now that every loopback is reachable (and we can see ECMP across the two spines), it’s time to configure MC-LAG between our leafs, as well as underlay routing across the peer-link. This step can only be done now because the MC-LAG peer has to be reachable via the fabric.

config portchannel add PortChannel1       
config interface mtu PortChannel1 9216

config interface mtu Ethernet48 9216
config interface description Ethernet48 "Peer-link"
config interface startup Ethernet48

config interface mtu Ethernet52 9216
config interface description Ethernet52 "Peer-link"
config interface startup Ethernet52

config portchannel member add PortChannel1 Ethernet48
config portchannel member add PortChannel1 Ethernet52
config mclag add 1 10.0.0.11 10.0.0.12 PortChannel1
 
config vlan add 3965
config vlan member add 3965 PortChannel1
config mclag unique-ip add Vlan3965
config interface ip add Vlan3965 10.10.10.0/31

vtysh
 conf t
  !
  interface Vlan3965
   ip ospf area 0.0.0.1
  end
 write memory
 exit
config save -y
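
On the peer leaf the MC-LAG config is mirrored: after creating Vlan3965 and its PortChannel1 membership exactly as above, the source and peer addresses are swapped and the other end of the /31 goes on Vlan3965. A sketch for Leaf302 (addresses inferred from the outputs below):

config mclag add 1 10.0.0.12 10.0.0.11 PortChannel1
config mclag unique-ip add Vlan3965
config interface ip add Vlan3965 10.10.10.1/31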

Once done, we should see our additional OSPF peer and a working MC-LAG cluster:

admin@SONIC-Leaf301:~$ vtysh 

Hello, this is FRRouting (version 7.2-sonic).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

SONIC-Leaf301# show ip ospf neighbor 

Neighbor ID     Pri State           Dead Time Address         Interface            RXmtL RqstL DBsmL
10.0.0.1          1 Full/DROther      30.645s 10.0.0.1        Ethernet72:10.0.0.11     0     0     0
10.0.0.2          1 Full/DROther      30.899s 10.0.0.2        Ethernet76:10.0.0.11     0     0     0
10.0.0.12         1 Full/DR           36.717s 10.10.10.1      Vlan3965:10.10.10.0      0     0     0

SONIC-Leaf301# exit

admin@SONIC-Leaf301:~$ sonic-cli 
SONIC-Leaf301# show mclag brief
 
Domain ID            : 1
Role                 : active
Session Status       : up
Peer Link Status     : up
Source Address       : 10.0.0.11
Peer Address         : 10.0.0.12
Peer Link            : PortChannel1
Keepalive Interval   : 1 secs
Session Timeout      : 30 secs
System Mac           : 80:a2:35:81:dd:f0
 
 
Number of MLAG Interfaces:0

Everything works as expected, but we also faced yet another SONiC annoyance. To configure interfaces and their IP addresses, OSPF, and MC-LAG, we needed access to three different shells (the Linux CLI, vtysh, and sonic-cli), either to apply configuration or to run the show commands that verify it.
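
To recap, these are the three shells in play (commands shown purely to illustrate which shell owns what):

admin@SONIC-Leaf301:~$ config interface mtu Ethernet72 9216   # Linux CLI: interfaces, VLANs, VXLAN, MC-LAG
admin@SONIC-Leaf301:~$ vtysh                                  # FRR shell: OSPF/BGP config and verification
admin@SONIC-Leaf301:~$ sonic-cli                              # sonic-cli: e.g. "show mclag brief"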

Configuring BGP-EVPN control plane

Now it’s time to configure BGP. As per our architecture, I’ll be configuring iBGP with route reflectors sitting on the spines. To do so, I’ll need the FRR shell.
The spine config will look something like this:

vtysh
 conf t
  !
  router bgp 65000
   bgp router-id 10.0.0.1
   bgp log-neighbor-changes
   neighbor FABRIC peer-group
   neighbor FABRIC remote-as 65000
   neighbor FABRIC update-source Loopback0
   bgp listen range 10.0.0.0/24 peer-group FABRIC
  !
  address-family l2vpn evpn
   neighbor FABRIC activate
   neighbor FABRIC route-reflector-client
   advertise-all-vni
   exit-address-family
  end
 exit

And the leafs, like this (note that thanks to the “bgp listen range” statement, the spines accept sessions dynamically from any loopback in 10.0.0.0/24, so new leafs can be added without ever touching the spine config):

vtysh
 conf t
  !
  router bgp 65000
   bgp router-id 10.0.0.11
   bgp log-neighbor-changes
   neighbor 10.0.0.1 remote-as 65000
   neighbor 10.0.0.1 update-source Loopback0
   neighbor 10.0.0.2 remote-as 65000
   neighbor 10.0.0.2 update-source Loopback0
  !
  address-family l2vpn evpn
   neighbor 10.0.0.1 activate
   neighbor 10.0.0.2 activate
   advertise-all-vni
   advertise ipv4 unicast
  exit-address-family
  end
 exit

Once done, I should be able to see all peerings formed on my spines:

SONIC-Spine31# show bgp l2vpn evpn summary
BGP router identifier 10.0.0.1, local AS number 65000 vrf-id 0
BGP table version 0
RIB entries 16, using 3072 bytes of memory
Peers 4, using 82 KiB of memory
Peer groups 1, using 64 bytes of memory
Neighbor        V         AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd
*10.0.0.11      4      65000       5      29        0    0    0 00:01:33            0
*10.0.0.12      4      65000       6      30        0    0    0 00:01:34            0
*10.0.0.13      4      65000       7      31        0    0    0 00:01:40            0
*10.0.0.14      4      65000       7      31        0    0    0 00:01:44            0

Total number of neighbors 4
* - dynamic neighbor
4 dynamic neighbor(s), limit 100

At this point, the only missing piece is to configure the VTEP on the leaf switches, as well as the anycast gateway’s MAC address; fortunately this is very simple and straightforward:

config vxlan add nve1 11.11.11.111
config vxlan evpn_nvo add nvo1 nve1
config ip anycast-mac-address add aa:aa:bb:bb:cc:cc

root@SONIC-Leaf301:/home/admin# show vxlan interface
VTEP Information:

        VTEP Name : nve1, SIP  : 11.11.11.111
        NVO Name  : nvo1,  VTEP : nve1
        Source interface  : Loopback1

root@SONIC-Leaf301:/home/admin# show ip static-anycast-gateway 
Configured Anycast Gateway MAC address: aa:aa:bb:bb:cc:cc
IPv4 Anycast Gateway MAC address: enable

In short…

We configured a fully functional fabric providing underlay connectivity and an EVPN control plane as follows:

  1. A unique loopback on every switch
  2. Each physical interface between spine and leaf as an ip unnumbered interface
  3. OSPF area 1 within the fabric
  4. MC-LAG and underlay peering across the peer-link
  5. iBGP EVPN between leafs and spines, with RRs on the spines themselves
  6. Each MC-LAG pair as a unique Virtual VTEP.

We also noticed that, while the configuration isn’t complicated by any means, the need to move between multiple shells just to apply or verify configs can be very confusing for the end user. To be fair though, the SONiC community is working on improving this part by delivering a single unified shell.

The FRR config always feels familiar, as it resembles Cisco’s IOS CLI; on the other hand, the basic SONiC CLI can be a bit frustrating at times, especially because it is case sensitive, which makes typos easy to make.

In the next blog post we will look at how to actually configure VXLANs and server-facing interfaces… stay tuned!
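
As a small preview, mapping a VLAN to a VNI on the VTEP we created above should look roughly like this (a sketch only: VLAN 10 and VNI 10010 are hypothetical values, and the exact syntax may vary between SONiC builds):

config vlan add 10
config vxlan map add nve1 10 10010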

SONiC and White Box switches in the Enterprise DC! – Part 1

In recent years, two buzzwords have been on the rise: open networking and white box switches. The two often go hand in hand, and they are frequently promoted by big names like Facebook or Microsoft.
From the software side, SONiC is maybe the biggest player out there as it powers Microsoft Azure’s cloud, while from the hardware side, Accton has arguably been one of the most important vendors.

The truth though, at least in my opinion, is that while this innovation is great, it is not ready to be embraced by everyone yet. Only companies willing to make this “leap of faith” can take advantage of all of this, but what about us poor mortals? Are SONiC and white boxes ready to be widely deployed? Well, let’s give it a look!

We will be deploying a simple VXLAN-EVPN fabric like the one in the picture below, and we will check how difficult it is to configure and troubleshoot the fabric, but also, and most importantly, whether this common Enterprise design actually works.

The Hardware

For our spines we’ll be using Edge-Core’s AS7816-64X, powered by Broadcom’s Tomahawk II chipset. This switch is a 2RU lean spine providing 64x 40/100 Gbps QSFP28 ports.

For the leafs, we’ll be using Edge-Core’s AS7326-56X, powered by Broadcom’s Trident III chipset. This switch is a 1RU TOR providing 48x 1/10/25 Gbps SFP28 and 8x 40/100 Gbps QSFP28 ports.

The Software

As for the software, we will be focusing on SONiC version 3.0.1.
This version introduces support for VXLAN-EVPN among many other things that, in my opinion, make it ready for wider adoption.

The Architecture

Looking at SONiC’s features, we will try to implement the architecture below.
Some choices though, like the use of a Virtual VTEP as opposed to EVPN Multi-homing, or ingress replication for BUM traffic, are dictated purely by what SONiC supports.

SPINE/LEAF POD
├── ENDPOINT ACCESS
│   └── MCLAG with Virtual VTEP (all NLRI advertised with VIP as NH)
├── UNDERLAY
│   ├── Routing
│   │   └── OSPF
│   └── EVPN NLRI exchange
│       ├── iBGP
│       └── Route reflection
│           └── Spine, Fabric
└── OVERLAY
    ├── L3 Gateway placement
    │   └── Leafs
    ├── Distributed Anycast Gateway
    │   └── Same IP-MAC
    ├── Service Interface
    │   └── VLAN aware
    └── Host communication
        ├── BUM traffic forwarding
        │   └── Ingress Replication (EVPN Type 3)
        ├── Suppress ARP
        └── Symmetric IRB
            ├── Inter-Subnet
            │   └── L3 VNI
            └── Intra-Subnet
                └── L2 VNI
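
To make the “Symmetric IRB” leg of the tree a bit more concrete: in FRR, symmetric IRB is typically realized by mapping a tenant VRF to an L3 VNI, roughly like this (a sketch with the hypothetical names “Tenant1” and VNI 50000; the actual configuration will be covered in the follow-up posts):

vrf Tenant1
 vni 50000
exit-vrf
!
router bgp 65000 vrf Tenant1
 address-family l2vpn evpn
  advertise ipv4 unicast
 exit-address-family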

I won’t explain why I’ve chosen OSPF+iBGP; that’s a discussion for another time. Suffice it to say that there is no reason to reinvent the wheel, as this design has worked perfectly for decades in the much more complex MPLS Service Provider space.

In short…

In this first post, I wanted to appeal to your curiosity and set expectations right.
Accton switches powered by Broadcom chipsets will be our white box switches, while SONiC is our open source operating system.
In the next one, we will implement the above design, take a look at the SONiC CLI, and try to make it all work.

Spoiler alert… it works, but… well… the details are a lot more interesting…