Understanding ACI PBR

ACI & PBR
Mapping Services to PODS

When you created the service, the integration between ACI and Kubernetes created the ACI constructs and policies needed to expose the external IP. To accomplish this, ACI uses PBR (policy based redirect). PBR makes it possible for the ACI fabric to redirect traffic to L4/L7 devices without requiring those devices to be the default gateway or a routed hop in the path. This is a popular construct in networks for firewalls, intrusion prevention systems, load balancers, and more.

Since PBR is layered on top of ACI contracts, you can also enforce policy on the services being configured. For example, you can create network filters so that traffic destined to a web service is only allowed on ports 80 and 443, ensuring the load balancer receives only the traffic that policy allows and making the network more secure.

The diagram below represents what the integration between ACI and Kubernetes built to expose the service you defined for the simple MyLabAPP. This gives you an easy way to load balance, within your enterprise private network, the services developed for the platform. Since ACI performs the load balancing, as you scale the containers in Kubernetes to handle the service load, ACI automatically redirects traffic to the right containers.
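As a quick illustration of that behavior, the sketch below scales the application and lets the integration pick up the new endpoints. The deployment name mylabapp and the replica count are assumptions for illustration, not values taken from this lab.

# Scale the (assumed) mylabapp deployment to four replicas
kubectl scale deployment mylabapp --replicas=4

# Verify the new pods; the ACI integration adds the new endpoints to the PBR policy automatically
kubectl get pods -o wide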

Since there is no bridge domain containing the subnet defined for the dynamic service, you have to build a static route from the adjacent router into the ACI fabric. For this lab we pre-built the static routes for you to keep things simple.
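For reference, the adjacent router configuration is conceptually similar to the sketch below. The service subnet 10.0.146.64/26 and the border leaf next-hop 10.0.145.1 are illustrative assumptions, not the values used in this lab.

! Hypothetical NX-OS style static route on the adjacent router
! 10.0.146.64/26 = assumed dynamic service subnet, 10.0.145.1 = assumed ACI border leaf L3 out address
ip route 10.0.146.64/26 10.0.145.1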

The configuration that is deployed in ACI starts in the External Layer 3 policy object. Looking in the ACI fabric under the common tenant you can find this policy.

The IP assigned to the service should match the EXTERNAL-IP value shown by the command kubectl get services -o wide.
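For example, the output would look similar to the sketch below. The service name mylabapp, the CLUSTER-IP, and the node port are illustrative placeholders; the EXTERNAL-IP is the load balancer IP exposed by ACI.

kubectl get service mylabapp -o wide

NAME       TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE   SELECTOR
mylabapp   LoadBalancer   10.96.45.12   10.0.146.67   80:31502/TCP   5m    app=mylabapp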

Another easy way to get this information is via the ACI fabric APIC. The APIC provides an NX-OS style CLI that lets you view the same fabric objects you can see from the web interface. The Credentials link at the top of this web page includes a link to connect to the APIC via SSH.
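If you prefer to connect manually, the session looks something like the sketch below; the APIC hostname and username are placeholders, so use the values from the Credentials link.

# Placeholder hostname and username - substitute the values from the Credentials link
ssh admin@apic1.example.lab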


show tenant common vrf k8s_vrf external-l3 epg k8s_pod09_svc_default_mylabapp detail

 Name            Flags                      Match               Node        Entry               Oper State 
 --------------  -------------------------  ------------------  ----------  ------------------  ---------- 
 common:         vxlan: 2555909             10.0.146.67/32      node-209    10.0.146.67/32     enabled    
 k8s_pod09_svc_  vrf: k8s_vrf                                   node-207    10.0.146.67/32     enabled    
 default_mylaba  Target dscp: unspecified                       node-210    10.0.146.67/32     enabled    
 pp              qosclass: unspecified                          node-208    10.0.146.67/32     enabled    
                 Contracts                                                                                 
                 ---------                                                                                 
                 Provided: k8s_pod09_svc_d                                                                 
                 efault_mylabapp                                                                           
                 Consumed:                                                                                                            

Using the same CLI we can also look at the service graph that was created in the fabric.


show l4l7-graph tenant common graph k8s_pod09_svc_global

Graph           : common-k8s_pod09_svc_global
Graph Instances : 1                          

Consumer EPg  : common-k8s-epg                       
Provider EPg  : common-k8s_pod09_svc_default_mylabapp
Contract Name : common-k8s_pod09_svc_default_mylabapp
Config status : applied                              

Function Node Name : loadbalancer
Service Redirect   : enabled
 Connector   Encap       Bridge-Domain  Device Interface      Service Redirect Policy        
 ----------  ----------  ----------     --------------------  ------------------------------ 
 provider    vlan-3209   common-k8s_po  interface             k8s_pod09_svc_default_mylabapp 
                         d09_bd_kubern                                                       
                         etes-service                                                        
 consumer    vlan-3209   common-k8s_po  interface             k8s_pod09_svc_default_mylabapp 
                         d09_bd_kubern                                                       
                         etes-service                                                          

As you can see, the ACI/Kubernetes integration has created a bridge domain that is used to define the service graph instances in the fabric.

Looking at the diagram we showed earlier, you can see that the Service Graph is layered on top of the contract created between the EPG for the service and the external network.

Looking in the APIC CLI we can see:


show tenant common contract k8s_pod09_svc_default_mylabapp

 Tenant      Contract    Type        Qos Class     Scope       Subject     Access-group  Dir   Description 
 ----------  ----------  ----------  ------------  ----------  ----------  ----------    ----  ----------  
 common      k8s_pod09_  permit      unspecified   vrf         loadbalanc  k8s_pod09_sv  both              
             svc_defaul                                        edservice   c_default_my                    
             t_mylabapp                                                    labapp           

One of the important value statements of the ACI integration with Kubernetes is how service definitions in Kubernetes translate directly into ACI policies that are implemented in the fabric dynamically. In the following example, each of the ports defined in the Kubernetes service YAML file will be created in the ACI fabric.

apiVersion: v1
kind: Service
metadata:
  name: mycrazyapp
  labels:
    app: mycrazyapp
spec:
  type: LoadBalancer
  loadBalancerIP: 10.0.146.67
  ports:
  - name: http
    port: 80
    targetPort: 80
  - name: https
    port: 443
    targetPort: 443
  - name: mgmt
    port: 8080
    targetPort: 8080
  - name: data-ingress
    port: 9183
    targetPort: 9183
  selector:
    app: mycrazyapp
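If you wanted to try a definition like this yourself, applying the manifest is a standard kubectl operation; the file name below is an assumption for illustration.

# Save the manifest above as mycrazyapp-svc.yaml (illustrative name), then apply it
kubectl apply -f mycrazyapp-svc.yaml

# Watch for the EXTERNAL-IP to be programmed by the ACI integration
kubectl get service mycrazyapp -o wide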

The contract output above shows that the access-group/filter is called k8s_pod09_svc_default_mylabapp, and we can verify the filter in use for this particular application, which in this case matches TCP port 80:


show tenant common access-list k8s_pod09_svc_default_mylabapp  
Tenant : common
Access-List : k8s_pod09_svc_default_mylabapp

  match tcp dest 80

You can also look at the PBR service that was created in ACI based on the Kubernetes deployment. You will be able to see the hashing algorithm, which in this case uses the default setting sip-dip-prototype, meaning the policy hashes traffic based on Source IP + Destination IP + Protocol Type.

This window also shows the destination nodes, which are the service nodes. In your case these are the IP addresses assigned to pod09-node1 and pod09-node2 via the OpFlex protocol.

Back in the ACI fabric:

Packet Flow

As part of the configuration required for the integration to work, you have to define a static route that points towards the ACI fabric leaf that has the L3 out defined. Traffic arriving from the network towards the exposed service IP is routed via this mechanism; the ACI fabric will never advertise the route to the network by itself.

When the packet hits the ACI fabric leaf, the L3 out policy object will typically have a default route defined that points back out of the fabric towards the adjacent router. If it does not, the behaviour described below does not change.

Once the packet is in the L3 out object context, the leaf looks in the VRF for a destination to send the packet to, and this lookup misses because the service IP is not defined in any bridge domain. For this reason, before sending the packet back out, the leaf performs a lookup on the source IP and classifies the packet into the default EPG, which is defined with 0.0.0.0/0 as its destination.

The packet should now be ready to leave the fabric, but when ACI does the final lookup to forward it, Longest Prefix Match (LPM) places the packet in the service EPG because the destination IP matches the /32 defined in that EPG.

Since LPM placed the packet in the external service EPG that the ACI/Kubernetes integration built, the policy forces the traffic into the Service Graph, applies the contract that was created (in our case TCP 80), and hands it to the Policy Based Redirect (PBR) construct. At this point the service graph knows which destination nodes the packet is intended for, as defined in the policy.

Once the packet reaches the destination node, it is Network Address Translated (NAT) by the Open vSwitch (OVS) instance running on that node. That operation rewrites the destination address from the service IP to the pod IP, and the packet is routed within the host. If there are multiple pods for the service on the same compute host, OVS can also perform a final local load-balancing step across the various pod IPs.
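You can see the pod IP and port that OVS translates to by listing the service endpoints in Kubernetes. The output below is an illustrative sketch, assuming a service named mylabapp backed by two pods; the pod IP addresses are placeholders.

kubectl get endpoints mylabapp

NAME       ENDPOINTS                         AGE
mylabapp   10.33.0.14:8090,10.33.0.15:8090   5m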

In a large fabric you can see the value of ACI managing this process for you. Any increase in scale creates more pods for the service; each is added to the service graph, and PBR load balances them properly across the fabric.

One last part of the life of the packet: the container is listening on a port that is different from the one the service exposes. If you recall from the previous example, the container was listening on port 8090. So for this solution to work, the NAT operation also has to translate the destination port. When you defined the service you specified the target port, and that is all the information the ACI/Kubernetes integration needs to program Open vSwitch to perform this task.
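To make that mapping concrete, a service definition for this case would look like the sketch below, where clients reach port 80 and OVS translates the destination to the container's port 8090. The service name and selector are illustrative assumptions.

apiVersion: v1
kind: Service
metadata:
  name: mylabapp          # illustrative name
spec:
  type: LoadBalancer
  selector:
    app: mylabapp         # illustrative selector
  ports:
  - name: http
    port: 80              # port exposed on the service / load balancer IP
    targetPort: 8090      # port the container actually listens on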