Monday, December 2, 2013

OpenStack RabbitMQ issues and solution

Symptoms

In our deployment, we observed that whenever there is a network interruption, any OpenStack operation that depends on the message queues gets stuck for a long time. For example, VM creation stays stuck in the "Build" state.

Observations

In our observations, it took 12 to 18 minutes to recover from that state. During that time we could see many messages in the Unacknowledged and Ready states.
We also saw a large number of consumers per queue. That implies many TCP connections from a single consumer to the RabbitMQ broker, which does not make sense.

We checked the TCP connections on the RabbitMQ server and there were indeed several in the ESTABLISHED state, while on the consumer (a compute node, say) there was only one connection.

This means the problematic connections had been closed on the consumer, but they still existed on the RabbitMQ server. No health check between the RabbitMQ client and server is implemented in the OpenStack code.
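
The connection check described above can be sketched in Python. This is only an illustration, not the exact procedure we ran: it assumes the broker listens on the default AMQP port 5672, that the input looks like `netstat -tn` output captured on the RabbitMQ server, and the addresses in the sample are made up. A client with more than one ESTABLISHED connection is a candidate for stale connections.

```python
from collections import Counter

AMQP_PORT = 5672  # assumption: default RabbitMQ AMQP port


def count_established_by_client(netstat_output, broker_port=AMQP_PORT):
    """Count ESTABLISHED connections to the broker port, grouped by client IP."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: proto, recv-q, send-q, local addr, foreign addr, state
        if len(fields) < 6 or fields[5] != "ESTABLISHED":
            continue
        local_addr, foreign_addr = fields[3], fields[4]
        # Keep only connections whose local end is the broker port.
        if local_addr.rsplit(":", 1)[-1] != str(broker_port):
            continue
        client_ip = foreign_addr.rsplit(":", 1)[0]
        counts[client_ip] += 1
    return counts


# Hypothetical sample, shaped like `netstat -tn` output on the broker host.
SAMPLE = """\
tcp        0      0 10.0.0.5:5672      10.0.0.21:45012     ESTABLISHED
tcp        0      0 10.0.0.5:5672      10.0.0.21:45388     ESTABLISHED
tcp        0      0 10.0.0.5:5672      10.0.0.21:46120     ESTABLISHED
tcp        0      0 10.0.0.5:5672      10.0.0.22:51234     ESTABLISHED
tcp        0      0 10.0.0.5:22        10.0.0.9:60022      ESTABLISHED
tcp        0      0 10.0.0.5:5672      10.0.0.23:40000     TIME_WAIT
"""

if __name__ == "__main__":
    for client, n in count_established_by_client(SAMPLE).most_common():
        print(client, n)
```

Comparing these per-client counts with the number of connections each client actually reports on its own side shows how many entries on the server are leftovers.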

This state does recover eventually, but it takes a long time, depending on the number of consumers (for example, a large number of nova-compute services).

Solution

We introduced a load balancer and placed the RabbitMQ servers behind a virtual service. The load balancer acts as a kind of proxy that maintains state for each side of a connection. Whenever there is a problem on one side (the client side, say), it closes the connection on the server end.
With this in place, network interruptions such as a switch reboot no longer had any effect on our OpenStack deployment.
We configured the load balancer with an idle timeout of 90 seconds, because the periodic updates from the compute nodes happen every 60 seconds. This way, we do not close RabbitMQ connections unnecessarily.
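
The reasoning behind the 90-second value can be checked with a small sketch. It assumes a simplified model in which the load balancer drops a connection once the gap between two consecutive packets reaches the idle timeout; real load balancers may differ in the details, and the function name is made up for illustration.

```python
def idles_out(activity_times, idle_timeout):
    """Return True if any gap between consecutive activity timestamps
    (in seconds) reaches idle_timeout."""
    gaps = (
        later - earlier
        for earlier, later in zip(activity_times, activity_times[1:])
    )
    return any(gap >= idle_timeout for gap in gaps)


PERIODIC_UPDATE = 60  # seconds between periodic updates from compute (from the post)
IDLE_TIMEOUT = 90     # load balancer idle timeout (from the post)

# A healthy client sends something every 60 seconds.
healthy = [i * PERIODIC_UPDATE for i in range(10)]

# A client that went silent after t=120 and is looked at again at t=300.
silent = [0, 60, 120, 300]

if __name__ == "__main__":
    print("healthy client dropped:", idles_out(healthy, IDLE_TIMEOUT))
    print("silent client dropped:", idles_out(silent, IDLE_TIMEOUT))
    print("healthy client with 60s timeout dropped:", idles_out(healthy, 60))
```

With a timeout equal to or below the update interval, healthy connections would be at risk of being closed between two updates, which is why the timeout needs to sit comfortably above 60 seconds.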

Update:

There are other advantages to the load balancer. The OpenStack RabbitMQ client does not distribute load well: it takes a list of RabbitMQ servers and blindly picks the first active one, without considering the actual load on each server. With the load balancer in place, we can distribute the consumers across all the RabbitMQ servers.
In our observations, this deployment did improve the overall performance of OpenStack.
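
The difference in distribution can be illustrated with a small simulation. This is a sketch only: the server names are hypothetical, and round-robin is assumed as the balancing policy purely for illustration, since the actual algorithm depends on how the load balancer is configured.

```python
from collections import Counter
from itertools import cycle


def assign_first_active(n_clients, servers, active):
    """Every client walks the same list and takes the first active server."""
    counts = Counter()
    for _ in range(n_clients):
        for server in servers:
            if server in active:
                counts[server] += 1
                break
        else:
            raise RuntimeError("no active server")
    return counts


def assign_round_robin(n_clients, servers, active):
    """Clients are spread over the active servers in turn (assumed LB policy)."""
    live = [s for s in servers if s in active]
    if not live:
        raise RuntimeError("no active server")
    pool = cycle(live)
    return Counter(next(pool) for _ in range(n_clients))


if __name__ == "__main__":
    servers = ["rabbit1", "rabbit2", "rabbit3"]  # hypothetical names
    active = set(servers)
    print("first active:", dict(assign_first_active(90, servers, active)))
    print("round robin: ", dict(assign_round_robin(90, servers, active)))
```

With all three servers up, the first strategy puts every consumer on the first server in the list, while the second spreads them evenly.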
