In one of our Service Fabric services, we host a TCP socket to feed data into another system. The implementation is simple: a regular TcpListener accepts clients, and we read from and write to them directly. As part of the behaviour of this socket, we wait for an initiation command after a client connects.
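For context, the hosting code was roughly along the lines of the sketch below. This is a simplified illustration rather than our exact production code; the names `SocketHost` and `HandleClientAsync`, the buffer size, and the command handling are made up for this post.

```csharp
// Simplified sketch of the socket host. Class and method names are
// illustrative; the real service does more validation and error handling.
using System.Net;
using System.Net.Sockets;
using System.Text;
using System.Threading.Tasks;

public class SocketHost
{
    public async Task RunAsync(int port)
    {
        var listener = new TcpListener(IPAddress.Any, port);
        listener.Start();

        while (true)
        {
            // Accept the next client and handle it without blocking the accept loop.
            TcpClient client = await listener.AcceptTcpClientAsync();
            _ = Task.Run(() => HandleClientAsync(client));
        }
    }

    private async Task HandleClientAsync(TcpClient client)
    {
        using (client)
        using (NetworkStream stream = client.GetStream())
        {
            // Wait (indefinitely, as it turned out) for the initiation command.
            var buffer = new byte[1024];
            int read = await stream.ReadAsync(buffer, 0, buffer.Length);
            string command = Encoding.UTF8.GetString(buffer, 0, read);

            // ... validate the command and start feeding data into the stream ...
        }
    }
}
```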
To host the service in a live cluster and open up that socket in Azure, we had to add a Load Balancing rule to the Azure Load Balancer, along with a Health Probe to determine which nodes the service is deployed to. The probe hits the same TCP socket and, according to the documentation, should simply try to connect, perform a 3-way handshake, and mark a node as healthy if that succeeds. This process turned out to be a bit more complicated.
When our other system, or a tester application, hit the public port, it timed out. It seemed the port simply wasn't being opened by the Load Balancer. When we ran our service locally, or logged in to a Service Fabric node and connected from there, everything worked fine. Only when we hit the public port as configured in the Load Balancer rule did we get timeouts.
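The tester application was nothing more elaborate than a connect attempt with a timeout, roughly like the sketch below; the host name, port, and 5-second timeout are placeholders.

```csharp
using System;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class PortTester
{
    public static async Task Main()
    {
        using (var client = new TcpClient())
        {
            Task connectTask = client.ConnectAsync("mycluster.westeurope.cloudapp.azure.com", 9000);

            // If the Load Balancer isn't forwarding traffic, the connect never completes.
            if (await Task.WhenAny(connectTask, Task.Delay(TimeSpan.FromSeconds(5))) != connectTask)
            {
                Console.WriteLine("Timed out - the public port is not passing traffic.");
                return;
            }

            await connectTask; // surfaces any connection error
            Console.WriteLine("Connected.");
        }
    }
}
```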
We noticed from our logging that the probe was connecting to our service at regular intervals - and that the connection succeeded - but the probe was still marking our service as unhealthy and ended up closing the public port entirely. If we moved the probe over to a port we knew was open on the machines (such as RDP on 3389), traffic got through and we could connect as expected. So we knew the problem was somehow related to how our code handled the TCP socket.
After some experimentation, we determined that the problem was that our socket was never closed. Our socket kept waiting for an initiation command, and the load balancer probe, which never sends one, timed out its connection after a while. Once we added our own timeout on receiving an initiation command, and gracefully shut down the connection if we didn't receive one, the probe started recognizing our service as healthy again.
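In code, the fix amounted to putting a timeout around that first read. The sketch below shows a revised `HandleClientAsync` that drops into the `SocketHost` sketch above (with `using System;` added for `TimeSpan`); the 10-second timeout and the exact shutdown sequence are illustrative rather than a copy of our production code.

```csharp
// Revised handler: bound the wait for the initiation command, and close the
// connection gracefully if it never arrives (which is all the health probe does).
private async Task HandleClientAsync(TcpClient client)
{
    using (client)
    using (NetworkStream stream = client.GetStream())
    {
        var buffer = new byte[1024];
        Task<int> readTask = stream.ReadAsync(buffer, 0, buffer.Length);

        if (await Task.WhenAny(readTask, Task.Delay(TimeSpan.FromSeconds(10))) != readTask)
        {
            // No initiation command within the timeout: shut down our side so the
            // peer (e.g. the probe) sees a clean close, then let 'using' dispose.
            client.Client.Shutdown(SocketShutdown.Both);
            return;
        }

        int read = await readTask;
        string command = Encoding.UTF8.GetString(buffer, 0, read);

        // ... validate the command and start feeding data into the stream ...
    }
}
```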
So, while the Azure Load Balancer documentation says a service will be marked as healthy if the TCP handshake succeeds, in practice it waits for the connection to close before doing so. If the connection never closes, the node gets marked as unhealthy and the public port won't pass through any traffic. Lessons learned :)