Navigating Cloud Security firewall

Insights from Our Experience with Firewalls in the Cloud

Context

In the digital age, cloud security is more critical than ever as businesses increasingly transition to the cloud. Cloud environments face unique security risks, such as data breaches and compliance challenges, making firewalls essential for safeguarding networks. By filtering traffic between different security zones, firewalls enforce security policies, defend against threats, and, with features like intrusion detection, help ensure compliance. As such, they play a central role in any comprehensive cloud security strategy.

While cloud providers such as AWS, Azure, and GCP offer best practices for deploying and managing firewalls, I firmly believe that simply adhering to these guidelines is insufficient for achieving truly effective cloud firewall management. To foster a culture of collaborative learning, I’d like to share the challenges we faced, the strategies we implemented, and the valuable lessons we learned throughout the process. I am confident that by sharing this collective knowledge, we can empower IT professionals to bolster their security practices and elevate the overall security posture of the cloud community.

Given the constraints of this page, I’ll focus on sharing our experience with just one cloud platform for now.

Our Journey into Cloud Firewalls

Before migrating to the cloud in 2016, we had already established a robust set of network security standards for our on-premise infrastructure. Our cloud journey began with the deployment of a POS (Point of Sale) application on Azure, and over time, we’ve adapted and expanded these standards to ensure the continued protection of our network from a security standpoint. This involved developing a comprehensive set of functional and non-functional requirements to meet evolving security challenges.

At that time, Azure did not offer a dedicated firewall solution; the only option was to use a Network Security Group (NSG).

An NSG is a resource used to filter network traffic to and from Azure resources in a virtual network. It contains a list of security rules that allow or deny traffic based on source/destination IP, port, and protocol. However, NSGs have a number of limitations:

  • Scope: NSGs operate at the network interface (NIC) or subnet level. While this is sufficient for many use cases, it can become limiting when more granular control is needed for complex networks.
  • Stateful Traffic: NSGs are stateful, meaning they track the state of network connections. However, they are designed primarily for inbound and outbound traffic filtering based on rules, and they lack deep inspection capabilities.
  • Limited Protocol Support: NSGs typically focus on Layer 3 and Layer 4 (IP addresses, ports, and protocols), which means they do not inspect traffic at higher layers (e.g., application layer).
  • No Threat Intelligence: NSGs lack advanced capabilities like automatic threat intelligence or application-layer filtering.
  • Limited Logging and Monitoring: NSGs provide basic logging via Azure Monitor, but they don't have advanced visibility or extensive monitoring features.

With no other choice, we had to rely on the Network Security Group (NSG) as our primary tool for traffic filtering. Below is the deployment model we implemented to secure our applications.

This is how we implemented network-level protection: we opened only essential ports, such as 443, to route HTTPS traffic to our backend application servers. We established two layers of security groups—one at the subnet level and another at the VM level—to minimize our attack surface. Additionally, traffic from the application layer to the backend database is strictly controlled by network security groups (NSGs).
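The two-layer check described above can be sketched in a few lines of Python. This is a simplified model and not Azure's implementation: the `Rule` class, the wildcard port value `0`, and the example rule sets are all hypothetical. It assumes only that rules are evaluated in priority order (lowest number first), that the first match wins, and that inbound traffic must pass both the subnet NSG and the VM NSG.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rule:
    priority: int  # lower number is evaluated first
    port: int      # destination port; 0 is a wildcard in this sketch
    allow: bool


def nsg_allows(rules: List[Rule], port: int) -> bool:
    """Return the decision of the first matching rule; deny if nothing matches."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.port in (0, port):
            return rule.allow
    return False


def inbound_allowed(subnet_nsg: List[Rule], vm_nsg: List[Rule], port: int) -> bool:
    """Inbound traffic must be allowed by the subnet NSG and then by the VM NSG."""
    return nsg_allows(subnet_nsg, port) and nsg_allows(vm_nsg, port)


# Hypothetical rule sets: only HTTPS is open, everything else is denied.
subnet_nsg = [Rule(100, 443, True), Rule(4096, 0, False)]
vm_nsg = [Rule(100, 443, True), Rule(4096, 0, False)]

print(inbound_allowed(subnet_nsg, vm_nsg, 443))  # True
print(inbound_allowed(subnet_nsg, vm_nsg, 22))   # False
```

Because both layers must allow a flow, a mistake in one layer does not expose the server on its own. The cost is that every legitimate flow needs a matching rule in each layer.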

That was one of the most challenging times we encountered on Azure during its early days. Fortunately, it was a relatively short period, and we were able to deploy virtual appliances from Check Point, FortiGate, and Palo Alto on Azure. Indeed, the arrival of real firewalls marked a significant turning point.

We thoroughly researched how to deploy firewall appliances on Azure and even developed a proof of concept (PoC) architecture. However, the experience at that time was far from ideal, which made us hesitant to move forward with the deployment. When Azure introduced its own native firewall solution, we decided to shift our focus to evaluating that product.

Microsoft presented a compelling case for the features and advantages of their firewall solution, especially in comparison to other appliances. Below are some of the standout features I found on Microsoft’s website:

The features demonstrated by Microsoft, such as intrusion detection and prevention (IDPS), threat intelligence, and TLS inspection, immediately caught our attention. These were exactly the key capabilities we needed.

We ran another round of PoC and reviewed all the pros and cons. Below are the major factors behind our decision to adopt Azure Firewall:

| Feature | Benefit |
| --- | --- |
| Fully Managed | No infrastructure management or maintenance required |
| High Availability | Automatically deployed in highly available configurations |
| Threat Intelligence | Real-time protection from known malicious IPs |
| Application & Network Layer Filtering | Stateful traffic inspection with application-level filtering |
| Ease of Use | Simplified deployment, rule configuration, and management |
| Scalable | Automatic scaling to handle varying traffic loads |
| Centralized Logging & Monitoring | Integration with Azure Monitor and Sentinel for enhanced visibility |
| Flexibility | Customizable rules and integration with other Azure services |
| Hybrid/Multi-Region Support | Supports deployments across multiple regions and hybrid environments |
| Cost-Effective | One-third cost saving compared with other appliance solutions, especially Palo Alto |

We deployed our first firewall during the setup of our new virtual datacenter in another region, using Azure's native firewall solution.

This setup marks a significant improvement over what we had before. We now have our first Web Application Firewall (WAF) and a dedicated firewall, with both inbound and outbound traffic fully protected by Azure Firewall. It seemed like the perfect solution—until we started encountering some unexpected challenges. Let’s take a closer look at the issues that came along with it.

Key Challenges Faced in Phase I (Single Cloud)

I will outline the challenges we encountered during this phase and share our solutions for mitigating these issues within a single cloud environment.

Balancing security rule management without compromising effectiveness

As you can see, we have many NSGs in our environment. With the firewall layer integrated into our network, we sometimes need to open as many as five rules to allow a single flow of traffic to pass through.

As requests continue to increase, configuring traffic as intended takes more and more time and becomes harder to do efficiently. This has been a frustrating experience for those responsible for implementing and managing the rules.

We ultimately adopted the following model to simplify access control based on these principles:

  1. We removed the NSGs attached to the machines, which eliminated two additional control layers.
  2. On the subnet NSG, we no longer manage outbound rules, opting for a single rule that allows any traffic.
  3. All traffic crossing VNETs is forced through the firewall for inspection, and it can reach its destination only if the firewall permits it. We only need to open one rule on the firewall to facilitate this.
  4. For traffic within the same VNET, we added a deny rule with the lowest priority, and we allow only specified traffic within the same VNET with higher-priority rules.

In summary, the firewall manages cross-VNET traffic, while the NSG handles traffic within the same VNET.
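As a rough illustration of this model, the Python sketch below decides which control point evaluates a flow and whether the flow is allowed. The `Flow` class, the allow-list sets, and the VNET and subnet names are hypothetical and exist only for this example. The real rules live in the Azure Firewall policy and in the subnet NSGs.

```python
from dataclasses import dataclass
from typing import Set, Tuple

AllowKey = Tuple[str, str, int]  # (source subnet, destination subnet, port)


@dataclass(frozen=True)
class Flow:
    src_vnet: str
    src_subnet: str
    dst_vnet: str
    dst_subnet: str
    port: int


def enforcement_point(flow: Flow) -> str:
    """Cross-VNET traffic goes to the firewall; intra-VNET traffic to the subnet NSG."""
    return "nsg" if flow.src_vnet == flow.dst_vnet else "firewall"


def is_allowed(flow: Flow, firewall_allow: Set[AllowKey], nsg_allow: Set[AllowKey]) -> bool:
    key = (flow.src_subnet, flow.dst_subnet, flow.port)
    if enforcement_point(flow) == "firewall":
        return key in firewall_allow  # a single firewall rule opens the path
    # Anything not explicitly allowed falls through to the lowest-priority deny rule.
    return key in nsg_allow


# Hypothetical allow lists.
firewall_allow = {("app", "db", 1433)}
nsg_allow = {("web", "app", 443)}

cross = Flow("vnet-a", "app", "vnet-b", "db", 1433)
intra = Flow("vnet-a", "web", "vnet-a", "app", 443)
blocked = Flow("vnet-a", "web", "vnet-a", "app", 22)

print(enforcement_point(cross), is_allowed(cross, firewall_allow, nsg_allow))      # firewall True
print(enforcement_point(intra), is_allowed(intra, firewall_allow, nsg_allow))      # nsg True
print(enforcement_point(blocked), is_allowed(blocked, firewall_allow, nsg_allow))  # nsg False
```

The point of the design is that each flow has exactly one place where an allow rule must be added, instead of up to five.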

 

Firewall performance tuning

Yes, performance issues are inevitable; sooner or later, you will encounter them. With the rise of DevOps and automation tools, the configuration and management of cloud firewalls have become increasingly automated, facilitating integration with CI/CD processes. We also utilize Terraform to automate the implementation and management of our firewall rules, which makes it easy to deploy and roll back changes. We established a naming standard for our rule implementations, which worked quite well for almost six months. However, as we continued to update more rules in our firewall policy, our monitoring metrics indicated that the latency of the firewall was progressively increasing.

The firewall performance became unstable. At times, we received reports that applications were running slowly, yet the overall metrics from the Azure portal indicated nothing abnormal.

After we engaged Microsoft support, they confirmed that some of the firewall nodes in the cluster were operating in a critical state, though this was not reflected in the front-end portal.

Another limitation is the number of rules that the firewall can handle. Azure Firewall can handle only 20,000 unique source/destinations in network rules. The count is calculated as follows:

Unique source/destinations in network = (Source addresses + Source IP Groups) * (Destination addresses + Destination Fqdn count + Destination IP Groups) * (IP protocols count) * (Destination ports)
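To make the formula concrete, here is a small Python sketch that applies it to two hypothetical rules. The `NetworkRule` class and the sample numbers are invented for illustration. Summing the per-rule results across the policy is my own assumption about how the 20,000 limit is applied, so treat the total as an estimate only.

```python
from dataclasses import dataclass

LIMIT = 20_000  # unique source/destinations in network rules


@dataclass(frozen=True)
class NetworkRule:
    source_addresses: int
    source_ip_groups: int
    destination_addresses: int
    destination_fqdns: int
    destination_ip_groups: int
    ip_protocols: int
    destination_ports: int

    def unique_source_destinations(self) -> int:
        """Apply the formula above to a single rule."""
        sources = self.source_addresses + self.source_ip_groups
        destinations = (self.destination_addresses
                        + self.destination_fqdns
                        + self.destination_ip_groups)
        return sources * destinations * self.ip_protocols * self.destination_ports


# Hypothetical rules; the counts are made up.
rules = [
    NetworkRule(10, 2, 20, 0, 1, 2, 5),  # (10+2) * (20+0+1) * 2 * 5 = 2520
    NetworkRule(5, 0, 10, 4, 0, 1, 3),   # (5+0) * (10+4+0) * 1 * 3 = 210
]

# Assumption: per-rule counts are summed across the whole policy.
total = sum(rule.unique_source_destinations() for rule in rules)
print(total, total <= LIMIT)  # 2730 True
```

Because the terms multiply, one rule with many sources, destinations, and ports can use up far more of the budget than dozens of narrow rules.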

Azure Firewall also has a limit on processing FQDNs in network rules. For good performance, we should not exceed 1,000 FQDNs across all network rules per firewall.

In addition, when the IDPS feature is turned on, the performance of Azure Firewall degrades dramatically.

Nevertheless, despite all the tuning recommendations from the Azure product team, it seems there isn’t much more we can do. For instance, they helped us provision more nodes than usual in the cluster. While these adjustments have reduced the frequency of issues, we hope to see further optimization of their firewall algorithms or any other helpful solutions.

What should we do? We decided to build additional firewall instances and separate their functions. The features we enabled on each firewall are listed below.

Firewall features:

  • TLS inspection
  • IDPS
  • URL filtering
  • Web categories
  • Threat intelligence
  • DNS proxy
  • Outbound SNAT
  • Inbound DNAT
  • Network traffic filtering rules
  • Application FQDN filtering rules

External firewall features:

  • IDPS
  • Threat intelligence
  • Inbound DNAT
  • Network traffic filtering rules
  • Application FQDN filtering rules

Hub firewall features:

  • TLS inspection
  • IDS
  • URL filtering
  • Web categories
  • Threat intelligence
  • DNS proxy
  • Outbound SNAT
  • Network traffic filtering rules
  • Application FQDN filtering rules

Our new deployment model has been incorporated into our virtual datacenter architecture in the third region, as illustrated below.

Based on our experience, separating the firewall features has greatly simplified our operations, particularly when it comes to IDPS capabilities. This separation has been a key factor in improving performance. Below are some of the key benefits we've gained from using two firewall configurations:

  1. Rule Optimization: We’ve been able to split the rules across two firewalls, reducing the overall number of rules in each. This has significantly improved processing times.
  2. Workload Distribution: Key functions can be allocated to different firewalls based on specific security requirements, allowing us to balance the load and enhance performance without compromising functionality.
  3. Scalability: By isolating the front-layer firewall functionality, we’ve made it easier to scale the front-end firewall in response to increased inbound internet traffic.
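One way to see how the workload is distributed is to treat the feature lists of the two firewalls as sets and compare them. The Python sketch below does that, using the feature names as listed earlier for the external and hub firewalls. The variable names are hypothetical, and the comparison is only a reading aid, not part of our deployment tooling.

```python
# Feature sets copied from the lists for the external and hub firewalls.
external = {
    "IDPS",
    "Threat intelligence",
    "Inbound DNAT",
    "Network traffic filtering rules",
    "Application FQDN filtering rules",
}
hub = {
    "TLS inspection",
    "IDS",
    "URL filtering",
    "Web categories",
    "Threat intelligence",
    "DNS proxy",
    "Outbound SNAT",
    "Network traffic filtering rules",
    "Application FQDN filtering rules",
}

shared = external & hub         # enabled on both firewalls
external_only = external - hub  # handled only at the internet edge
hub_only = hub - external       # handled only in the hub

print(sorted(shared))
print(sorted(external_only))
print(sorted(hub_only))
```

In this reading, inbound DNAT and full IDPS sit only on the external firewall, while TLS inspection, URL filtering, web categories, DNS proxy, and outbound SNAT sit only on the hub firewall.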

Visibility and Monitoring

With the rich features provided by the new firewall, there is a significant need for log storage, and integrating with other security tools presents its own challenges. Ensuring seamless integration between cloud firewalls and other security solutions—such as intrusion detection systems, SIEMs, and endpoint protection—can be complex, particularly in multi-cloud or hybrid environments. A key challenge we face in day-to-day operations is troubleshooting and monitoring.

To be honest, using Azure's Network Watcher for complex network troubleshooting is not straightforward.

With the setup outlined above, we can obtain near real-time traffic data for troubleshooting and monitoring. For example, we can use it to view the real-time traffic of a specific application, as shown below.

We can also view real-time deny logs for troubleshooting.

Logs ingested into a Log Analytics workspace are billed, and the cost is very high if you store everything in the workspace. Our solution is to keep 30 days of logs in the workspace for troubleshooting and network flow tracing. Logs older than 30 days are archived to storage accounts to meet legal compliance requirements.
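The retention rule itself is simple enough to sketch in Python. The function name `storage_tier` and the tier labels are hypothetical. In practice this is handled by workspace retention settings and export to storage, not by custom code, so the sketch only illustrates the 30-day cut-off.

```python
from datetime import datetime, timedelta, timezone

HOT_RETENTION = timedelta(days=30)  # logs kept in the workspace for troubleshooting


def storage_tier(log_time: datetime, now: datetime) -> str:
    """Return where a log record should live under the 30-day policy."""
    return "workspace" if now - log_time <= HOT_RETENTION else "archive"


# Fixed reference time so the example is reproducible.
now = datetime(2024, 6, 30, tzinfo=timezone.utc)

print(storage_tier(now - timedelta(days=10), now))  # workspace
print(storage_tier(now - timedelta(days=45), now))  # archive
```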

Closure

This brings us to the end of our discussion for now, but our journey continues. While we may not have covered everything in a single article, I look forward to the opportunity to share our success stories related to firewall automation and integration with ITSM tools, as well as the challenges of managing multi-cloud environments. Additionally, I’d like to explore the latest architecture we’re using on Alibaba Cloud.

Overall, the solution has become increasingly user-friendly, with many longstanding issues addressed through the provider's product evolution. However, this progress also introduces new challenges. In short, the cloud is in constant flux, and we must adapt dynamically.

I hope this offers valuable insights for those genuinely interested in the topic. It would be a great honor to know that someone finds value in these real experiences, rather than relying solely on best practice guides from documentation websites.