
Banning a bot IP that iptables could not ban


#fail2ban #iptables #tcpdump

Recently, I was configuring fail2ban on a server and testing that it actually works by looking at incoming logs. Thankfully, there were a few bots that consistently hit the server, so I had plenty of testers. However, there was one bot that I simply could not block! This post describes how I came to understand why I wasn’t able to block that one IP and how I managed to block it in the end. (Note: I’m far from a networking expert, so advanced readers might find the conclusion obvious.)

The server we’re interested in here hosts a gitea instance, which exposes both HTTP and SSH services. The same server hosts other services, so for ease of maintenance gitea is deployed as a docker container and sits behind traefik, a reverse proxy that handles TLS termination and is also deployed as a container.

Network overview

Part 1: Early wins with SSH

I started with gitea’s SSH service. As I understand it, fail2ban configuration consists of a few components – how to find the logs, how to parse them, and how to ban specific IPs.

As docker is configured to send its logs to journald and fail2ban supports a systemd backend, defining the log source is as easy as setting backend = systemd. However, this configuration alone would make fail2ban parse all system logs, most of which are not relevant here. There are multiple ways to filter journal data, but to find the relevant keys and values, one can use the -o json-pretty output mode (some output omitted for brevity):

$ journalctl -n 1 -o json-pretty --no-pager
{
    "_BOOT_ID" : "918f75cfc8a840dfa56ad2e3deaf7f71",
    "_CMDLINE" : "/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock",
    "_COMM" : "dockerd",
    "_EXE" : "/usr/bin/dockerd",
    "_PID" : "4732",
    "_SYSTEMD_UNIT" : "docker.service",
    "CONTAINER_ID" : "b20d1d4e4be8",
    "CONTAINER_NAME" : "git",
    "IMAGE_NAME" : "gitea/gitea:1.20.3",
    ...
}

In this case, we’re interested in logs from a specific docker container, so we can define the filter as follows.

journalmatch = _SYSTEMD_UNIT=docker.service CONTAINER_NAME=git

Next, parsing the logs is a matter of defining the failregex and ignoreregex parameters. To validate that the regexes match what they are supposed to match, use fail2ban-regex with --print-all-missed.

$ fail2ban-regex <source> <jail.conf>
$ fail2ban-regex systemd-journal /etc/fail2ban/filter.d/gitea-sshd.conf --print-all-missed

This way I found that the regexes in the default sshd jail didn’t work in this case, and I ended up slightly modifying this gist to arrive at a working configuration. All in all, the gitea-ssh specific configuration looked like this:

File: /etc/fail2ban/filter.d/gitea-sshd.conf
[INCLUDES]
before = common.conf

[Definition]
failregex = .+User .+ from <HOST> not allowed because not listed in AllowUsers\s*.+$
            .+(?:llegal|nvalid) user .* from <HOST>\s*.+$
            .+User .+ from <HOST> not allowed because listed in DenyUsers\s*.+$
            .+User .+ from <HOST> not allowed because not in any group\s*.+$
            .+refused connect from \S+ \(<HOST>\)\s*.+$
            .+Received disconnect from <HOST>: 3: \S+: Auth fail.+$
            .+Disconnected from invalid user .+ <HOST> .+$
            .+User .+ from <HOST> not allowed because a group is listed in DenyGroups\s*.+$
            .+User .+ from <HOST> not allowed because none of user's groups are listed in AllowGroups\s*.+$
ignoreregex = 

[Init]
maxlines = 1
journalmatch = _SYSTEMD_UNIT=docker.service CONTAINER_NAME=git
File: /etc/fail2ban/jail.d/gitea-jail.local
[gitea-sshd]
enabled     = true
chain       = DOCKER-USER
port        = 2222
backend     = systemd
banaction   = docker-action
File: /etc/fail2ban/action.d/docker-action.conf
[Definition]
actioncheck = iptables -n -L FORWARD | grep -q 'DOCKER-USER[ \t]'
actionban   = iptables -I DOCKER-USER -s <ip> -j DROP
actionunban = iptables -D DOCKER-USER -s <ip> -j DROP

And that… seemed to work! 🎉 (Note: all IPs in this post are redacted.)

2d3e3e202a3b: User root from 182.0.0.0 not allowed because not listed in AllowUsers
2d3e3e202a3b: Received disconnect from 182.0.0.0 port 27798:11: Bye Bye [preauth]
2d3e3e202a3b: Disconnected from invalid user root 182.0.0.0 port 27798 [preauth]
fail2ban.filter: INFO [gitea-sshd] Found 182.0.0.0 - 2023-09-06 21:10:30
...
2d3e3e202a3b: Invalid user yang from 182.0.0.0 port 25854
2d3e3e202a3b: Received disconnect from 182.0.0.0 port 25854:11: Bye Bye [preauth]
2d3e3e202a3b: Disconnected from invalid user yang 182.0.0.0 port 25854 [preauth]
fail2ban.filter: INFO [gitea-sshd] Found 182.0.0.0 - 2023-09-06 21:11:34
fail2ban.actions: NOTICE [gitea-sshd] Ban 182.0.0.0

Part 2: HTTP is going to be easy

Similar configurations were then applied to the HTTP service. The logs now come from an access.log written by traefik; fail2ban reads it and bans IPs based on that. Easy!
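
For context, the HTTP side follows the same pattern as the SSH configuration above – a filter matching suspicious access.log entries and a jail reusing docker-action. Roughly sketched (the jail name comes from the logs below, but the regex and log path here are illustrative rather than the exact values used):

File: /etc/fail2ban/filter.d/traefik-botsearch.conf (illustrative sketch)
[Definition]
# Flag requests from <HOST> that ended in a client error, like the bot's 404s.
failregex   = ^<HOST> - .* "(GET|POST|HEAD)[^"]*" (40[0134]) .*$
ignoreregex =
File: /etc/fail2ban/jail.d/traefik-jail.local (illustrative sketch)
[traefik-botsearch]
enabled   = true
filter    = traefik-botsearch
logpath   = /var/log/traefik/access.log
banaction = docker-action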

fail2ban.filter: INFO [traefik-botsearch] Found 51.0.0.0 - 2023-09-06 23:32:19
fail2ban.actions: WARNING [traefik-botsearch] 51.0.0.0 already banned

Wait a minute… WARNING: already banned means that fail2ban has already banned the IP in the past, but that IP can still access the resource, thus ending up again in the logs, which are parsed by fail2ban. So, something’s not right here. 🤔

The good thing was that there was one particularly persistent bot that kept sending the same request every 1-2 hours. This bot proved to be a good testing companion, as that interval gave me enough time to think of new ideas and implement them before the next request would come in… and pass through fail2ban again!

143.0.0.0 - - [06/Sep/2023:02:27:15 +0000] "GET / HTTP/1.1" 404 19 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 16_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Mobile/15E148 Safari/604.1" 834 "-" "-" 0ms

Part 3: Checking iptables rules

Let’s see… if fail2ban is able to notice all of this, then it’s not an issue with the parsing configuration. As defined in docker-action.conf (see above), the ban and unban actions insert and delete entries in iptables’ DOCKER-USER chain. Let’s verify that these rules are actually created:

$ iptables -S
    ...
    -A DOCKER-USER -s 182.0.0.0/32 -j DROP
    -A DOCKER-USER -s 5.0.0.0/32 -j DROP
    -A DOCKER-USER -s 167.0.0.0/32 -j DROP
    -A DOCKER-USER -j RETURN

Yeah, the rules are created alright. The listing can also be augmented to show per-rule packet counters:

$ iptables -L DOCKER-USER -n -v | egrep -v "    0     0"
Chain DOCKER-USER (1 references)
 pkts bytes target     prot opt in     out     source               destination
   20  1200 DROP       all  --  *      *       163.0.0.0            0.0.0.0/0
    2   104 DROP       all  --  *      *       147.0.0.0            0.0.0.0/0
 935K  212M RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0

Here it shows that some of the rules are getting used, as iptables has registered non-zero packet statistics on those rules. That is, fail2ban successfully creates iptables rules and iptables successfully uses them to drop traffic from specific hosts… However, the bot IP we’re interested in doesn’t register any packets – traefik continues to log a single request from that IP, approximately once per hour, and iptables shows:

$ iptables -L DOCKER-USER -n -v | grep '143.'
    0     0 DROP       all  --  *      *       143.0.0.0            0.0.0.0/0

Okay… 🤔 so, what if the rules are arranged in such a way that this request hits an earlier rule and gets a pass that way? iptables is actually a collection of tables and chains, and in the filter table incoming packets take one of two main paths: INPUT (for traffic addressed to the host itself) or FORWARD (for routed traffic, such as traffic to containers). The DOCKER-USER chain hangs off the FORWARD chain, so maybe these packets go through the INPUT chain instead and thus bypass the filter…?
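
A quick way to test this hypothesis would be to add count-only rules for the bot’s address at the top of both chains (a rule with no -j target doesn’t affect filtering; it only updates the packet counters) and then see which counter moves when the next request arrives:

$ iptables -I INPUT 1 -s 143.0.0.0/32
$ iptables -I FORWARD 1 -s 143.0.0.0/32
$ iptables -L INPUT -n -v | grep '143.'
$ iptables -L FORWARD -n -v | grep '143.'

The same rules can be removed afterwards with iptables -D.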

There seem to be a lot of different options and solutions that people suggest for debugging iptables routing (ref1, ref2, ref3, ref4). Not all of them worked on my system, but here are some of the commands I ran, for your inspiration. Some of these may work, some may not – this area is the most unclear to me.

$ modprobe ipt_LOG
$ iptables -t raw -A PREROUTING -s 143.0.0.0/32 -j TRACE
$ iptables -t raw -A OUTPUT -s 143.0.0.0/32 -j TRACE
$ iptables -A FORWARD -s 143.0.0.0/32 -j LOG --log-prefix "RULE4:" --log-level 7

Nevertheless, I was able to capture and view a single packet of my own, destined for the docker container. There’s a lot of info, but one thing is clear – the packet enters the FORWARD chain and then immediately jumps to the DOCKER-USER chain. So, if my packets destined for the docker container end up in the correct chains, it’s probable that the bot’s packets take the same path. In short, there’s absolutely no issue with the iptables rules…

$ xtables-monitor -t
PACKET: 2 a9939f32 IN=enp0s25 OUT=docker0 MACSRC=18:00:00:00:00:00 MACDST=f0:00:00:00:00:00 MACPROTO=0800 SRC=10.0.0.0 DST=172.0.0.0 LEN=60 TOS=0x0 TTL=63 ID=46275DF SPORT=39904 DPORT=8080 SYN
 TRACE: 2 a9939f32 filter:FORWARD:rule:0x19:JUMP:DOCKER-USER  -4 -t filter -A FORWARD -j DOCKER-USER
 TRACE: 2 a9939f32 filter:DOCKER-USER:return:
 TRACE: 2 a9939f32 filter:FORWARD:rule:0x14:JUMP:DOCKER-ISOLATION-STAGE-1  -4 -t filter -A FORWARD -j DOCKER-ISOLATION-STAGE-1
 TRACE: 2 a9939f32 filter:DOCKER-ISOLATION-STAGE-1:return:
 TRACE: 2 a9939f32 filter:FORWARD:rule:0x12:JUMP:DOCKER  -4 -t filter -A FORWARD -o docker0 -j DOCKER
 TRACE: 2 a9939f32 filter:DOCKER:rule:0x1a:ACCEPT  -4 -t filter -A DOCKER -d 172.0.0.0/32 ! -i docker0 -o docker0 -p tcp -m tcp --dport 8080 -j ACCEPT
 TRACE: 2 a9939f32 nat:POSTROUTING:return:
 TRACE: 2 a9939f32 nat:POSTROUTING:policy:ACCEPT

Okay, let’s recap. There’s a bot sending HTTP requests to traefik that end up in the access.log. We can notice this with fail2ban and create iptables rules to block it, but that does absolutely nothing to stop the bot. There aren’t many ideas left to investigate. So, what if we catch the packets as they arrive and see if their contents can tell us something?

Part 4: Using tcpdump to see the magic packets

Let’s use tcpdump to capture traffic with the specific bot IP and compare against access.log timestamps.

$ tcpdump host 143.0.0.0 -X

After a while, the access.log shows a new request… and tcpdump remains quiet. Waaait, what. How is it possible that a packet arrives but skips all the iptables rules and all of the tcpdump things and just magically appears in the access.log?? Let’s remove the filter and see what else is happening on the same machine at the same time.

$ tcpdump -w traffic.pcap

After a while, a new request comes in and is logged to access.log. Okay, opening that pcap in wireshark shows one HTTP request with the same headers. Looks promising so far. wireshark can show the whole TCP stream as a single conversation, so let’s open that up and see:

PROXY TCP4 143.0.0.0 178.0.0.0 50052 80
GET / HTTP/1.1
Host: 178.0.0.0
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 16_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Mobile/15E148 Safari/604.1
Accept: */*
Accept-Encoding: gzip

HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Thu, 07 Sep 2023 17:01:05 GMT
Content-Length: 19

404 page not found

The bot IP is there!!

And it’s proxied!!

And 178.0.0.0 is the IP of an ingress server!!

There was one crucial detail that I had completely forgotten – I had an ingress server. It was a piece of this infrastructure that I set up in the cloud a long time ago and then forgot about, as it never required any changes. The ingress server proxies 80/443 to traefik and that’s it, that’s all it does. It wraps the request in the PROXY protocol and traefik unwraps it; therefore the access logs show the client IP, even though traefik only ever communicates with the ingress. So, that’s the reason the iptables rules didn’t work. 🤦
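
For the HTTP side, the ingress is essentially an nginx stream proxy that wraps the connection in the PROXY protocol before passing it on to traefik – something along these lines (a simplified sketch mirroring the SSH block shown later in this post; the exact ports and options may differ):

stream {
    server {
        listen 80;
        proxy_protocol on;
        proxy_pass {{ home_ip }}:80;
    }
    server {
        listen 443;
        proxy_protocol on;
        proxy_pass {{ home_ip }}:443;
    }
}

On the receiving end, traefik’s entrypoints have a proxyProtocol option for trusting and unwrapping that header; without it, the access log would show the ingress IP rather than the client IP.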

Curiously, the SSH fail2ban rules worked because of a misconfiguration in the DNS records. You see, they were set up so that SSH traffic skipped the ingress server entirely. Hence, the gitea SSH server communicated directly with the clients, without any PROXY protocol involved.

Part 5: SSH and PROXY protocol

Alright, all of this setup was surprising and not really that great, so let’s do some cleanup. There are two improvements to be made here. First, the ingress server should be used for all traffic, so it should also forward SSH traffic. Second, fail2ban should be configured properly. And that’s it. But there’s an issue – OpenSSH doesn’t handle the PROXY protocol. And even if it did, the traefik host cannot do much with fail2ban, because all traffic comes from the ingress. Let’s first focus on proxying SSH traffic. To solve this, I tried a couple of things.

For one, using traefik itself to unwrap the TCP stream destined for the SSH service works – but since traefik initiates a new connection when doing so, the SSH service sees the traefik host’s IP as the client. Hence, this solution will not work with fail2ban.

Then, gitea seems to support the PROXY protocol as well. However, there seem to be some usability issues with the official gitea/gitea docker container – by default it uses a “normal” OpenSSH server, but gitea itself can also act as an SSH server. Setting START_SSH_SERVER=true breaks the container, as there are now two SSH servers fighting over one port, and there doesn’t seem to be an option to disable the OpenSSH server and replace it with gitea’s own. And the OpenSSH server doesn’t know how to handle the PROXY protocol:

kex_exchange_identification: client sent invalid protocol identifier "PROXY TCP4 84.0.0.0 178.0.0.0 55808 221"

Next, I found out about mmproxy and its alternative, go-mmproxy. There’s a blog post that goes into more detail, but the main point is that this proxy unwraps the PROXY protocol and spoofs the source IP, transparently. That is, any TCP application will “just work” with such a proxy between the ingress and the application. After setting up go-mmproxy, I saw the following logs:

fail2ban.filter: INFO [gitea-sshd] Found 172.17.0.1 - 2023-09-05 14:22:52

172.17.0.1 is a docker-internal IP. That is, go-mmproxy unwraps the PROXY protocol, but the connection then passes through docker’s bridge network, so the container sees docker as the client instead of the end user. To fix this, the gitea container had to be moved onto the host network instead. After all of these changes, the logs show that gitea’s SSH service correctly sees the actual client IP. (d8744aef4ef8 is the gitea container.)

d8744aef4ef8: Connection closed by invalid user root 84.0.0.0 port 47868 [preauth]
fail2ban.filter: INFO [gitea-sshd] Ignore 84.0.0.0 by ip
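
For reference, the host-network change boils down to running the container with docker’s host networking – something like this sketch (the actual deployment flags and volumes are omitted here):

# Run the gitea container on the host network so it sees the spoofed client
# address from go-mmproxy instead of the docker bridge gateway (172.17.0.1).
$ docker run -d --name git --network host gitea/gitea:1.20.3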

Network overview, with go-mmproxy. The PROXY protocol traffic is highlighted by violet arrows.

File: /etc/nginx/nginx.conf
stream {
    server {
        listen 221;
        proxy_protocol on;
        proxy_pass {{ home_ip }}:221;
    }
}
File: /etc/systemd/system/go-mmproxy.service

Based on the official go-mmproxy.service.example, with only ExecStart changed to:

ExecStart=/usr/bin/go-mmproxy -l 0.0.0.0:221 -4 127.0.0.1:222
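
One more detail worth noting: for the transparent spoofing to work, go-mmproxy’s README also calls for routing rules that steer return traffic back through the loopback interface – roughly these (table 123 is just the README’s example number):

# Deliver replies to spoofed source addresses back via loopback, so they reach
# go-mmproxy instead of being routed towards the real client IP.
$ ip rule add from 127.0.0.1/8 iif lo table 123
$ ip route add local 0.0.0.0/0 dev lo table 123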

Part 6: Wiring up fail2ban with the ingress server

The final piece of the puzzle is to make the ingress server aware of the improper use of resources happening on the application servers. Otherwise, the ingress server doesn’t have the context to block anything, and the application servers cannot block anything themselves, as all of their traffic comes from a single host. The solution is to transfer that context somehow; there are similar solutions built for Cloudflare: ref1, ref2.

A new application is deployed on the ingress server, whose only purpose is to give external callers access to the local fail2ban-client. This doesn’t require much, so its full source is below. It’s then wrapped into a systemd service and managed automatically from there on.

from flask import Flask, request
import json
import subprocess

app = Flask(__name__)

# Expects a JSON body like {"mode": "ban", "target": "1.2.3.4"} and relays it
# to the local fail2ban instance via fail2ban-client.
@app.route('/', methods = ['POST'])
def root():
    data = json.loads(request.data.decode("utf-8"))
    if data['mode'] == 'ban':
        subprocess.run(["/usr/bin/fail2ban-client", "set", "sshd", "banip", data['target']])
    elif data['mode'] == 'unban':
        subprocess.run(["/usr/bin/fail2ban-client", "set", "sshd", "unbanip", data['target']])
    return "ok"

if __name__ == '__main__':
    app.run(host="{{ ingress_ip }}", port=8000)

And fail2ban on the application server’s side is extended with a new action, which connects to the server above to ban/unban specific IPs.

[Definition]
actioncheck = 
actionban = curl -s -X POST "{{ ingress_ip }}:8000" \
                -H "Content-Type: application/json" \
                --data '{"mode":"ban", "target":"<ip>"}'

actionunban = curl -s -X POST "{{ ingress_ip }}:8000" \
                -H "Content-Type: application/json" \
                --data '{"mode":"unban", "target":"<ip>"}'
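
A quick way to exercise the whole chain by hand (a sketch; 198.51.100.7 is just a documentation address) is to post a ban request to the helper and then check the jail on the ingress:

$ curl -s -X POST "{{ ingress_ip }}:8000" \
      -H "Content-Type: application/json" \
      --data '{"mode":"ban", "target":"198.51.100.7"}'
$ fail2ban-client status sshd   # run on the ingress server; lists banned IPs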

The connection between these two is naturally kept private with a VPN tunnel. But overall, that’s all it takes to make the application server’s fail2ban and the ingress server’s fail2ban work together.

In the end, I also moved the SSH services away from the default port 22, which was probably more effective at reducing log spam than anything else outlined in this post. But overall, I definitely learned a lot working with these different tools and the PROXY protocol in practice.