2020-04-23 HTTPS caching proxy with Maven 2/2

Configuring a Maven container to use Squid

Goal

Make sure the SECOND and subsequent maven container builds consume cached library files from the local proxy

Background

When I first started configuring Maven with Docker, I could not get a cached layer in the container and assumed my Docker configuration was at fault: it was not. Pulling from the Maven repositories is considered non-deterministic, so the docker build step interrogates the repositories every time it is called. The exception is the “offline” directive, which forces mvn to work with the files it already has. But if the Dockerfile is changed in any way before the mvn statement, all the *.jar files are pulled down again, and that takes about 15 minutes. Normally Maven does a good job of caching JAR files locally, and the caching problem (i.e. the process of re-interrogating external Maven repositories) can be simulated by deleting the files in the local user’s Maven repo cache, ~/.m2/repository/* .

Detailed Learning

Note: I’m going to refer to a helper function called log_format. It’s simply an alias for perl -p -e 's/^([0-9]*)/"[".localtime($1)."]"/e'. I recommend you put the following in your ~/.bashrc and re-source the file; it converts the seconds-since-epoch timestamps into a human-readable date format.

$ tail ~/.bashrc
#...
alias log_format="perl -p -e 's/^([0-9]*)/\"[\".localtime(\$1).\"]\"/e'"
#...

Get the container using the Proxy

Let’s start from scratch with a Dockerfile using an image curated by the Apache Maven team:

$ docker run --interactive --tty --rm --name=mvn \
   maven:3.6.3-jdk-11 bash
root@7a7807de0fa2:/# exit
$

Let’s now wrap this up as a Dockerfile of our own that we will add to:

$ cat ./Dockerfile
FROM maven:3.6.3-jdk-11
CMD ["bash"]
$ docker build --no-cache --tag mvn-proxy:0.1 .
$ docker run --interactive --tty --rm --name=mvn mvn-proxy:0.1 bash
root@1af26683abcb:/project# exit
$

Let’s start hooking it up to our freshly minted Squid 4.11 service on the host. First, on the host, identify the host IP that acts as the gateway for Docker instances:

$ ip addr show dev docker0 | \
   sed -n '/inet/s/\s*inet \([^ ]\+\) .*/\1/p'
172.17.0.1/16

$ sudo netstat -letpn
Proto Local-Address State  PID/Program name    
tcp6  :::3128       LISTEN 8532/(squid-1)
$

So given that our Squid instance is listening on IPv4 and IPv6 on all devices on port 3128, and that the docker gateway is 172.17.0.1, the appropriate setting for http_proxy and https_proxy will be http://172.17.0.1:3128 . Note that in Docker, environment variables are not available at build time, so the value needs to be passed on the CLI as --build-arg http_proxy=http://172.17.0.1:3128 and declared in the container’s Dockerfile as ARG http_proxy.
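
As a small sketch of how the proxy URL could be derived rather than typed by hand, the function below strips the prefix length from the address and builds the URL. The function name and the sample `ip addr` text are illustrative assumptions; on a real host the input would come from `ip addr show dev docker0`.

```shell
# Print the first IPv4 address found in `ip addr show` output on stdin,
# with the /prefix-length suffix removed.
docker_gateway_ip() {
  awk '$1 == "inet" { split($2, parts, "/"); print parts[1]; exit }'
}

# Illustrative stand-in for the output of: ip addr show dev docker0
sample_ip_output='4: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0'

gateway=$(printf '%s\n' "$sample_ip_output" | docker_gateway_ip)
proxy_url="http://${gateway}:3128"
echo "$proxy_url"

# On a real host (assuming docker0 exists) the build could then be run as:
#   docker build --build-arg http_proxy="$proxy_url" --tag mvn-proxy:0.1 .
```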

$ cat ./Dockerfile
FROM maven:3.6.3-jdk-11
ARG http_proxy
RUN apt-get update && \
    apt-get install --yes vim && \
    mkdir -p /project/api
CMD ["bash"]
$ docker build --build-arg http_proxy=http://172.17.0.1:3128 \
   --no-cache --tag mvn-proxy:0.1 . # build this with proxy

$ sudo tail -n1 /opt/squid-4.11/var/log/access.log | log_format
[Wed Apr 22 14:50:17 2020].355    225 172.17.0.2 
   TCP_MISS/200 1281190 
   GET http://deb.debian.org/.../vim_8.1.0875-5_amd64.deb 
   - HIER_DIRECT/151.101.106.133 application/x-debian-package

Good: the last line of the Squid access.log shows that the vim package install is being pulled via the proxy. TCP_MISS says it didn’t come from the Squid cache and Squid had to reach out to the Debian repos. Running the rebuild again without the docker cache shows a repeated TCP_MISS, so we need to configure squid-4.11’s squid.conf file to start caching locally.

Setting up squid 4.11 for caching

Let’s restore the capability we obtained from the original changes we made to Squid in the previous blog post here.

$ sudo vim /opt/squid-4.11/etc/squid.conf
#...
# Uncomment and adjust the following to add a disk cache directory.
cache_dir ufs /opt/squid-4.11/var/swap 100 16 256
maximum_object_size 10 MB
#...
$ sudo chgrp proxy /opt/squid-4.11/var/swap
$ sudo chmod g+w /opt/squid-4.11/var/swap
$ sudo systemctl reload squid-4.11
$ docker build --build-arg http_proxy=http://172.17.0.1:3128 \
   --no-cache --tag mvn-proxy:0.1 .

First we can prove that we’re now caching files:

$ sudo find /opt/squid-4.11/var/swap/ -type f -ls
...
  1844755      4 -rw-r-----   1 proxy    proxy         864 Apr 22 15:29 /opt/squid-4.11/var/swap/swap.state
  1844764   5644 -rw-r-----   1 proxy    proxy     5775428 Apr 22 15:29 /opt/squid-4.11/var/swap/00/00/00000009
  1844765   1252 -rw-r-----   1 proxy    proxy     1281274 Apr 22 15:29 /opt/squid-4.11/var/swap/00/00/0000000A
...

Running the docker build again should now access the files in the cache:

$ sudo tail -n1 /opt/squid-4.11/var/log/access.log | log_format
[Wed Apr 22 15:37:39 2020].582     19 172.17.0.2
   TCP_REFRESH_UNMODIFIED/200 1281228 
   GET http://deb.debian.org/.../vim_8.1.0875-5_amd64.deb 
   - HIER_DIRECT/151.101.106.133 application/x-debian-package

Good, so it’s using the file from cache, but checking the internet for confirmation that it hasn’t changed. We will come back to this REFRESH statement later.
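
When the log grows beyond a line or two, a per-result-code tally is easier to read than tailing. The sketch below assumes Squid’s native log layout, where the fourth field is code/status; the function name is made up, and the two sample lines are the entries shown above rewritten with raw epoch timestamps (those epoch values are assumptions).

```shell
# Count access.log entries per Squid result code (field 4, the part before "/").
squid_result_summary() {
  awk '{ split($4, code, "/"); count[code[1]]++ }
       END { for (c in count) print c, count[c] }' | LC_ALL=C sort
}

# Raw-format versions of the two log entries shown above (epochs assumed).
sample_log='1587538217.355    225 172.17.0.2 TCP_MISS/200 1281190 GET http://deb.debian.org/.../vim_8.1.0875-5_amd64.deb - HIER_DIRECT/151.101.106.133 application/x-debian-package
1587541059.582     19 172.17.0.2 TCP_REFRESH_UNMODIFIED/200 1281228 GET http://deb.debian.org/.../vim_8.1.0875-5_amd64.deb - HIER_DIRECT/151.101.106.133 application/x-debian-package'

summary=$(printf '%s\n' "$sample_log" | squid_result_summary)
echo "$summary"
```

Against the real file, the function would be fed with something like `sudo cat /opt/squid-4.11/var/log/access.log | squid_result_summary`.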

Maven and HTTPS

Now this was not a lesson on APT but on Maven. Maven’s Java repositories are held at HTTPS sites. This is good because it means that artifacts are guaranteed by cryptography to have been provided by the source they claim to come from, i.e. not intercepted by a man-in-the-middle and substituted. Let’s set Maven up. We will copy in a project file called pom.xml and pull down some dependencies from external Maven repositories.

<?xml version="1.0" encoding="UTF-8"?>
<project
    xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.foo</groupId>
  <artifactId>bar</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <dependencies>
    <dependency>
      <groupId>com.google.collections</groupId>
      <artifactId>google-collections</artifactId>
      <version>1.0</version>
    </dependency>
  </dependencies>
  <build>
    <directory>lib</directory>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-dependency-plugin</artifactId>
        <configuration>
          <outputDirectory>
            ${project.build.directory}
          </outputDirectory>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

In the Dockerfile we shall remove the apt statements, and zero out the access.log for clarity. Note that instead of mvn clean or mvn package I’m using mvn dependency:go-offline which is a convenience function to pull down all external dependencies so that building packages can be done without internet connectivity:

$ echo '' | sudo tee /opt/squid-4.11/var/log/access.log #clear log
$ cat ./Dockerfile
FROM maven:3.6.3-jdk-11 as api-mvn-init
ARG http_proxy
#RUN apt-get update && \
#    apt-get install --yes vim
RUN    mkdir -p /project/api
WORKDIR /project
COPY pom.xml /project/api/
RUN mvn --file /project/api dependency:go-offline
$ docker build --no-cache --tag mvn-proxy:0.1 .
$ sudo cat /opt/squid-4.11/var/log/access.log # produces nada!

Now we will direct the Maven client to use the proxy by amending the user’s Maven settings in the settings.xml file. The ~/.m2/settings-docker.xml file only contains a setting that references /usr/share/maven/ref/repository. This needs to be extended and renamed to ~/.m2/settings.xml.

<settings
  xmlns="http://maven.apache.org/SETTINGS/1.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0
  https://maven.apache.org/xsd/settings-1.0.0.xsd">
  <localRepository>/usr/share/maven/ref/repository</localRepository>
  <proxies>
    <proxy>
      <id>example-proxy</id>
      <active>true</active>
      <protocol>https</protocol>
      <host>172.17.0.1</host>
      <port>3128</port>
    </proxy>
  </proxies>
</settings>

You need one proxy entry per protocol, so if you want http as well, another proxy entry is required. Now let’s run docker build again and check Squid’s access.log.
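
For the case where both http and https entries are wanted, here is a rough sketch that writes a settings file with one entry per protocol. The function name, the ids squid-http and squid-https, and the scratch output path are all illustrative assumptions; the host, port and localRepository values are the ones used above.

```shell
# Write a Maven settings.xml with one <proxy> entry per protocol.
# Assumptions: the ids "squid-http" and "squid-https" are arbitrary labels,
# and the output path is a scratch file, not the real ~/.m2/settings.xml.
write_proxy_settings() {
  out_file=$1
  proxy_host=$2
  proxy_port=$3
  {
    echo '<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0">'
    echo '  <localRepository>/usr/share/maven/ref/repository</localRepository>'
    echo '  <proxies>'
    for protocol in http https; do
      cat <<EOF
    <proxy>
      <id>squid-${protocol}</id>
      <active>true</active>
      <protocol>${protocol}</protocol>
      <host>${proxy_host}</host>
      <port>${proxy_port}</port>
    </proxy>
EOF
    done
    echo '  </proxies>'
    echo '</settings>'
  } > "$out_file"
}

scratch_dir=$(mktemp -d)
write_proxy_settings "$scratch_dir/settings.xml" 172.17.0.1 3128
proxy_count=$(grep -c '<proxy>' "$scratch_dir/settings.xml")
echo "proxy entries written: $proxy_count"
```

The generated file could then be copied into the image in place of a hand-written settings.xml.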

docker build --no-cache --tag mvn-proxy:0.1 .

Checking back with the host’s logs, Squid’s access.log now shows that the docker container is passing its mvn external requests through the host’s proxy:

sudo cat /opt/squid-4.11/var/log/access.log | log_format
...
[Thu Apr 23 08:52:25 2020].117  42385 172.17.0.2 
   TCP_TUNNEL/200  
   8504288 CONNECT repo.maven.apache.org:443 -   
   HIER_DIRECT/151.101.40.215 -
[Thu Apr 23 08:52:25 2020].117  42387 172.17.0.2 
   TCP_TUNNEL/200 
   738504 CONNECT repo.maven.apache.org:443 - 
   HIER_DIRECT/151.101.40.215 -
[Thu Apr 23 08:52:25 2020].117  42389 172.17.0.2 
   TCP_TUNNEL/200 
   555829 CONNECT repo.maven.apache.org:443 - 
   HIER_DIRECT/151.101.40.215 -
[Thu Apr 23 08:52:25 2020].117  42388 172.17.0.2 
   TCP_TUNNEL/200 
   1232867 CONNECT repo.maven.apache.org:443 - 
   HIER_DIRECT/151.101.40.215 -
[Thu Apr 23 08:52:25 2020].117 133748 172.17.0.2 
   TCP_TUNNEL/200 
   2520133 CONNECT repo.maven.apache.org:443 - 
   HIER_DIRECT/151.101.40.215 -
...

Note: mvn does not require docker build to pass a --build-arg https_proxy= environment variable here, as it’s hard coded into the settings file and will be used regardless of the container environment.

Squid HTTPS Caching, Bumping and Splicing.

Note that although we now have the apt-get and mvn processes running through the host’s Squid proxy, the TCP_TUNNEL being logged against mvn requests shows that Squid is permitting a cryptographically secure tunnel through to the target repo. This tunnel cannot be inspected or cached; Squid simply acts as a traffic director. To permit HTTPS request caching, the user needs to permit Squid to decrypt the request, repackage it and send it on. Without the user’s permission this is the very definition of a man-in-the-middle attack; here you are performing it on yourself. From this point the Squid administrator becomes responsible for the content provided to the docker client, because the package accessibility, timeliness and freshness guarantees of HTTPS are now voided and passed to the Squid admin. This is serious business: you need to trust the software and the admin. Compiling Squid from source obtained from the Squid website is part one of this trust assurance.
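
To see how much traffic is passing through opaque tunnels, the TCP_TUNNEL entries can be totalled per destination. This is only a sketch: the function name is made up, and the sample lines are the five CONNECT entries shown above rewritten in Squid’s raw single-line format with an assumed epoch timestamp.

```shell
# For TCP_TUNNEL entries, print: destination, number of tunnels, total bytes.
# Assumes the native layout: field 4 = code/status, 5 = bytes, 7 = destination.
tunnel_totals() {
  awk '$4 ~ /^TCP_TUNNEL\// { n[$7]++; bytes[$7] += $5 }
       END { for (h in n) printf "%s %d %d\n", h, n[h], bytes[h] }'
}

# Raw-format versions of the tunnel entries shown above (epoch assumed).
sample_log='1587603145.117  42385 172.17.0.2 TCP_TUNNEL/200 8504288 CONNECT repo.maven.apache.org:443 - HIER_DIRECT/151.101.40.215 -
1587603145.117  42387 172.17.0.2 TCP_TUNNEL/200 738504 CONNECT repo.maven.apache.org:443 - HIER_DIRECT/151.101.40.215 -
1587603145.117  42389 172.17.0.2 TCP_TUNNEL/200 555829 CONNECT repo.maven.apache.org:443 - HIER_DIRECT/151.101.40.215 -
1587603145.117  42388 172.17.0.2 TCP_TUNNEL/200 1232867 CONNECT repo.maven.apache.org:443 - HIER_DIRECT/151.101.40.215 -
1587603145.117 133748 172.17.0.2 TCP_TUNNEL/200 2520133 CONNECT repo.maven.apache.org:443 - HIER_DIRECT/151.101.40.215 -'

totals=$(printf '%s\n' "$sample_log" | tunnel_totals)
echo "$totals"
```

None of the bytes counted here can be served from cache, which is the motivation for the bumping setup that follows.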

Trusting the squid server.

Once a Squid server is trusted, clients will accept HTTPS responses signed by Squid. Communication to Squid is encrypted with Squid’s public key, which Squid can decrypt, inspect and send on. We need to create both keys; the public key becomes the Squid proxy’s certificate.

#on the host
VERSION='4.11'
#need to use '-E' with sudo to pass VERSION to sudo processes...
$ sudo -E mkdir /opt/squid-${VERSION}/certs
$ cd /opt/squid-${VERSION}/certs
$ sudo openssl req -new -newkey rsa:2048 -nodes -x509 -sha256 \
   -extensions v3_ca -days 365 \
   -keyout squid-ca-key.pem \
   -out squid-ca-cert.pem \
   -subj "/C=AU/ST=WA/L=Perth/O=D2I Pty Ltd/OU=Innovation/CN=squid.d2i.net.au/emailAddress=innovation@d2i.net.au"
$ sudo cat squid-ca-cert.pem squid-ca-key.pem | \
   sudo tee squid-ca-cert-key.pem
$ sudo -E chown -R proxy:proxy /opt/squid-${VERSION}/certs
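
Since Squid is later pointed at the combined file, a quick sanity check that it really contains both a certificate and a private key can save a confusing startup failure. The sketch below only looks for the PEM markers and does not validate the contents; the function name is an assumption and the demo uses placeholder files in a scratch directory rather than the real certs.

```shell
# Succeeds only if the PEM file holds at least one certificate block and one
# private key block. It checks the BEGIN markers, not the encoded contents.
pem_has_cert_and_key() {
  grep -q -e '-----BEGIN CERTIFICATE-----' "$1" &&
    grep -Eq -e '-----BEGIN ([A-Z]+ )?PRIVATE KEY-----' "$1"
}

# Placeholder files standing in for the real PEM output.
scratch_dir=$(mktemp -d)
printf '%s\n' '-----BEGIN CERTIFICATE-----' 'placeholder' \
  '-----END CERTIFICATE-----' > "$scratch_dir/cert-only.pem"
{
  cat "$scratch_dir/cert-only.pem"
  printf '%s\n' '-----BEGIN PRIVATE KEY-----' 'placeholder' \
    '-----END PRIVATE KEY-----'
} > "$scratch_dir/bundle.pem"

if pem_has_cert_and_key "$scratch_dir/bundle.pem"; then bundle_ok=yes; else bundle_ok=no; fi
if pem_has_cert_and_key "$scratch_dir/cert-only.pem"; then cert_only_ok=yes; else cert_only_ok=no; fi
echo "bundle: $bundle_ok, cert only: $cert_only_ok"
```

The optional word before PRIVATE KEY in the pattern allows for headers such as RSA PRIVATE KEY, which some OpenSSL versions emit.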

As yet we have not used this incarnation of Squid beyond the functionality that we originally had from the Ubuntu 19.10 distributed Squid version. The two new compilation directives only become useful from this point on. The --enable-ssl-crtd functionality permits secure storage of SSL keys and certs; the default location is ${squid-swap-dir}/ssl_db. In our case, we compiled with:

VERSION='4.11' --prefix=/opt/squid-${VERSION} --with-swapdir=${prefix}/var/swap

So the location of the SSL database is /opt/squid-4.11/var/swap/ssl_db. Once again it’s important that /opt/squid-4.11/var/swap is writable by the proxy user. We have already done this above, but we also need to make sure that we create this database as the proxy user, because if it is created as root, Squid will fail to read from or write to it and will refuse to start.

sudo -Eu proxy /opt/squid-${VERSION}/lib/security_file_certgen -c \
   -s /opt/squid-${VERSION}/var/swap/ssl_db -M 16MB

Certs are associated with the serviced port within squid.conf, against the relevant http_port or https_port directive; they are not pushed directly into this ssl_db. We now need to make the appropriate changes to the squid.conf file.

$ sudo cat /opt/squid-4.11/etc/squid.conf
#...
http_port 3128 \
  ssl-bump \
  generate-host-certificates=on \
  dynamic_cert_mem_cache_size=4MB \
  cert=/opt/squid-4.11/certs/squid-ca-cert-key.pem

sslcrtd_program /opt/squid-4.11/lib/security_file_certgen \
   -s /opt/squid-4.11/var/swap/ssl_db -M 16MB
acl step1 at_step SslBump1
ssl_bump peek step1
ssl_bump bump all
ssl_bump splice all
#...
$ sudo -u proxy /opt/squid-4.11/sbin/squid -k parse
$ sudo systemctl reload squid-4.11
$ sudo systemctl status squid-4.11
squid-4.11.service - Squid Web Proxy Server
   Loaded: loaded 
      (/etc/systemd/system/squid-4.11.service; 
      disabled; vendor preset: enabled)
   Active: active (running) 
       since Thu 2020-04-23 12:15:12 AWST; 6s ago
     Docs: man:squid(8)
  Process: 8194 ExecStartPre=/opt/squid-4.11/sbin/squid \
            --foreground -z (code=exited, status=0/SUCCESS)
  Process: 8210 ExecStart=/opt/squid-4.11/sbin/squid -sYC \
            (code=exited, status=0/SUCCESS)
 Main PID: 8211 (squid)
    Tasks: 9 (limit: 4915)
   Memory: 15.7M
   CGroup: /system.slice/squid-4.11.service
           ├─8211 /opt/squid-4.11/sbin/squid -sYC
           ├─8213 (squid-1) --kid squid-1 -sYC
           ├─8221 (security_file_certgen) -s /.../ssl_db -M 16MB
           ├─8222 (security_file_certgen) -s /.../ssl_db -M 16MB
           ├─8224 (security_file_certgen) -s /.../ssl_db -M 16MB
           ├─8225 (security_file_certgen) -s /.../ssl_db -M 16MB
           ├─8228 (security_file_certgen) -s /.../ssl_db -M 16MB
           ├─8230 (logfile-daemon) /.../access.log
           └─8231 (unlinkd)

Apr 23 12:15:13 t450 squid[8213]: 0 Objects expired.
Apr 23 12:15:13 t450 squid[8213]: 0 Objects cancelled.
Apr 23 12:15:13 t450 squid[8213]: 0 Duplicate URLs purged.
Apr 23 12:15:13 t450 squid[8213]: 0 Swapfile clashes avoided.
Apr 23 12:15:13 t450 squid[8213]: Took 0.01 sec (3279 objects/sec).
Apr 23 12:15:13 t450 squid[8213]: Beginning Validation Procedure
Apr 23 12:15:13 t450 squid[8213]:   Completed Validation Procedure
Apr 23 12:15:13 t450 squid[8213]:   Validated 38 Entries
Apr 23 12:15:13 t450 squid[8213]:   store_swap_size = 31964.00 KB
Apr 23 12:15:14 t450 squid[8213]: storeLateRelease: 0 objects

We now require our clients to trust this new “Authority”. Hence we will load the Squid certificate into the container’s Java keystore; this tells the Maven process to trust the Squid certificates spliced into its responses to Maven requests. This certificate must be passed to the container and loaded into the JRE’s cacerts file.

$ cp /opt/squid-4.11/certs/squid-ca-cert.pem .
$ cat ./Dockerfile
FROM maven:3.6.3-jdk-11 as api-mvn-init
RUN    mkdir -p /project/api
WORKDIR /project
COPY pom.xml /project/api/
COPY settings.xml /root/.m2/
COPY squid-ca-cert.pem /tmp/
RUN keytool -v -alias mavensrv -import \
    -file /tmp/squid-ca-cert.pem \
    -storepass changeit \
    -trustcacerts -noprompt -cacerts
RUN mvn --file /project/api dependency:go-offline
$ docker build --no-cache --tag mvn-proxy:0.1 .
sudo tail -n1 /opt/squid-4.11/var/log/access.log | log_format

[Thu Apr 23 12:34:14 2020].579    208 172.17.0.2 
   TCP_MISS/200 645 
   GET https://repo.maven.apache.org/.../{filename}{jar|sha1|pom} 
   - HIER_DIRECT/151.101.40.215 text/plain

We now see a different result in the access log for Maven’s requests. Rather than a few TCP_TUNNEL events, Squid can now “bump” the connection and inspect and cache the *.sha1, *.pom and *.jar files. We presently have a lot of TCP_MISS statements, indicating that Squid could inspect and identify each file but couldn’t serve it from the cache, as it wasn’t there yet. We can also prove that we are caching by inspecting the many new files in Squid’s cache directory:

sudo find /opt/squid-4.11/var/swap/ -type f | \
   wc -l # produces a line count of over 700 new files

Running this build again produces a build time of:

time docker build --no-cache --tag mvn-proxy:0.1 .
...
real 2m21.653s
user 0m0.476s
sys 0m0.454s

Maven is opinionated and refuses intermediate cache

Some improvement, but not a great one. The problem stems from some rather firm headers that Maven’s default settings spit out to refuse intermediate caching and to force reloading/refresh (refer here). These basically say to any intermediate proxy: under no circumstances do I want you to cache or store anything, and if you do, I consider it “stale” the second it hits your store:

Cache-control: no-cache
Cache-store: no-store
Pragma: no-cache
Expires: 0
Accept-Encoding: gzip

Making Squid use the “Nuclear” performance settings

However, since you have told Java that you absolutely trust everything that comes from the Squid proxy, Maven can now be forced to eat it. Now Squid can step it up. You officially take responsibility for the WWW content, and apply the nuclear settings:

offline_mode on #never revalidate from the net
ignore-reload #disrespect client request to check the external copies
ignore-no-store #disrespect client request to not store
ignore-private #disrespect client request not to pull from caches

$ cat squid.conf
#place these settings before the first refresh_pattern directive:
#...
offline_mode on
refresh_pattern (\.jar$|\.pom$|\.sha1$) 1440    20%     10080 \
   ignore-reload ignore-no-store ignore-private
#...
$ sudo -Eu proxy /opt/squid-${VERSION}/sbin/squid -k parse
$ sudo systemctl reload squid-4.11.service
$ time docker build --no-cache --tag mvn-proxy:0.1 .
...
real 0m14.105s
user 0m0.391s
sys 0m0.446s
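
Before relying on a refresh_pattern, it can be worth checking which URLs a regular expression actually selects, since one misplaced backslash silently excludes a file type. The sketch below uses grep -E as a stand-in for Squid’s matcher, which is an assumption (both are taken here to behave like POSIX extended regular expressions); the pattern is one intended to match .jar, .pom and .sha1 files, and the URLs are illustrative paths modelled on the dependency in the POM above.

```shell
# Pattern intended to match versioned Maven artifacts and their checksums.
pattern='(\.jar$|\.pom$|\.sha1$)'

# Illustrative URLs; the exact paths are assumptions based on the POM dependency.
base='https://repo.maven.apache.org/maven2/com/google/collections/google-collections'
urls="$base/1.0/google-collections-1.0.jar
$base/1.0/google-collections-1.0.pom
$base/1.0/google-collections-1.0.jar.sha1
$base/maven-metadata.xml"

# Count the URLs the pattern selects, and list the ones it leaves alone.
matched=$(printf '%s\n' "$urls" | grep -E -c "$pattern")
unmatched=$(printf '%s\n' "$urls" | grep -E -v "$pattern")
echo "matched: $matched"
echo "not matched: $unmatched"
```

Leaving repository metadata outside the pattern means it falls through to whatever later refresh_pattern rules apply.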

Boom! 14 Seconds for a Full Refresh and Maven Build!

This is where you want to get to. However, this is a pretty brutal caching override and I don’t recommend these unqualified caching settings. It does achieve an important goal: allowing a full container rebuild while still getting the bandwidth reduction and speedup of an intermediate private caching proxy, in my case on my development machine. A full review of Squid’s refresh_pattern options should be done to ensure that your cache does not become horribly stale. Some packages can be cached for eternity as they are versioned. Other files, like repo metadata, might need to be refreshed each time, because otherwise you may miss important security patches or version upgrades. Consider this wisely.

Cleaning cache

To drop the cache, Squid must be shut down, so if you have an HA requirement it is best to do a rolling refresh. We can delete the swap directory contents, but we need to rebuild the ssl_db before starting Squid again.

$ sudo -E systemctl stop squid-${VERSION}
$ sudo -E rm -rf /opt/squid-${VERSION}/var/swap/*
$ sudo -Eu proxy /opt/squid-${VERSION}/lib/security_file_certgen -c \
    -s /opt/squid-${VERSION}/var/swap/ssl_db -M 16MB
$ sudo -E systemctl start squid-${VERSION}
$ time docker build --no-cache --tag mvn-proxy:0.1 . #initial store
...
real 2m20.783s
...
$ time docker build --no-cache --tag mvn-proxy:0.1 . #no check
...
real 0m14.489s
...
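
To put a number on the difference between the two runs, the `real` values can be converted to seconds and divided. A minimal sketch, assuming the NmS.SSSs format that bash’s time prints; the function name is made up for illustration.

```shell
# Convert a `time` value such as 2m20.783s into seconds.
to_seconds() {
  printf '%s\n' "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

cold=$(to_seconds 2m20.783s)   # first build after the cache was emptied
warm=$(to_seconds 0m14.489s)   # second build, served from Squid's cache

# Ratio of cold to warm build time, to one decimal place.
speedup=$(awk -v cold="$cold" -v warm="$warm" 'BEGIN { printf "%.1f", cold / warm }')
echo "cold=${cold}s warm=${warm}s speedup=${speedup}x"
```

For the two timings above this works out to a little under a tenfold speedup.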

Outcome

Hope you enjoyed this. Squid is the most full-featured open source implementation of the HTTP caching standard; there are UIs that sit over the top and many other parts to it, but what you’ve seen is a subset that can help you as a developer speed up your development cycles. Happy coding!