Migrating a SBT project to Bazel.

I’ve been working today on migrating a SBT project to Bazel. I’ve taken a few wrong turns, and I’ll document them later, but this will be my working doc and I’ll add some failures to the end.

Two major components – Bazel’s generate_workspace tool, and SBT’s make-pom command. You’ll create a POM file with the dependencies and repos.

ted:growth$ sbt make-pom
[warn] Executing in batch mode.
[warn] For better performance, hit [ENTER] to switch to interactive mode, or
[warn] consider launching sbt without any commands, or explicitly passing 'shell'
[info] Loading project definition from /Users/ted/dev/growth/project
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[info] Set current project to growth (in build file:/Users/ted/dev/growth/)
[warn] Multiple resolvers having different access mechanism configured with same name 'Artifactory-lib'. To avoid conflict, Remove duplicate project resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
[info] Wrote /Users/ted/dev/growth/target/scala-2.11/growth_2.11-resurrector-9449dfb1de3b816c5fd74c4948f16496b38952ab.pom
[success] Total time: 5 s, completed Jun 14, 2017 4:00:17 PM

This generates a pom file, but not exactly as generate_workspace wants it. It requires a directory with a pom.xml, so go ahead and turn that into one by making a tempdir and copying the file to it TMPDIR="$(mktemp -d)"; cp /Users/ted/dev/growth/target/scala-2.11/growth_2.11-resurrector-9449dfb1de3b816c5fd74c4948f16496b38952ab.pom "${TMPDIR}/pom.xml"

Next, build

So, on to the failures:
I initially tried to do my own workspace code generation. I took the output of sbt libraryDependencies and turned it into mvn_jar stanzas via script. This didn’t work, for the simple reason that I wasn’t doing it transitively, they mention that in the generate_workspace docs. I also tried specifying that list of deps as a big list of –archive stanzas; That turned out to be a mistake, mostly because of alternate repos. I also had to clean out a broken SBT set of repos; bazel does not play well with repeated repo definitions, while SBT is happy to ignore them.

Security

The big companies I’ve worked at have all had been using security policies. The small companies haven’t. Frequently, all access to production machines have been controlled by a single shared ssh key. This sucks, but is inevitable, given the lack of time to spend on tooling. However, there are some low-cost toolings to make this better.

The basic developer workflow has been – Type in a command, which will generate a SSH certificate, then ask you for your password and u2f auth, and it’ll talk to the central signing server and get that cert signed. This is surprisingly doable for a small org – BLESS and CURSE are two alternatives.

For myself, though, the right thing to do is run ssh-agent. ssh-agent allows you to keep your keys in memory, and can support several keys. It also allows for forwarding the auth socket to a remote host – So if you need to ssh through a bastion host, you don’t have to copy your SSH key to the bastion machine, it can live on your local drive and all authentication requests can go through it. ssh -A enables this forwarding.

The other problem I’ve encountered a few times is that I want to share my ssh-agent across several terminals. This can be a blessing or a curse, but on most of my machines I only have one or two keys, and while I want them encrypted at-rest I don’t care if they’re loaded in memory a bunch. I’ve written the shell script that does this a bunch, and I today asked myself why it’s not in the default ssh toolkit (like ssh-copy-id). Well, it’s not, but there is a tool that does what I’m looking for: Keychain, not to be confused with the OSX tool of the same name. Though, to my surprise, OSX *already has this functionality*; My default terminal opens up with an SSH_AUTH_SOCK already populated, and it’s managed by the system. That’s pretty cool.

Annotated git config.

[push]
# Much saner than the old behavior, and new default.
default = simple
[user]
# Duh.
email = thahn@tcbtech.com
name = Ted Hahn
# Corresponsed to my signing key.
signingkey = 1CA0948A
[pull]
# When pulling, rebase my feature branches on top of what they’ve just pulled.
rebase = true
[commit]
# Sign all commits
gpgsign = true

Bash tips.

Here’s some things you should start most bash scripts with:

#!/bin/bash

set -e
set -x
set -o pipefail
set -u


TMPDIR=$(mktemp -d)
trap 'rm -rf $TMPDIR' EXIT

Explanations of the lines:

#!/bin/bash

The shebang line is a unix convention that allows scripts to specify their interpreter. Since this is a bash script, we tell it to run this file with bash.

set -e

Exit immediately if any command fails. Makes it easy to spot when a script did not complete, and prevents things further down the line from doing the wrong thing because they were only partially setup.

set -x

Print each command as it’s run. It’s fantastically useful debug output, though some production scripts should have this disabled.

set -o pipefail

Exit with failure if any substage of a pipeline fails. This is about commands chained together with a pipe; e.g. If your grep command fails, the execution will fail, rather than simply outputting nothing to the next stage of the pipeline.

set -u

Makes referencing unset variables an error.

Further explaination of the above three can be found in the Bash Reference Manual entry on Set.

TMPDIR=$(mktemp -d)
trap 'rm -rf $TMPDIR' EXIT

Create a scratch dir, automatically delete it when you’re done. It’s often useful to comment out the trap line during debugging.

See also Pixelbeat’s blog on Common shell script mistakes

Symlinks are (not) hard.

I’ve got two amusing anecdotes related to symlinks. By amusing anecdotes, I of course mean incredibly frustrating weird behaviors that took hours to debug. One java, one chef.

Chef

Chef handles environments very well… except when it comes to databags. From my perspective, this is a critical flaw, since the things I want to keep out of the main chef repo (API keys and passwords) are also the things most likely to be affected by the environment. So,  when building, we specify the path to the chef databags, separating out the prod, canary, and dev environments.

For the parts that are common between the databags, I figured I’d use symlinks. Our databags are stored in a git repo, and git interprets symlinks correctly. The full set of databags were copied everywhere, so I could simply include a relative symlink to ../../prod/foo/bar.json for each databag I wanted consistent.  I got the following error:

syntax error, unexpected end-of-input

pointing to a character in the middle of the first line in the file. This made no sense.

It took me several tries with different files to figure out what was going on. The character that was being pointed out, x, was the same as the number of characters in the symlink path. A symlink is sorta just a text file with a pathname and a special flag on it. If you stat the symlink file, you’ll get the length of that pathname, not the size of the file it points to. What Chef seems to be doing is stat-ing that file, then taking that length as gospel – It doesn’t process it as a stream, but as a block of the stat’d size.

I should probably get around to testing that with the latest version and writing a bug.

Java

Java has a really simple package deployment mechanism: JARs. You can put a bunch of classes into a jar, and deploy them as one. If you have a project with a bunch of dependencies, you can ‘shade’ your jar and wrap all your classes into a single mono-jar.

However, for some use cases it’s not that simple. Java up to 1.7 simply won’t accept more than INT_16_MAX class files in a jar (and remember that anonymous classes are a separate file). Further, signatures can’t be retained; A jar has a signing key attached, and all files must be signed using that same signing key, so a ‘shaded’ jar can’t include the original signatures of dependencies.

So, since monolithic jars don’t work in some cases, what do you do instead? You ship several jars. It’s well documented but not well understood that when you specify a jar with java -jar that your classpath is ignored. How do you load multiple jars, then?

Inside the jar is a META-INF folder containing a MANIFEST.MF file. This manifest file contains a bunch of key-value pairs, and one of those keys can be Class-Path. This class-path key can specify additional jars or directories, and it usually will. However, because of deployment concerns, it will generally list them as relative paths or just as filenames. How does java find those files?

In about the worst way possible. Java will dereference any symlinks in the jar it is loading, then search the base directory of the final file it reads for the class-path includes. So, if you have a bunch of projects with common includes, you cannot simply symlink in all your dependency jars; You need hard copies of every jar you include. This also means you can’t simply update a dependency jar in one place, you have to hard-link it in to the working directory of every app you want to deploy.

I guess an option is to simply have a big folder full of all the jars for all the apps you want to run, but that folder can get very cluttered, and it becomes unclear what’s there why – is one of your dependencies shared? Do you have a garbage-collection mechanism for older jars in that folder?

On Monorepos vs Project repos

I’ve seen some talk about whether to keep everything in one codebase, vs having per-project repositories. The answer is very clear to me: Monolithic repos are a must, but Git submodules are functionally equivalent (as I’ll describe later); You should start with one repo, and then subdivide when you have clear submodules.

The importance of monolithic repos vs per project is not about performance, or even directly about organization of your code. Both are fairly clear. It’s about organization of your build system. Good builds are fully deterministic and idempotent, and that is very hard to achieve with a set of per-project repositories.

The unix standard has a very good layout for where code goes. Shared libraries go in /lib, /usr/lib, /usr/local/lib; Headers go in /usr/include or /usr/local/include. But this isn’t a structure for how to organize your code when writing it; It’s a build environment structure. It makes more sense when your entire organization is sharing just one unix machine and environment, but we’re well beyond that.

Because you’re aiming for determinism, you need to be sure that your build environment is the same each time. The traditional unix file structure is not good for this purpose, at least, not directly. That structure is not rebuilt every time; You copy files on top of it, you add and remove, but you don’t reset. You can accomplish a reset – Reimaging your build machine, or using a docker container. And, in fact, the latter is what many people have switched to. But that docker container is another layer of abstraction, another piece you need to manage for full determinism. You need to kill and rebuild it each build, and most solutions don’t. The run new builds in the same container repeatedly, creating uncertainty.

This is why I like Bazel so much. It removes much of the uncertainty, by rebuilding from scratch each time. It has a separate, well defined environment, that it manages and assures is in the correct state. It’s not magic; You can change things, break things, and fool it. But if you don’t touch it, it does *the right thing* and keeps your structure clean, without taking any risks of breaking your whole machine.

Bazel operates on a concept called a ‘workspace’. There’s not a whole lot to it; You pick an arbitrary root directory, and a flag file defines the workspace root. Everything underneath is considered one logical unit. If you’ve got a monolithic codebase, this is a no-brainer.

Git submodules complicate builds a little, but not much. Instead of saying “this build was built at commit A”, you need to know that it was built at commits (A, B, C). But you probably don’t really care to always know the full set of A, B, C; They may move somewhat independently, but your build infrastructure can and should simply serialize them; Keeping a mapping of simple, linear commit numbers to a tuple of commit hashes for each submodule. There’s one disadvantage – It makes ‘who broke the build’ a race condition, if two modules change at the same time. That is solved by a simple answer: Suck it up, both commiters should debug 🙂

When is a cmake not a cmake?

I was trying to install some software from source today. After dicking around for a full day last week trying to port it’s build chain to Bazel, I decided to just install the build toolchain it expects.

I install Cmake, and try to build my program. No dice! This source file requires cmake version 2.8.12, and CentOS 7 comes with 2.8.11. Ergh. So, I go back to source, and grab and make the latest cmake. A quick make, make install, and I go back to the original directory and… get the same error message?

/cmake .
CMake Error at CMakeLists.txt:1 (cmake_minimum_required):
 CMake 2.8.12 or higher is required. You are running version 2.8.11

What can this be? Did it not install? I make clean, make install again, and this time I search for where it’s installing. /usr/local/bin/cmake ; It seems to work. It’s in path. I run ‘which cmake’ and it tells me ” /usr/local/bin/cmake”, as I expect. I run directly from that path, and check the version; 3.2.2, as I expected. What gives? Why does simply running it not give me the expected results? I throw strace at it; The top line of strace shows the correct path:

execve("/usr/local/bin/cmake", ["cmake"], [/* 26 vars */]) = 0

At this point, I’m fairly stumped. Clearly, the old executable still exists, so I search my path for it; The old version of cmake is in /usr/bin. I then wonder at what could be causing the redirection, so I launch a new terminal, type cmake –version… and I get the correct result! Launching bash as a subshell also gives me what I want. So, what’s wrong with my initial shell?

At this point, I was fairly stumped, so I asked my dad, an old unix hand, and he pointed me to the ‘hash’ builtin. Bash keeps a dictionary of commands to their full pathnames; Presumably for speed, as doing several directory listings for each command would be wasteful, even with all of the PATH directories in ram.

The solution is simply to call ‘hash cname’ or ‘hash -r’ to cause bash to redo the dictionary entry for cname, or for everything, respectively.

Make tmux sessions behave like screen

I’ve been using tmux for a while, because it’s got a couple cool advantages over screen, mostly in the areas of “readable code” and “not locking up”. What it fails at, however, is having an interface like screen; What I really liked about screen is that I could have a dozen terminal windows open and use screen as a tiling window manager; I could swap between windows at will with keypresses, having some windows in multiple locations, others nowhere to be seen, and re-arrange how I had with button-presses.

 

The functionality can be replicated with a little known feature called “grouping”. When you create a new session, you can specify the -t command to “group” it with an existing one. This session-group will share all windows, but unlike having just one attached session, switching which window is active does not switch it in all.

 

I also wrote a quick script to automatically look for an existing session, and, if none are detached, create a new one.


tmux ls |grep -v attached |grep -v daemon
if [ $? -ne 0 ]; then
tmux new-session -t master ;
else
tmux attach;
fi

The Monster at the End of this Book

What makes this a children’s classic? The presence of the cute, loveable Grover doesn’t hurt, nor does the fact that the concept is so simply and cleanly executed. But what makes it subversive?

It’s could be considered a toddler’s introduction to the fourth wall. You don’t need to understand what the fourth wall is to understand that, despite it being a static book, Grover is addressing you, the reader, in a way that’s fairly uncommonly used. It’s the encouragment, nay, the imperative to break the rules – on one hand, Grover tells you not to turn the page. On the other, the adult is turning the page – And what happens at the end is rewarding. The contradiction between authorities, and the boundary shattering nature of the journey you go with Grover on, make this a vital book in the library of a subversive child.

Subversion

While many attempts have been made to define it; nobody’s ever been able to quantify intelligence; Reduce it down to core ideals. I can’t do that either; but I can speak to a related ideal – Subversiveness. Here’s the reading list: More to come.

Reading list –

The Monster at the End of this book

Dr. Seuss

Maurice Sendak

Ender’s Game

The Diamond Age

Children’s shows that are inexplicable liked by adult men.

Ted's Excellent Adventure.