Thursday, October 02, 2014

So you want to build a Git server?

So I heard you are thinking about creating a Git server.  Trying to cash in on the massive rise of Git perhaps?  Maybe it will be online only like GitHub, or something customers install themselves like Gitorious, or perhaps it will be mobile only!  You of course have a reason why your Git server is going to be better and take the market by storm.  Having been messing around with Git servers for almost seven years, I have installed many different types many times and even started working on one myself, called GitHaven, that I eventually had to abandon.  But my loss is your gain!  Below is the essential wisdom I have gained that hopefully you can use to make your Git server better.

Beyond that one killer feature that started you down the road of making a Git server, you will need to implement a fair number of other things, which fall mostly under three categories: backend, frontend, and bonus.

The backend is the bits that accept connections for the git fetch/push commands and provide an API into the data and the system.  It does authentication and authorization.  It is of course completely separate from the frontend, for many good architectural reasons.  Beyond the basics you can add multiple authentication schemes, hooking into ldap, scaling ssh, and a rich set of commands that users can run.

The frontend these days includes at least a web frontend.  At its core this is all about either browsing the files of a branch/tag/etc or viewing a log of commits.  Beyond basic viewing the sky is the limit: search, viewing the log through any number of filters such as author, diff tools, etc.  After that comes basic web administration: user accounts, user settings, ssh key management, gravatar, password reset, and the same again for the various repository settings.  Then you hit upon richer version control features like forking, pull requests, and comments.  Note that even forking and pull requests impose a workflow upon your customers, and as you create richer tools they will be used by a smaller audience, so choose wisely.

Bonus covers everything not related to Git, typically much more project oriented.  This includes stuff like wikis, issue trackers, release files, and my personal favorite, the equivalent of GitHub Pages.

I have mentioned just a few of the features that a Git server could have and there are countless more, but what is really, truly important for a Git server to have?  What makes a Git server?  Is it a built in wiki?  Are customers buying your Git server because of the wiki, with repository management as a bonus?  Okay, maybe it isn't a wiki.  What about issue trackers?  An issue tracker is probably bolted on and nowhere near as powerful as something like Jira, so perhaps not.  Again it might get used, but they won't be buying the product because of it.  So maybe picking on everything in Bonus is easy.

Maybe a better question to ask is this: what feature, if it were missing, would have the customer seriously consider switching to another Git server because it stopped them from doing their job?  What if you provided no web frontend at all and the server could only be used through api calls or ssh, like Gitolite?  Sure, many customers might be annoyed, but they could get the job done knowing they could clone, work, and share changes.  One reason the frontend is separated from the backend is that even if the frontend goes down for some reason, developers can still "do work".  So what is the backend providing that is so central to a Git server?

  Mentioned above are authentication and authorization.  If the Git server didn't have those built into its core, would that matter?  Take authentication: if a customer requires you to hook into ldap and your server doesn't support it, they will look elsewhere; on the web this might mean authenticating through a 3rd party.  At the bare minimum, if anyone could do anything, almost no one would want to use it.

  What about authorization?  Authorization is the ability to say yes or no to what a user wants to do.  At the bare minimum there is a single check, and if it returns true the user can do whatever they want.  The average depth of permissions most Git servers have is that each repository can have a list of users that are allowed to push to it.  A pretty basic safeguard is preventing force pushes.  A good example is Atlassian's Git server Stash, which out of the box doesn't provide any way to prevent users from force pushing to a branch.  In the age of automated deployments an accidental push could result in downtime, maybe data loss or worse if the wrong branch is force pushed to.  Stash leaves it to the 3rd party market (only a few clicks away) to provide a plugin which prevents force pushing (on every branch, no configuration), so it isn't as bad as it first sounds.  On the other end of the spectrum is Gitolite, which lets you create user groups, project permissions, repository permissions, branch permissions, cascading permissions, read, write, force push, create, and more.  It even lets you set permissions based upon what file you are modifying; it is very rich and powerful.  What sound like edge cases, such as a build server that should only be allowed to create tags starting with release-* or an encrypted password file that should only be modified by a single user, are very common.
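
To make that concrete, here is a minimal sketch of two of those rules written as a Git update hook in Node.  The GIT_PUSH_USER variable and the build-server user name are made up for illustration; a real server would pass the authenticated user in however it likes.

#!/usr/bin/env node
// Sketch of an "update" hook: Git invokes it with <refname> <old-sha> <new-sha>.
const { spawnSync } = require('child_process');

const [refname, oldrev, newrev] = process.argv.slice(2);
// Hypothetical: assume the server exports the authenticated user for us.
const user = process.env.GIT_PUSH_USER || 'unknown';

function isFastForward(oldSha, newSha) {
  const zeros = '0'.repeat(40);
  if (oldSha === zeros) return true; // brand new ref, nothing to rewrite
  // A push is a fast-forward when the old tip is an ancestor of the new tip.
  return spawnSync('git', ['merge-base', '--is-ancestor', oldSha, newSha]).status === 0;
}

// Rule 1: nobody may force push to master.
if (refname === 'refs/heads/master' && !isFastForward(oldrev, newrev)) {
  console.error('rejected: force pushing to master is not allowed');
  process.exit(1);
}

// Rule 2: the build server may only create tags named release-*.
if (user === 'build-server' &&
    refname.startsWith('refs/tags/') &&
    !refname.startsWith('refs/tags/release-')) {
  console.error('rejected: build-server may only create release-* tags');
  process.exit(1);
}

process.exit(0);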

  Very closely related to authentication and authorization is having an audit trail.  At some point something is going to be configured incorrectly, someone is going to do something bad, and the users will demand the logs so they can find out what went wrong (and by whom), undo the behavior, and prevent it from happening again (such as someone force pushing to master).  If you don't have logs and can't prevent force pushing to master, they might look at you funny and then start looking for a new place to host their code.

  The third thing that the core of any Git server provides is the ability for users to add new commands.  This is central to Git itself with its hooks system.  The classic server side hook is the email hook, where users can configure the server to send mail to some list whenever new commits are added to a branch.  Providing a way to add new hooks cleanly into the system is definitely something you want to do, but there is one hook that should be included from day one: the WebHook.  The web hook is a very simple hook that users can configure which says: when something changes in a repository, send a POST to some url.  That url points to something the customer owns, something they can get working in no time flat with no access to the Git server.  They don't need to learn your api, write a hook in your language of choice, or do anything that is a hassle (like putting in a change request with an IT admin to install a plugin).  They configure the url in the web interface and build their own tool that runs on their own box.  The best part is that because it doesn't run on the server, security isn't an issue and it can't take down the server no matter how many web hooks are enabled.
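
The customer's side can be as small as this sketch of a receiver built on Node's built-in http module.  The payload fields (repository, ref, commits) are just assumptions; every server defines its own format.

// Tiny web hook receiver: the Git server POSTs to it whenever a repository changes.
const http = require('http');

http.createServer((req, res) => {
  let body = '';
  req.on('data', chunk => { body += chunk; });
  req.on('end', () => {
    const event = JSON.parse(body || '{}');
    // React however you like: kick off a build, email the team, update a dashboard...
    console.log('push to', event.repository, 'on', event.ref,
                'with', (event.commits || []).length, 'commits');
    res.writeHead(200);
    res.end('ok');
  });
}).listen(8000, () => console.log('web hook receiver listening on port 8000'));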

So what is at the heart of a Git server?  Authentication, authorization, and extensibility.  Whatever type of product you create, you need to absolutely own those, because they are central to the success of your product.  You need to own the code, not leave it to a 3rd party, and these should be the fundamental building blocks of the product.  Maybe the rich extension system you have planned isn't ready yet, but the WebHook should be there.  Maybe the permissions model is in place but the frontend doesn't expose it yet; a checkbox to block force pushes to master should still be there on day one, proving that it works.  A lack of features in these three areas is the biggest reason customers will turn away out of the box, not because they disagree with some design choice, but because they flat out can't use your product; on the flip side, if you fully support these three you will find customers migrating to your product.

Extra: Git users love to use hard drive space, especially enterprise customers.  They will migrate an old CVS repository into a single 10GB Git repository without thinking much about it.  The disk usage will only grow and the system must proactively monitor and support this.  Using authorization to limit disk space, notifying users when a push isn't normal (say, accidentally committed object files), and letting users see their own disk usage is one method, but it only slows the growth; supporting scaling of the number and total size of repositories is an absolute must.  Even with a small number of users they will find a way to use all of the available space.

Monday, June 30, 2014

Evil Hangman and functional helper functions

Evil hangman is what looks to be an ordinary game of hangman, but even if you cheat by knowing all of the possible words available it can still be a challenge. Try out your skill right now at

The reason the game is evil is that the game never picks a word; every time the user guesses a letter it finds the largest subset of the current words that satisfies the guesses and makes that the new list of current words.  For example, if there are only the following six words and the user guesses 'e', the program will divide the words into the following three answer groups and then pick _ee_ because it is the largest group.  This continues until the user is out of guesses or the group size is 1, in which case the user has won.

_ee_ : beer been teen
__e_ : then area
____ : rats
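
In code, that grouping step is small.  Here is a sketch in plain JavaScript (the original uses underscorejs helpers like groupBy and max; the function name below is mine, and a full game would build the pattern from every guess so far, not just the latest one):

function nextWordList(words, guess) {
  // Key each word by the pattern the player would see for this guess,
  // e.g. "beer" with guess "e" becomes "_ee_".
  const groups = words.reduce((acc, word) => {
    const pattern = word.split('').map(c => (c === guess ? c : '_')).join('');
    (acc[pattern] = acc[pattern] || []).push(word);
    return acc;
  }, {});
  // Keep the largest group so the player learns as little as possible.
  return Object.values(groups).reduce((a, b) => (b.length > a.length ? b : a));
}

console.log(nextWordList(['beer', 'been', 'teen', 'then', 'area', 'rats'], 'e'));
// -> [ 'beer', 'been', 'teen' ]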

A few months ago I saw a version of this written in C, taking up hundreds of lines of code; while it was efficient, it was difficult to read and modify.  With only 127,141 words in the entire dictionary file, many of the complex optimizations for memory, data structures, and algorithms are silly when running on any modern hardware (including a smartphone).  The code instead should concentrate on correctness, ease of development, and maintainability.  Using JavaScript primitives combined with the underscorejs library, the main meat of the program fits neatly in just 24 lines, including blank lines.  Using map, groupBy, max and other similar functional helpers I replaced dozens of lines of code with just a handful of very concise lines.

For a long time most of my projects were coded in C++ using the STL (or similar libraries) for my collections.  I had a growing sense of unhappiness with how I was writing code.  Between for loops sprinkled everywhere and the way the STL doesn't include convenience functions such as append(), my code might be familiar to another STL developer, but the intent of the code was always harder to determine.  As I played around with Clojure I understood the value of map/filter/reduce, but didn't make the connection to how I could use it in C++.  It wasn't until I started working on a project that was written in C# and learned about LINQ that it all came together.  So many of the for loops I had written in the past were doing a map/filter/reduce operation, but in many lines compared to the one or two lines of C#.

When the site launched I tried to solve as many problems as I could using JavaScript's built-in map, filter, and reduce capabilities.  I discovered that I could solve the problems faster and the resulting code was easier to read.  Even limiting yourself to just map, filter, and reduce, and ignoring other functions like range, some, last, and pluck, dramatically changes the ease with which others can read your code.  The intent of your code is much more visible.  Given the problem of "encrypting" a paragraph of text in pig latin, here are two solutions:
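
(Sketched here with a simplified pig latin rule, moving the first letter of each word to the end and appending "ay", rather than the original snippets.)

// Imperative version: a for loop with bookkeeping variables.
function pigLatinLoop(paragraph) {
  var words = paragraph.split(' ');
  var result = '';
  for (var i = 0; i < words.length; i++) {
    var word = words[i];
    var converted = word.slice(1) + word.charAt(0) + 'ay';
    result += converted;
    if (i !== words.length - 1) result += ' ';
  }
  return result;
}

// Chained version: split the paragraph, map each word, join back together.
function pigLatinMap(paragraph) {
  return paragraph
    .split(' ')
    .map(word => word.slice(1) + word.charAt(0) + 'ay')
    .join(' ');
}

console.log(pigLatinMap('functional helpers make intent visible'));
// -> "unctionalfay elpershay akemay ntentiay isiblevay"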

Using chaining and map it is clear that the second solution does three things: splitting the paragraph into words, doing something with each word, and combining them back together.  A reader doesn't need to understand how each word is being manipulated to understand what the function is doing.  The first solution is more difficult to reason about, leaks variables outside of the for loop scope, and is much easier to have a bug in.  Even if you only think of map, filter, and reduce as specialized for loops, they increase a developer's vocabulary, and on seeing a filter() you instantly know what the code will be doing, whereas with a for loop you must parse the entire thing to be sure.  Using these functions removes a whole class of issues where the intent is easily hidden in a for loop that goes from 0 - n, 1 - n, or n - 0 rather than the common case of 0 - (n-1), not to mention bugs stemming from the same variables being reused across multiple loops.

Functional style helper functions in non functional languages are not new, but historically they haven't been the easiest to use, and most developers were taught procedural style for loops.  It could just be the Baader-Meinhof phenomenon, but it does seem to be a pattern that has been growing over the last decade: new languages supporting anonymous functions out of the box, JavaScript getting built in helper functions, and even C++ gaining anonymous functions in C++11.  Given the rise of projects like underscorejs, and the fact that Dollar.swift was created so shortly after Swift was announced, I fully expect that code following this style will continue to grow in the future.

Thursday, March 06, 2014

How to stop leaking your internal sites to Gravatar, while still using it.

Gravatar provides the ability for users to link an avatar to one or more email addresses, and any website that wants to display user avatars can use Gravatar.  This includes not just public websites, but internal corporate websites and other private websites.  When viewing a private website, even one served over ssl, the browser will send a request to Gravatar that includes a referer header, which can leak information to Gravatar.

When you view the commits of a repository on GitHub such as this one, you will see a Gravatar image next to each commit.  In Chrome, if you open up the inspector and view the Network headers for the images, you will see, along with other things, that it is sending the header:
  1. Referer:

Over the past decade urls have, for the better, gained more meaning, but this can result in insider information leaking through the referer header to a 3rd party.  What if you were working for Apple and running a copy of GitHub internally?  It might not be so good to be sending those urls out to Gravatar.  Even private repositories on GitHub are leaking information.  If your repository is private, but you have ever browsed its files on GitHub, you have leaked the directory structure to Gravatar.

While it seems to be common knowledge that you don't use 3rd party tools like Google analytics on an internal corporate website, Gravatar images seem to slip by.  Besides outright blocking, one simple solution (of many, no doubt) I have found is to make a simple proxy that strips the referer header and then point Gravatar traffic at that machine.  For Apache that would look like the following:

<VirtualHost *:8080>
    RequestHeader unset Referer
    RequestHeader unset User-Agent
    SSLProxyEngine On
    ProxyPass        /avatar https://secure.gravatar.com/avatar
    ProxyPassReverse /avatar https://secure.gravatar.com/avatar
</VirtualHost>

Edit: This post has spawned a number of emails, so I want to clarify my setup:

Using Chrome version 33 I browsed to a server running Apache set up with ssl (the url would look like https://…) and on that page there was a single image tag like so:

<img src="">

When fetching the image, Chrome will send a referer header containing that page's url to Gravatar.

While Chrome's inspector says it sends the header, just to be sure it wasn't stripped right before the fetch I set up a fake Gravatar server with ssl that dumped the headers it received, pointed the page at it, and found that, as expected, the referer header was indeed being sent.

For all of those who told me to go look at the spec, I would recommend that you read it too: rfc2616#section-15.1.3 only talks about secure to insecure connections, which is not the case we have here.

Clients SHOULD NOT include a Referer header field in a (non-secure) HTTP request if the referring page was transferred with a secure protocol.

Thursday, January 23, 2014

Ben's Law

When every developer is committing to the same branch, the odds that a commit will break the build increase as more developers contribute to the project.

Sunday, December 15, 2013

Tragedy of the commons in software

Unowned resources that are shared in software seem to inevitably end up disorganized.

A few instances of this I have seen include:
  • Shared libraries
  • Shared revision control repositories
  • Shared database
  • Shared folders
  • Shared log files
When different projects are using the same shared resource they often have different needs, goals, rules, and cadences.  The resources themselves usually don't provide a way to split things up cleanly, and one project ends up spilling over into another.  Two simple examples: in a source code repository one group might name their branches release/minor_release while another follows build/*, and in a shared library one project might declare static objects that eat up ram, harming another group that is trying to reduce memory usage.
  The inevitable cleanup of the shared resource grows to become a monumental and bureaucratic task.  Even what seems like the simple question of who maintains/owns something can be a large task, and at least once the answer turns out to be a guy who is no longer with the project (and by the way, it is no longer used).
  Because the resource is shared by many different users it already takes up a decent amount of "stuff" (ram, hd space, bandwidth).  There is an admin team in charge of making sure more "stuff" is added when needed, and the users inevitably take advantage of this; in one extreme example a user decided to check a Visual Studio install into the revision control system (deep within the source, too).  From their perspective they don't really feel the pain that everyone else is suddenly burdened with: an additional 10GB+.
  Some projects run very lean and clean.  They have very strict rules about how things should work and be stored, but there are many more that don't, and as time marches on and new projects are added they end up with cruft all over the place and dependencies across what should have been the project divides.  This abuse of the shared resource ends up hurting all of the projects.  Rules are put in place and, even when there is a good reason, they are difficult to change.
  What seems inevitable is that slowly some projects start breaking off and using a different resource, hopefully one that is not shared, and in the process the old shared resource gets less attention and is unlikely to ever recover as more and more of it becomes unmaintained.  Sadly it is often not one big thing, but many small problems that users put up with until one day they realize that abandoning everything will give them a significant boost.
  The only solution to this problem I am aware of is first to recognize the problem early, and second to have a steward: someone whose job it is to rapidly respond to problems and anticipate new ones before users start to leave.  The steward's job includes occasionally striking down long held rules about the resource when they are found to be harmful, and being the one that causes all the projects pain by forcing a migration.  Only through these actions can the shared resource maintain its viability in the long run.

Wednesday, April 03, 2013

Code analysis reporting tools don't work

Code analysis tools are good at highlighting code defects and technical debt, but it is the point at which the issues are presented to the developer that determines how effective the tool will be at making the code better.  Tools that only generate reports nightly will be orders of magnitude less effective than tools that inform developers of errors before a change is put into the repository.

A few weeks ago I played with a code analysis tool that generates a website showing errors that it found in a codebase.  Like most reporting tools, this one was made to run on a nightly cron job to generate its reports.  Reflecting on my career, I have never seen tools of this type produce more than a small improvement in a project.  After introduction there are a few developers who strive to keep the area they maintain clean, and even smaller pockets of developers who use the tools to raise the quality of their code to the next level, but they are the exception and not the norm.  A scenario I have seen several times over my career is a project that had tools to automatically run unit tests at night.  With this in place you would expect failures to be fixed the next day, but often I saw the failures continue for weeks or months and only get fixed right before a release.  Once the commit is in the repository the developer moves onto another task and considers it done.  You could almost call it a law: before a developer gets a commit into the repository they are willing to move the moon to make the patch right, but after it is in the repository the patch will have to destroy the moon before they will think about revisiting it, and even then they will ask if you want to fix it so they don't have to.  This means that code analysis reporting tools are able to make only a small impact, nowhere near the desired result.

After pondering why the reporting tools do so poorly and how they could be improved to make a bigger impact, I finally figured out what was really nagging at me: these tools were created because our existing processes are failing.  If we could catch the issues sooner it would be cheaper to fix them and it would eliminate a whole class of wasted time.  While you could think about new developer training, better code reviews, mentoring, etc., all of which can be improved, a simpler solution is to move the tool's checks closer to the time when the change is made.

In 2007 I started a project that included local commit hooks with Git.  Anytime I had something that could have been automated, it was added as a hook.  When you modified file foo.cpp it would run foo's unit tests, code style checking, project building, xml validation and more.  This idea was wildly successful and there were only a few times (~six?) in the lifetime of the project that the main branch failed to build on one of the three OS's or had failing unit tests.  More importantly, the quality of the code was kept extremely high throughout the project's lifetime.  In the much larger WebKit project, when you put up a patch for review on the project's Bugzilla, a bot would automatically grab the patch, run a gauntlet of tests against it, and add a comment to the patch when it was done.  Often it was done before the human reviewer even had a chance to look at the patch.  These bots would catch the same technical debt problems as the reporting tools, but because the results were presented at the time of review they were cleaned up right then and there, when it was cheap and easy to do.  Automatically reviewing patches after they are made, but before they go into the main repository, is a very successful way to prevent problems from ever appearing in the code base.
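
As a rough sketch of the idea (not the original project's tooling; the xmllint and clang-format commands below are stand-ins for whatever checks a project actually uses), a local pre-commit hook written in Node could look like this:

#!/usr/bin/env node
// Look at the staged files and run a check appropriate to each file type.
// Save as .git/hooks/pre-commit and make it executable.
const { execSync, spawnSync } = require('child_process');

const staged = execSync('git diff --cached --name-only', { encoding: 'utf8' })
  .split('\n')
  .filter(Boolean);

function run(cmd, args) {
  // Abort the commit the moment any check fails.
  const result = spawnSync(cmd, args, { stdio: 'inherit' });
  if (result.status !== 0) process.exit(1);
}

for (const file of staged) {
  if (file.endsWith('.xml')) run('xmllint', ['--noout', file]);                      // validate xml
  if (file.endsWith('.cpp')) run('clang-format', ['--dry-run', '--Werror', file]);   // style check
}

process.exit(0);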

But why stop at commit time?  Many editors have built in warnings covering everything from code style to verification that the code parses.  A lot has been written about LLVM's improved compiler warnings, and even John Carmack has written about how powerful turning on /analyze is for providing static code analysis at compile time.  Much more could be done in this area to find and present issues to the developer as soon as they create them, or even in real time.

Code analysis reporting tools will always be useful because they can provide a view into legacy code, but for new code a project using error reporting before commit time, with hooks, bots, and editor integration, will actually prevent technical debt and do more for quality than nightly reports ever could.

Wednesday, August 22, 2012

The minimal amount of data needed to lock in users

I recently upgraded to OS X Mountain Lion only to find that RSS support wasn't just moved out of Mail, but out of Safari too.  RSS bookmarks were the only reason I was still using Safari on a daily basis so this removal is forcing me to migrate them somewhere else and in the process stopping my daily usage of Safari.

Stepping back, I realized how crazy it was that I was using Safari to read RSS.  The last five years I have been working on WebKit and browsers; three of those years (until RIM legal killed it) were spent making my very own browser called Arora.  And yet through all those years I still kept using Safari, because the switching cost of the RSS feeds was "too high" (I had a mac around with Safari, so why not just keep using it...).  I even started hacking on a desktop RSS reader at one point.  RSS feeds are not locked into Safari; the Export Bookmarks action is right there in the File menu*, and Safari doesn't keep feed data for more than about a month, so it wasn't even the rss history I cared about, just the urls.

Here is a case of the bare minimum of data lock-in, and yet it was able to keep, for years, a user who writes browsers (including rss feed plugins) and uses a different OS as his primary desktop.  In the past when I thought about data lock-in I thought about databases, custom scripts, iCloud, but with this I realize that the bar is much lower.  It wasn't until they forcefully took away the feature that I sat up in a daze wondering what I was doing, went looking for an alternative, and in the process am going to abandon the application entirely.

Now imagine you are a Windows user and suddenly all your apps don't work on the new Metro ARM-based laptops.  It is probably the needed kick in the pants to sit up and go check out what those macs, web applications, and ipads are all about.  Scary stuff for Microsoft.

* You would think that with Safari RSS users suddenly losing their RSS feeds, apps like NetNewsWire would provide a bookmarks import, but oddly they don't (as of yesterday, when I checked the current version).
