GitLab Outage on 1/31/17: Looking forward

It appears that GitLab made a few mistakes (mostly around backup procedures). They tweeted earlier today that they accidentally deleted production data.

Live notes on the recovery can be found here: https://docs.google.com/document/d/1GCK53YDcBWQveod9kfzW-VCxIABGiryG7_z_6jHdVik/pub

Some noteworthy things (direct quotes):

Problems Encountered

  1. LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage
  2. Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.
  3. SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.
  4. Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers. The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost
  5. The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented
  6. Our backups to S3 apparently don’t work either: the bucket is empty

So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.

As of 2017/02/01 04:00 - rsync progress approximately 56.4% – so yay.
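Item 3 above is the scariest one to me: running pg_dump with the wrong major-version binary fails silently and leaves you with effectively empty files, which also lines up with items 2 and 6. As a rough sketch of the kind of sanity check that catches this class of failure (this is not GitLab's actual tooling; the paths, dump location, and size threshold below are made-up assumptions for illustration), a backup wrapper can compare the pg_dump client version against the cluster's PG_VERSION file before dumping, and refuse to call a few-bytes dump a success afterward:

```python
#!/usr/bin/env python3
"""Minimal sketch of pre/post-flight checks for a pg_dump backup.

Assumptions (not GitLab's setup): the data directory, dump path, and
size threshold are invented for illustration only.
"""
import os
import re
import subprocess
import sys

DATA_DIR = "/var/opt/gitlab/postgresql/data"   # assumed path
DUMP_FILE = "/var/backups/gitlab/db.sql.gz"    # assumed path
MIN_DUMP_BYTES = 1024 * 1024                   # anything smaller is suspicious


def cluster_major_version(data_dir: str) -> str:
    # PG_VERSION holds the major version of the cluster that owns data_dir,
    # e.g. "9.6". If it's missing, we can't safely pick a binary at all.
    with open(os.path.join(data_dir, "PG_VERSION")) as f:
        return f.read().strip()


def pg_dump_major_version() -> str:
    # `pg_dump --version` prints something like "pg_dump (PostgreSQL) 9.6.1".
    out = subprocess.run(["pg_dump", "--version"],
                         capture_output=True, text=True, check=True).stdout
    match = re.search(r"(\d+\.\d+)", out)
    if not match:
        sys.exit(f"could not parse pg_dump version from: {out!r}")
    return match.group(1)


def main() -> None:
    cluster = cluster_major_version(DATA_DIR)
    client = pg_dump_major_version()
    if not client.startswith(cluster):
        # The 9.2-vs-9.6 mismatch from item 3: fail loudly instead of
        # silently producing an unusable dump.
        sys.exit(f"pg_dump {client} does not match cluster {cluster}; aborting")

    # ... run the actual dump here ...

    # Post-flight check for items 2/6: a dump of a few bytes means the backup
    # did not really happen, so alert rather than shipping it to S3.
    size = os.path.getsize(DUMP_FILE)
    if size < MIN_DUMP_BYTES:
        sys.exit(f"dump is only {size} bytes; treating backup as failed")
    print(f"backup looks sane: pg_dump {client}, {size} bytes")


if __name__ == "__main__":
    main()
```

Nothing fancy – the point is just that a backup which is never verified end-to-end (right binary, non-trivial output, a copy actually present in the bucket) is a backup in name only.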

Do we want to continue using GitLab? I know @sunbiz hasn’t been happy with their stability… Do we want to host our own?

EDIT: Update from GitLab:

https://twitter.com/gitlabstatus/status/826662763577618432

I really like their product, but their infrastructure management just sucks… I was hoping that with the additional funding they raised, things would get better, but it doesn’t seem like it. I feel self-hosting is not an option yet; maybe in the future, if we get more infrastructure contributors.

Yeah – take a read through that document… It’s all basic stuff… GitLab is great – and I actually like it… If we did host our own GitLab, I’d need to set up actual monitoring and start actually being on-call… Right now I put out fires if one happens; otherwise I don’t watch things THAT closely… I hated being the sole guy responsible for keeping OpenMRS up and running. I didn’t mind the work – I minded the fact that I didn’t get to detach. I don’t want that happening here.

Well, so long as the evil robots are kept at bay, we have Fort GitHub guarding the crossroads of knowledge.

It seems like it took a major backup failure, one that exposed every single one of their gaps, for them to finally fix their monitoring…

Well, we can be all high-minded about it, but that seems to be the way stuff works in the real world. I think the good thing was the quality of access to the dirty laundry. They didn’t mince words in the status report.

Yes, I really like that they are open in communicating!!

I did appreciate them being humble and OPENLY admitting they screwed up.

Yes, a nice low bow and some small self-inflicted wounds always make the apology go smoother. The bad part is dealing with all the HUB fanboys that are dancing around screaming “I told you so! I told you so!”, even when I…uh, I mean they didn’t really.

You don’t have to go one way or the other… you can use both!

That’s not what she said…


:joy: That was awesome.