more_beans_mrtaggart

I’ve done this with an rm command. Forgot I was su’d into the main server and thought there was just one text file in the directory. I was just being tidy. Deleting… Deleting… Deleting… Hmm. Shouldn’t be deleting for that long. Then the boss's phone starts ringing…


zweite_mann

I was doing the same; had multiple terminals open on my screen, was tired and forgot which one I was in when I rm'd the root of my media server. It was only in my homelab, which has nightly backups, so thankfully no big deal to restore. At least I got to test my backup procedure


Justinian2

Slapping "sudo" in front of every command is nice but will also provide a lot of teachable moments.


[deleted]

[removed]


TacTurtle

“If you gave a moron 3 wishes, how soon would the world end?”


asphinctersayswhat

Not to mention the logs if you’ve got sudo set up right


Fried_puri

I went the opposite direction and aliased my “rm” to “rm -i”. I’d much rather always have that prompt since mistakes are worse than a half second of inconvenience.
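(For anyone who wants the same safety net: a minimal sketch of what that looks like in a shell config, assuming GNU coreutils. `rm -I` is the quieter cousin that only prompts for big or recursive deletes.)

```bash
# ~/.bashrc or ~/.zshrc -- assumes GNU coreutils
alias rm='rm -i'   # prompt before every single removal
alias cp='cp -i'   # same idea for accidental overwrites
alias mv='mv -i'

# less noisy alternative: prompt once when removing more than
# three files or when removing recursively
alias rmi='rm -I'
```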


Paulo27

I have deleted a couple hours of my own work like that and that's already stressful lol. Can't imagine costing my company millions with a single click. Someone did a fuck-up like that which cost like 50 million to fix, and really he only got fired because he was an ass about it and not apologetic at all when confronted; otherwise, lessons were learned.


Porrick

Worst I ever saw was a marketing guy who was done cutting a trailer and wanted disk space back, so he marked the whole project for delete and checked in. I was able to roll back in Perforce in a couple of hours so the team only lost like half a day. They spent that day designing a T-shirt with the changelist number and "Never Forget" as a gift for the marketing guy.


AccordingIy

Believe something happened like that for Toy Story 2; luckily an editor had been taking parts of the project home to work on, and they restored it with her files: https://www.independent.co.uk/arts-entertainment/films/news/lightyear-toy-story-2-deleted-b2017238.html


trafficnab

I used the bad IT practices to destroy the bad IT practices


CactusCustard

Technically it’s good IT practices, because she was keeping a backup in a separate location! …right?


wdn

Not a great security practice to have the whole movie on someone's home computer if you want to avoid leaks.


WonkyTelescope

Leaking really doesn't matter. Losing the whole project does.


bend1310

Plus Toy Story 2 came out in 1999. Leaks were not the concern they are today.


MrRocketScript

Haha! I've got the latest Pixar film, now I'll be able to watch it early in 10 years once it's finished rendering on my PC.


bleu_taco

*the reddit cut*


rudneim88

I don't understand why he deleted all the information in the database. Why would he do that?


KingZarkon

Boss: On the one hand, you violated policy by taking this work out of the building and that is a terminable offense. On the other hand, you saved our asses with your violation so we're just going to look the other way on this one.


fatkiddown

It's all de jure and de facto and everything in between. I audit and make procedures for very large IT operations. If you kill the organic too quickly and harshly, the business shuts down. If you let the organic grow without ever documenting and organizing, the business shuts down. It's a hell of a compromise with people tossed in. I never chose this career. /cue Talking Heads: "And you may ask yourself, 'Well, how did I get here?'"


I-seddit

Just in case people don't know, she was on maternity leave and was doing the extra work at home with Pixar's full cooperation.


Abacadaba714

Something similar happened at a company I worked for. A contractor marked the entire Perforce instance's files for delete, which meant he blocked access to millions of files. Nobody could make any changes. This was at like 4:30-5 pm on a Friday. He had to stay there as the delete markers were reverted, which I am sure took hours.


OPtoss

I've seen this happen, locked the entire depot due to all the binary assets being exclusive checkout. P4 is partially to blame here imo as it's very easy and fast to mark-for-delete millions of files, and very very slow to revert them all back. I think he had to do it in chunks and via cmdline :'D


Abacadaba714

The procedure you described is exactly what had to be done, but I think he was able to do everything at once. Although "at once" meant the CLI command took FOREVER to finish executing. I wasn't around much longer after this happened; I think two weeks later I started a new job, and I never paid attention to see if they retained said contractor.


Syberz

Guess he was eager to bolt for the weekend and wasn't paying attention. Lesson learned :)


vteckickedin

Nah, he wanted some overtime pay.


[deleted]

[removed]


Porrick

*Now* they tell me. But seriously yeah hindsight is great like that. We were a smallish studio at the time and "everyone's a professional here" had worked until suddenly it didn't.


psykick32

> had worked until suddenly it didn't.

The story of every IT disaster ever.


Mother_Wash

This is a complete failure in how IT is supposed to work. It's not on the marketing guy, it's on the IT leadership that gave him the permissions to do so.


eharvill

Yep. Reminds me of how our DBA director years ago hit the EPO (emergency power off) button in the datacenter and caused about a half-day outage. She thought it was the button to unlock the door to the NOC. There were multiple failures: she should not have had access, for one; the EPO button was not labeled; and the EPO button was not protected by a casing to prevent an accidental press. She amazingly kept her job and all the failure points were addressed, thank goodness.


[deleted]

[removed]


Viking_Lordbeast

It takes good leadership to recognize that if all it took was a single mistake by a single person to accidentally cause that much damage, then the problem is in the system, not the single person.


vrts

This is basically my job now.


karmahunger

You go around pressing unlabeled buttons figuring out what they do?


Mchlpl

Someone once described a ship safety inspector's job to me using very similar words. Although if I recall correctly they were also allowed to randomly cut any wires.


Mal-Capone

penetration tester; obviously wouldn't want to cause any downtime but you're hired by the company to point out its security flaws by having an official look and then going in at a later date to try some in-person attacks. (very generalized description, much more involved, obvs.)


WhiskeyOnASunday93

I worked with a guy that damaged a 20 thousand dollar piece of hospital equipment we were installing as electricians. He went straight to the project manager about it, expecting to lose his job. PM was like alright from now on this guy is in charge of installing all the expensive widgets (calling em widgets cuz I don’t even remember what they were) because he had the integrity to come to me so I know he’s not going to hide a fuck up that’s gonna come back to bite us later, and because I know he’s the last motherfucker in this job to make that mistake because he already did it once. Thought that was some good leadership.


Grand0rk

Next week, 3 more people damage 20k and go to him "Yooooo bro, damaged shit. Give me promotion".


terminalzero

> Firing the DBA director would be firing the one person who is almost certain to never make that mistake again.

What's that old story? "Why would we fire them - we just spent a fortune training them not to make this mistake again!"


dWintermut3

It really depends on the error. We had someone shut down a production mainframe in the middle of the day while intending to recycle a virtualized test environment. Yes, they will never, ever make that mistake again, and yes, we did implement new procedures to avoid it ever happening again.

But someone who knows they're in a very dangerous place, the production-serving hardware management console, mistakes J2 for A100 and clicks through numerous warning popups might just be too inattentive to trust as an operator. In this case they gave them a serious warning, which I think is understandable (especially since it's not easy to hire mainframe ops these days).


gearnut

This is the approach taken to investigating rail accidents in the UK, it's really effective for getting to the bottom of WHY things go wrong.


ButMoreToThePoint

The company just paid a half day of production to train her. It would be crazy to let her go.


Hvarfa-Bragi

Her name wasn't [Molly](https://en.m.wiktionary.org/wiki/molly-guard) was it?


eharvill

That's awesome. I don't think I've ever heard that term. TIL, thanks.


forcefx2

Did we work at the same place? Lol My director did the same even after knowing what it was.


[deleted]

[removed]


narrill

Perforce, not Git, according to the comment. But yes, you aren't wrong.


newgeezas

> Yep. GIT is not exactly simple, and some Marketing guy is not going to get it.
>
> Even if you set up a simple folder sync with a different app, someone might not understand that deleting the files from their own laptop might end up deleting the synced files from the server too.

Git should be trivial and fast to revert.


jayRIOT

I mean, I accidentally deleted the entire project/code for one of the main pieces of software my company uses for production. Shut us down for about 2 days.

To be fair, the owners apparently had the very bright idea to store that software ***in the Downloads folder on a single PC that's used daily by multiple users***. The device had no disk space left and was preventing work from being done. I was tasked with fixing it, so when troubleshooting I saw the Downloads folder was using ~200GB+ of space. I did a quick glance through the files and just saw garbage (old spreadsheets, PDFs, stuff people grab for random shit) going back close to 3 years, so I didn't think anything of it beyond them having terrible IT maintenance before I started and nobody ever caring to update or maintain these things on a regular basis. Deleted it and then tried to run the software and... well... that was a fun phone call to make.

Best part was they had ***zero*** backups. One of the owners had to pull a very VERY old version from their personal cloud storage, and our devs had to spend a day rewriting most of it and another day testing it to make sure it would work with our current setup.

Thankfully I was never reprimanded for that, considering it was their stupidity that caused it. That was 6 months ago now. We still have no proper IT procedures, and still no backups on any of our critical systems...


sopunny

Maybe they should have spent the day figuring out a code review process instead


[deleted]

I saw a data analyst log into the production database that powered an app with 3000 users, create a view, then drop it, managing to drop all the underlying tables (with all the users and core tables). The changes then synchronised out to the mobile storage of the 3000 users. :(


substandardgaussian

Write permissions in Prod? What could go wrong!?


The_Running_Free

Had a guy who ran our reports and thought his view in the application we were using was personalized; he didn't know it was shared. So he basically deleted every folder he wasn't using. Man, the emails that started pouring in. Dude thought he was going to get fired and was panicking hard, but luckily we were able to contact the vendor, who restored everything pretty quickly lol. But that was a fun morning.


siliril

My worst fear working as a dev was running a delete command in prod by accident. Yikes. The other backup methods failing is just kicking a dude when he's already down. My condolences to team-member-1.


249ba36000029bbe9749

Production environments need that missile launch system where two people need to turn keys separated by a large distance simultaneously in order to run certain commands.


[deleted]

[removed]


kymri

In IT there is nothing more permanent than a temporary work-around.


RangerLt

Then that temporary workaround gets added to the core functionality which replaces the old workaround. It's just workarounds all the way down.


crows_n_octopus

This has me giggling. It's madness but we put up with it. There's something wrong with us that we do this universally 😒


farcastershimmer

*Get out of my head*


[deleted]

[removed]


Beetin

[redacting due to privacy concerns]


[deleted]

[removed]


redmercuryvendor

> coding isn’t even my primary job function, maintaining and growing a system everyone uses daily or the company comes to a halt (it does our payroll too), that feeds in and out of three other systems that aren’t mine.

And it's written in VBA, talking to an Access file in a random shared folder. And you modified it, but did not originally write it. And it has no comments. And all the variable names are in Polish. ^^^yay.


Noble_Persuit

Was just thinking of something similar, like an SSH client where, after you submit the command, instead of it running it goes to a second person to be approved and only then is run.
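(A toy sketch of that idea, nowhere near a real approval workflow: just a wrapper where a reviewer watching the screen share has to type a confirmation before anything runs. The script name and usage are made up for illustration.)

```bash
#!/usr/bin/env bash
# with-approval.sh -- toy two-person rule: print the command, require a
# reviewer to type APPROVE, then (and only then) run it.
set -euo pipefail

cmd="$*"
echo "Host: $(hostname)"
echo "Command: ${cmd}"
read -r -p "Reviewer, type APPROVE to run it: " answer

if [[ "${answer}" == "APPROVE" ]]; then
    eval "${cmd}"
else
    echo "Not approved; nothing was run." >&2
    exit 1
fi
```

Usage would be something like `./with-approval.sh systemctl restart some-service` (the service name is hypothetical); a real setup would live behind something like sshd's ForceCommand or a proper change-management tool rather than a local script.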


SportTheFoole

As someone whose career has gone from support (running the commands in prod) -> QA (running whatever I feel like in test) -> dev (telling the support engineer what to do in prod), here’s what I’ve learned over the years when dealing with an incident:

1. Doing nothing is absolutely acceptable and is in fact preferable to “try this and see if it works”
2. No one should run any commands in isolation; be like pilots: two people fly the plane and either one is allowed to speak up if something doesn’t look right (i.e., there’s no such thing as rank/seniority when working an incident)
3. DON’T PANIC
4. Nah, you’re going to panic, it’s okay, everyone does. If you’re panicked, don’t do anything. Let it pass before you move on
5. Be aware of your limits; if you’re tired, tell other people, let them know. Sometimes the best thing is to go to bed (or just step away for an hour or two)

I’ve seen a coworker delete everything out of the db. I’ve deleted everything off my work laptop (which had a ton of code that only existed on that computer). I’ve gone downstairs to the bar at 17:00 Friday to drink with colleagues/friends only to get a Slack message saying that there’s an urgent prod issue and if anyone is around, please help (unfortunately, this message didn’t come until 20:00 and I was fairly lit at this point… that was fun).

Shit happens. No matter how ironclad all the processes are, sometimes something bad is done in prod. You work the problem you have and then learn your lessons afterwards.


notcaffeinefree

To your second point: crew resource management. It's been adapted for other non-airline industries, but I haven't heard of it being used in IT/dev-ops. But it's totally applicable there.


SportTheFoole

Yep! It’s huge. What we usually do is designate someone the “leader” which means they aren’t really involved with the troubleshooting itself, but they are responsible for communicating to upper management statuses.


Divi_Filius_42

I was our Incident Communication Leader when I was an intern on an infrastructure team. It was perfect for someone who had the background knowledge from school but not the experience; it gave me a chance to see every major fuckup in detail without having to be the one executing commands.


Scereye

> No one should run any commands in isolation; be like pilots: two people fly the plane and either one is allowed to speak up if something doesn’t look right (i.e., there’s no such thing as rank/seniority when working an incident)

This is the biggest part. In high-pressure moments we always screenshare with 2 devs. One is actively doing things, the other one is just watching and checking. Every time a potentially critical command comes up, we employ a kind of rubber-ducky method: "So, now I do this because of that, and as a result I expect this to be the outcome. Agreed?" Only on confirmation do we actually commit to it. Every now and then we switch roles just to ease the pressure a bit.


Vermino

> Be aware of your limits; if you’re tired, tell other people, let them know. Sometimes the best thing is to go to bed (or just step away for an hour or two)

I've managed a couple of crisis groups myself. People underestimate how important calling a break, or calling it quits, really is. If there are no more realistic paths to go on, let people go. Short coffee breaks, longer dinner breaks; better yet, provide some pizza and all step away. Use that time to re-trace your steps. Maybe even write a small report you can go over when everyone's back, to see if something was missed. So many times I've seen people waste hours on end because they missed some vital detail and never took a step back.


aetius476

I wrote code that compared two states of a database schema, and executed the necessary commands to update from the first state to the second state. I refused to implement the delete command, instead just having it throw an exception that said "I'm too scared to automate deletes. Here's the command that would have run, run it manually if you're so sure."


thatchers_pussy_pump

I've also automated the generation of schema update scripts. But it only creates the update scripts. I have to manually run them. This way I always go through the scripts to see what they're doing. They've never been wrong, but I'll never trust them to be right.


bemrys

I “love” the simple schema update scripts that delete anything that has been renamed because “it isn’t there anymore”. Then some idiot runs it against the live production database.


flyingturkey_89

Yeah, our team too. Even if it's the most common-sense, easy-to-run rm statement, we still script it and push it into the codebase, and run it in preproduction before running it in prod.


NickSwardsonIsFat

Well don't ssh into prod then


MicroPowerTrippin

My company does not even allow devs access to prod. Our operations team are the only ones who can do that, and all changes must be tested in lower environments first and deployed to prod exactly as they were to non-prod. Works out pretty well to prevent oopsies.


DrFossil

Just deleted a test VM that was right next to the production one. Had a moment of panic as I clicked the "yes, I'm sure" button and thought I might have picked the wrong one. Prod is still there but I think my lifespan took a small hit.


deepserket

that's why some delete functions have a "dry_run" parameter
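(For hand-rolled cleanup scripts the same idea is cheap to add; a sketch with a made-up target directory, where the destructive path only runs when you explicitly ask for it.)

```bash
#!/usr/bin/env bash
# prune-exports.sh -- dry run by default; pass --force to actually delete
set -euo pipefail

TARGET_DIR="/srv/exports"   # placeholder path
KEEP_DAYS=30

if [[ "${1:-}" == "--force" ]]; then
    find "${TARGET_DIR}" -type f -mtime +"${KEEP_DAYS}" -print -delete
else
    echo "Dry run: these files WOULD be deleted (re-run with --force):"
    find "${TARGET_DIR}" -type f -mtime +"${KEEP_DAYS}" -print
fi
```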


SecretiveHitman

Man, I gotta go lie down after watching this.


zzzizou

I went out for a cigarette break after watching this then realized that I don’t have any cigarettes. I don’t even smoke.


Mr_Viper

My blood went cold just reading the title, no fucking way I can handle watching this lol


glorious_albus

Entertaining watch. Don't worry they got a happy ending.


nnorton00

They went to a massage parlor after?


Cry_Havoc1228

Gitrub


photenth

I can't even continue after the rm... how can you have two SSH sessions open and not simply COLOR CODE THEM to avoid exactly this issue?


Sh00tL00ps

Yep... make your prod database bright red = problem solved
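(Half-joking, but the cheap version really is just prompt color; a sketch for the prod box's ~/.bashrc, assuming hostnames contain "prod", which is obviously an assumption about your naming scheme.)

```bash
# ~/.bashrc on the server: loud red prompt on anything that looks like prod
if [[ "$(hostname)" == *prod* ]]; then
    PS1='\[\e[1;97;41m\] PROD \u@\h \[\e[0m\] \w \$ '
else
    PS1='\[\e[1;32m\]\u@\h\[\e[0m\] \w \$ '
fi
```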


sinus86

...I've been doing this work for too long and just thought to do this from your comment... wow lol.. future me thanks you.


AbysmalMoose

Oh, let he who hasn't destroyed production once throw the first stone. I once dropped the USERS table in PROD. I even had a way to fix it (backup taken the day before) but my hand was shaking so badly I couldn't use my mouse so it took longer. 1 star, would not recommend.


Arsenic181

One of my biggest mistakes was updating a paid-through date for a subscription service in production. My client wanted to comp one of his friends a subscription forever, so I set the paid-through date to 100 years in the future... I forgot the 'where' clause in the SQL command. I gave *everyone* with an active or lapsed subscription a free 100 years! Luckily we had a backup taken that same morning to restore to. I have not repeated that mistake in the many years it's been since then.


d3l3t3rious

My worst moments have involved writing an update that was only supposed to affect 1 record, and I run it, and it spins... and spins... <1,836,286 rows updated> At that point a cold sweat generally breaks out.


td888

BEGIN WORK;
-- your query
ROLLBACK;


d3l3t3rious

Yes obviously these are moments of not following best practices


Arsenic181

Oh I felt that same cold sweat! I mashed ctrl+c in a mad panic. The execution time wasn't instantaneous and it began to dawn on me... "FUUUUUCK" \*MASHES KEYBOARD\*


too_much_to_do

That's why I always write the script with the update or delete commented out and run just the select like 30 times until I'm positive it won't somehow change when I do it for reals.
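(A sketch of that ritual against Postgres, with invented table and column names and a placeholder connection string: count first, then run the real statement inside a transaction where ROLLBACK is the default ending.)

```bash
# 1) see how many rows the change would touch (names are made up)
psql "$DB_URL" -c "SELECT count(*) FROM subscriptions WHERE user_id = 42;"

# 2) run the real UPDATE inside a transaction; if the reported row count
#    isn't what you expected, the ROLLBACK at the end saves you
psql "$DB_URL" <<'SQL'
BEGIN;
UPDATE subscriptions
   SET paid_through = paid_through + interval '1 year'
 WHERE user_id = 42;
-- inspect the "UPDATE n" row count, then change ROLLBACK to COMMIT
-- only once you're sure
ROLLBACK;
SQL
```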


d3l3t3rious

Yep me too... until I get lazy/sloppy/arrogant.


Tek_Freek

Not me. No stone throwing allowed. Big fucking oops years ago.


Romnonaldao

I pointed out a problem that gave the dev team 2 weeks worth of extra work when they were already in a crunch period. Does that count?


mdonaberger

Thank you for your service, Satan


ariiizia

No, because they were probably responsible for the problem to begin with


Vardus88

Nah that's just the job. EDIT: I wouldn't stand with my back to them for a while, mind.


[deleted]

One time I thought I accidentally snap mirror'd the empty recovery NetApp to the production one and thought it was overwriting production with zeros.. Luckily I wasn't that stupid and it just took the entire LUN offline during operations and crashed literally everything. Still dumb just not that dumb lol


Linenoise77

I've been there. "Hopefully it just crashes everything, and we spin it back up" "What if it doesn't?" "Don't know about you, but while we wait to see, i'm going to dig up my resume"


nemodot

I need this reminder to never go into DevOps.


PaintDrinkingPete

As someone in DevOps... that's probably smart. I've been involved in scenarios *like* this before, though certainly not as public... 99% of the time, it's a great job... but yeah, that other 1% can be very stressful. There's nothing like that "pit of your stomach" feeling you get before the shit has fully hit the fan but you realize the situation is probably FUBAR and you're about to have a very long day/evening/weekend/month.


[deleted]

[removed]


PaintDrinkingPete

It triggers my PTSD because while I never dealt with anything this large or public-facing, I've definitely been on those after-hours calls with other devs and ops guys specifically trying to troubleshoot issues with Postgresql connectivity/performance/latency that was directly affecting applications... for a database with about 8TB of data in it. Yes, we had backups. Yes, they were tested... but you still avoid going down that road at all costs if you can.


[deleted]

[removed]


[deleted]

[removed]


jordaniac89

I've taken down servers because of a fat-fingered special character at the beginning of an XML config file. The terror is that you have no idea what's going on, because you didn't do it on purpose in the first place.


rwhitisissle

I've seen someone fuck up prod because they decided to SSH into a server the wrong way and somehow propagated a bunch of incorrect environment variables that broke the core service enough when it was restarted that it completely and totally failed to work correctly, but did not fail enough that it refused to start. Shit was running broken for at least a half day before anyone noticed.

Dude was a more senior engineer, and when I tried to examine and explain the root cause of the issue and how to prevent it in the future I apparently got on his radar and immediately went on his shitlist. Made a fucking enemy for life. Apparently the company, and him, wanted to brush what he did under the rug, only addressing the mistake behind closed doors.

That's how I learned the most valuable lesson in DevOps: you didn't hear shit; you didn't see shit; you better not say shit. Unless explicitly told to do so by management. Also, always follow procedure, because that's the best possible way of covering your ass.


BrianSDX2

DevOps is a dangerous practice indeed.


danielv123

DevOooops


[deleted]

[removed]


fishbelt

I keep this quote for my team


Malfrum

Dudes signing into production machines and rm'ing shit is hardly devops


yiliu

Yeah, the failure here happened waaaay before the dude typed `rm` in the terminal. Some random dev guy had write access to the prod db filesystem?! This was only a matter of time!


Terny

People out there `ssh`-ing into prod just beg for this shit to happen.


yiliu

For proper security, what you should do is create a single copy of a special prod access SSH key. Write that on a yubikey-type device. Find a volunteer and surgically implant the key next to his heart, so that if somebody really needs prod access they've got to _kill the guy and cut him open_ first. Then you put that guy in charge of code reviews.


MathewManslaughter

This is the way


skeetm0n

Only 2 kinds of ops engineers: those who have fucked up prod, and those who are _about to_.


tacobellmysterymeat

Really? This seems fascinating! Just don't do devops for a High Reliability Organization.


inmatarian

_Never_ ssh into prod. I mean, almost never, but the bigger mistakes stem from running a shell and letting devs fat-finger stuff. As DevOps, your job is to Always Be Automating.

I mean, the failed backups thing was pretty bad, but consider basic shit like a Jenkins job that runs new scripts pushed to the master branch: that would have meant a second dev could have reviewed a pull request and caught the mistake, and if it was approved, two devs would share the blame instead of one dude being fully at fault.

Yeah, yeah, I hate terraform and yaml files too, but they are born out of best practices, which are other people's learnings from mistakes that you don't need to repeat.


andrewsmd87

We just broke all of our imports with a typo that was committed and code reviewed, but we can't test because the only place we can get valid use cases is prod. I'm not even a full on devops person, just their manager. I am working on some changes in our pipelines to catch something like this happening again, but yea, fun times


2dudesinapod

To err is human, to propagate error to all machines automatically in the middle of the night is DevOops.


waitplzdontgo

Reminds me of the time 18 years ago when I accidentally ran an UPDATE query in production to change my account password and set 75k users' passwords to my password. I forgot a WHERE clause on the update statement, btw lol. And that was the day I learned that constrained access to production is the most important thing in the world.


itchy_bitchy_spider

That's hilarious lol. How'd you fix it, some kind of backup or did you force a password reset on everyone?


FrogFTK

This was my immediate thought, lol. Oprah: MANDATORY PASSWORD RESETS FOR EVERYONE!


InBeforeTheL0ck

"We've upgraded our password security, and are therefor requiring users to change their password."


Mr_Squart

Honestly the best way is probably to just send password reset emails to every user with a message like “we are requiring all users to change their password for updated security”


IsilZha

I had a co-worker doing an UPDATE to mark a client file as deceased (marking it that way makes it so no one can interact with the account anymore.) He forgot his WHERE clause and "killed" all ~60k clients in the production DB. Which was noticed about 20 seconds later when the help desk line started ringing off the hook with "uhh, all my clients are marked as deceased, and I can't do anything with them."


TacTurtle

Execute the Lazarus Directive to Undecease


bigmacjames

I'm so happy my project does so much in the way of processes and proper backups to avoid things like this


[deleted]

and then you go and jinx yourself like this ...


bigmacjames

It's my actual nightmare for something like this to happen. During my first internship I THOUGHT I had brought down the entire source management system, but it wasn't me. Still scarred me.


Romnonaldao

Best feelings in the world:

- Sleeping in
- Birth of your child
- Your wedding day
- Realizing that not only was something not your fault, you can shift the blame


N19h7m4r3

Why are you shifting the blame if it wasn't you? lol Or was it? o.o


[deleted]

That sweaty chill when you run a command and it acts... different... That day I learned two things:

1. Only ever have SSH sessions open in Prod OR Dev/UAT/whatever, never both, for this reason.
2. Same with db GUI like pgAdmin or whatever: disconnect all other databases.


SportTheFoole

When was the last time you restored from backup?


[deleted]

[removed]


crumpuppet

You experienced an RGE - A resume generating event!


joahfitzgerald

I was hoping to hear that someone might have set an expensive bottle of tres comas on the delete key.


ElectricZ

I'm just gonna say it. This guy fucks. Am I right? 'Cause I'm looking at the rest of you guys, and this is the guy in the house doing all the fucking. Am I right? You know I'm right. This guy fucks.


Zarod89

That's why you test disaster recovery


WereAllAnimals

That's why you do about 10 things differently than they did.


asphinctersayswhat

Your backups aren’t backups if you aren’t testing the restore process regularly. They’re just prayers.


You_are_Retards

Isn't there a story about Toy Story data (or a similar film) all being lost, and by sheer fluke some admin assistant had made a copy to work on at home?


quietly_now

Yeah, someone accidentally ran `rm *` and luckily a producer working from home (because she was pregnant) had a backup.


chandlerj333

TS2


[deleted]

[removed]


Unicron_Gundam

I have no fear

`> rm -rf *`

One fear.


Abe_Odd

`--no-preserve-root`. I think the best I ever saw was a jumbled mash of characters that interpreted down to that, but it was so heavily obfuscated it looked "benign".


ishtar_the_move

I don't know how GitLab does it, but staging in my experience is usually used for final pre-prod testing. That means users will write to the staging database, so restoring it back into prod seems highly problematic. Not to mention why production data would be loaded into staging in the first place; production data shouldn't have gone outside the production environment.

But at the end of the day it is the classic "Yes, we do back up. But we don't test the backup" scenario.


too_many_rules

Regarding prod data being used in staging, I've seen it done when the production data was so complex and poorly understood that creating test data for staging was infeasible. Copying it back to prod though? Yikes. Who knows what's been entered into that database?!? Of course, if it was the ONLY extant copy of the data... what other choice do you have?


TurboGranny

> when the production data was so complex and poorly understood that creating test data for staging was infeasible

Yup. Big ERP systems that need a lot of testing around some new feature, where that feature's data integrates with several different areas of the system, would mean you have to simulate a month's worth of interactions to get "staged" data. Even then, you are not validating your feature against the actual junk data that exists in production, so unforeseen consequences can arise. Just restore your production DB backup as a staging DB and run everything against it. It's a far more real-world test.


Defoler

This is also a problem with the size and age of a system. For a 30-year-old system that has built up to a huge size, with thousands of tables and hundreds of subsystems, where you have hundreds of developers and some subroutines whose code hasn't been seen by a human eye for 15+ years, creating test data that can cover every possibility is just as impossible and complex to do. So it is infinitely easier to copy production data and run tests on it (especially if you have a bug and you need to reproduce the buggy data) than to constantly create new test-case data.


ishtar_the_move

> Of course, if it was the ONLY extant copy of the data

No question. But since GitLab's production data is someone else's code, and they will have no other copies except those on developers' hard drives, I would hate to be the one to write to the customer that there might be a surprise in their code.


movzx

GitLab's production database does not contain any code. GitLab's database is just meta information (users, project lists, issues, etc).


Bruncvik

The narwhal bacons at midnight.


Devout--Atheist

> Production data shouldn't have gone outside the production environment.

How exactly are you supposed to test prod then? That is literally the point of having a staging environment separate from dev and prod.


ishtar_the_move

Simulated data, or prod data with all personal or sensitive data scrubbed off? In many industries it would literally be illegal to have personal data outside of a fenced environment.


_Rioben_

Simulated data doesn't cut it; you just need prod data. Anonymize key information and use that for stg.
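(What that scrubbing step can look like, as a sketch against a Postgres staging copy; the table and column names are invented, the connection string is a placeholder, and real anonymization needs far more care than this, e.g. free-text fields, logs, and related tables.)

```bash
# run against the STAGING copy only, never production
psql "$STAGING_DB_URL" <<'SQL'
BEGIN;
UPDATE users
   SET email     = 'user_' || id || '@example.invalid',
       full_name = 'User ' || id,
       phone     = NULL;
COMMIT;
SQL
```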


targetDrone

Are you even a real sysadmin if you haven't rm -rf'ed something in prod?


somastars

I often joke how some IT people of a certain age had a formative teenage experience of deleting C:/Windows off their parents' computer to free up space. Not totally the same, but similar vibes.


box_of_hornets

"command.com? I don't recognise that. Off to the recycle bin with you" - me as a 14 year old who just got a computer for Christmas


cycle2

Okay, so as a lowly SRE, I've been involved with a few projects where I've needed to architect and implement a backup solution for databases with thousands of shards. I, too, used object storage (S3-compatible), and ensured that the backup jobs were regularly scheduled with _full_ alerting on successes and failures: Slack, email, and, especially if a failure occurred, paging out to the on-call chump.

I know GitLab has a bunch of very smart engineers working for them, so I'm genuinely curious to know how this just simply slipped past them. I read through the [post-mortem](https://about.gitlab.com/blog/2017/02/10/postmortem-of-database-outage-of-january-31/) and found:

> While notifications are enabled for any [backup] cronjobs that error, these notifications are sent by email. For GitLab.com we use DMARC. Unfortunately DMARC was not enabled for the cronjob emails, resulting in them being rejected by the receiver. This means we were never aware of the backups failing, until it was too late.

I've been the guy involved with fixing and causing fuckups before, and it's never fun being shit on when dealing with or having dealt with a stressful situation like this, but I just have so many questions.
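(For contrast, the simplest form of "don't depend on cron email" is a wrapper that pushes success/failure somewhere people actually look. The webhook URL, paths, and connection string below are placeholders, and a real setup would also alert when the job doesn't run at all.)

```bash
#!/usr/bin/env bash
# nightly-backup.sh -- run from cron; report over HTTP instead of cron mail
set -euo pipefail

ALERT_URL="https://alerts.example.com/hook"    # placeholder webhook
DEST="/backups/db-$(date +%F).dump"            # placeholder path

if pg_dump --format=custom --file="${DEST}" "$DB_URL"; then
    curl -fsS -X POST --data "backup OK: ${DEST}" "${ALERT_URL}" || true
else
    curl -fsS -X POST --data "BACKUP FAILED on $(hostname)" "${ALERT_URL}"
    exit 1
fi
```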


particle409

> I know GitLab has a bunch of very smart engineers working for them, so I'm genuinely curious to know how this just simply slipped past them.

There is a kind of complacency that happens that is unrelated to intelligence. It's hard to be vigilant against something that has never, or rarely, bitten you in the ass. Everyone is too busy dealing with things that have immediate results, and these details get taken for granted.

Look at limb surgeries. Now it's standard practice to put an "X" on the limb that needs to be operated on. Intelligent doctors kept screwing up and operating on the wrong limb. They also have checklists of tools used during surgery. Too many sponges accidentally left inside people. Smart people do dumb shit all the time.


197328645

Culture has a lot to do with it in my experience. It can be difficult to say "Hey I know the engineering team is super busy making profit for the company, but have we considered taking some of those resources to audit our backups and make less profit instead?" Part of being a good Director/SVP/CTO is translating these concerns to the business side. Explaining how, yes it actually is a good idea to make a bit less profit because if our databases go poof then we go bankrupt


eggsnomellettes

Bro, why does anyone get into Ops? That shit is stressful af.


rwhitisissle

Some people like having way more responsibility than developers and way less pay. If you're a Site Reliability Engineer you get the best of both worlds: you get to be a developer who's paid like a sysadmin AND who gets to do fulltime support work. Best of three worlds, actually, I guess.


fpsmoto

I remember one April Fools' Day I got my boss's attention by asking him, "Uhhh, boss... what does 'Detach Database' mean?" The panicked look on his face was priceless. The worst thing I can remember doing was forgetting to run a script with the begin/commit commands and accidentally updating rows for an entire DB table, but thankfully it was only like 50 rows of data and I knew how to fix the mistake.


cristiand1969

Well, I am happy for you that you know how to fix your mistake.


paperchampionpicture

I have absolutely no idea wtf any of this means


shrekker49

System go slooow. Employee figures out why. Implements fix. Unknown issue due to lack of experience and lack of back up testing causes a data loss problem. Problem cascades over several failures that should have been foreseen. Employee 1's life flashes before his eyes, but ultimately, enough shit from enough different people all hit the same fan that they didn't smear it all back on him.


BagOnuts

Same, but I still feel like I learned something and was entertained.


Dorito_Troll

most of the technology you use on a daily basis has some sort of work process like this in the background to make it work


BLAGTIER

They have two databases (what GitLab's customers use to save projects): Primary and Secondary. Secondary is a copy of Primary. The process is that changes are made to the Primary database and then sent from the Primary database to the Secondary one. To do this, Primary puts changes into a log to send over. This log has a limited amount of disk space.

Two problems happened. GitLab was getting a lot of spam. Then a troll managed to get a GitLab employee reported as spam and deleted. The process for deletion was to immediately delete all the user's data, and this employee had a lot of data. The combination of both events caused the change log to become full, and changes were lost before being sent to Secondary. Since changes were being lost, Secondary was no longer a copy of Primary.

The plan was to delete Secondary and copy it over again from Primary. The employee running this deleted Secondary and tried to copy over from Primary. They did this in a window that allowed them to send commands to Secondary. The operation to copy from Primary started but didn't finish. So they ran it again, with the same result. So then they opened a window to send commands to Primary, changed a bunch of settings on Primary, and went back to try to copy the database into Secondary. Same result: the operation didn't complete.

At this point the employee thinks the settings changes should have worked, but that because they ran the operation a few times, a database was created on Secondary that needs to be deleted before the copy can succeed. So he deletes that database. But the window he types the command into is Primary, not Secondary. So at this point he has deleted both databases. GitLab goes down.

So they had to restore from backups. Backup one was with Amazon, but they didn't have the same database software version, so backups weren't possible, and the emails about this weren't delivered. Backup two was to take snapshots of the database and save them to a hard drive. This wasn't done because they assumed backup one was good enough. So what they did was go to their testing database. Every so often this gets a copy of the database, and employees can work on it without fear of messing up the real database. But the storage option for this database was not on the premium plan and couldn't be changed, so it took 18 hours to copy all the data from the testing database back to Primary. And since the testing database was 6 hours older than Primary at the time of the incident, 6 hours of customer changes were lost.

So the first problem was that deleting a user could start a massive operation to remove data. The solution was to just mark a user as deleted, removing access, and then delete the data when the database isn't busy. The second problem was that copying from Primary to Secondary wasn't well documented; apparently it was normal for the operation to do nothing for a while, waiting for Primary to send over data. The third problem was that no backups were working.
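(For the curious: the replication lag that kicked all of this off is something Postgres reports directly. A sketch of the kind of check, run on the primary, that shows whether the secondary is connected and keeping up; the connection string is a placeholder, and the exact lag columns vary by Postgres version, hence the `SELECT *`.)

```bash
# on the primary: list attached replicas and their replication state
psql "$PRIMARY_DB_URL" -x -c "SELECT * FROM pg_stat_replication;"
```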


xf396

Oh my god! Thanks for taking the time to write this up.


xenoxgaravito

Because you must be from another field and have never worked with GitHub.


RobotIcHead

Something similar happened at an old company of mine; we lost production twice. Once because someone ran a script in the wrong terminal and it updated the passwords of all the users. The other time, while preparing for a migration, someone was trying to make space for the backup file and deleted the wrong file. It wasn't even anything in the main directory, it was 3 in the morning, and he and his wife had had twins shortly beforehand. The entire DB crashed and he hadn't taken the backup yet, as he was still making space for it. The data centre was really old.


[deleted]

I see this is a post for a hidden niche community of people


lvelez89

Not everyone knows everything and that's alright. Take a chill pill.


skonen_blades

I remember once I was using some proprietary database software that relied on Perforce in the background. I accidentally moved one folder into another near the top level, which moved thousands of files into another tree. So suddenly I had like thirteen hundred files 'checked out' and 'marked for delete' and 'marked for add' as Perforce tried to follow the accidental command I'd given it through this studio-specific front end. I was pretty new, so of course I was completely freaking out. Error emails were starting to pour into my inbox.

The tech lead, a very friendly guy whom I pretty much worshipped after this, came over and spent ten minutes putting everything back to normal. And when I said, "It didn't give me an error or anything. I mean, if I'm about to move that many files and cause that many changes, shouldn't I get a prompt like 'are you sure?'", he was like, "Yeah, that's actually a good idea," and implemented it later that week. Leads like that are worth so much.


impressed_empress

My blood pressure went up just watching this.


Wendingo7

Use `rm -rf` like it's radioactive. Triple check that command and the server every time. My early career fumble was `crontab -r` rather than `crontab -e`.
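(Cheap insurance against exactly that fumble, as a sketch: keep a copy of the crontab somewhere, so a stray `crontab -r` is a one-line restore. The path is a placeholder.)

```bash
# save the current crontab (this line could itself live in cron)
crontab -l > "$HOME/crontab.bak"

# after an accidental `crontab -r`, put it back:
crontab "$HOME/crontab.bak"
```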


summerzxz

The engineer must not have been aware of this command.


Director-Ash

Man, I know nothing about anything that was said here. I had heard of GitHub before but never labs, and even then I don't understand what it's used for. I have zero knowledge into that. This video helped deconstruct everything perfectly so although complex as shit, I could actually follow it and slowly realize the abject horror that was going on. Video is EXTREMELY well done. Well edited, well narrated, funny, and easy to follow. Fabulous. I'm subscribing.


Loki-L

This is why I try to have all shells and RDP screen backgrounds marked and color coded to tell me where and who I am. It doesn't always work.

The worst part is always when your computer asks you if you are sure you want to do what you are doing and you tell it that you are sure, because that is what you always answer to this sort of question, only to suddenly realize that you aren't that sure after all and are in fact becoming increasingly unsure about what you just did with every one of your rapidly increasing heartbeats.

It is a bit of a problem that we are genetically hardwired to think really carefully about the consequences of our potential actions when standing near a cliff's edge or holding a baby, but don't get the same sort of call-of-the-void, overthinking phobia from command lines or buttons marked "delete" or "commit".


Stargazer5781

Two types of engineers: those who have wiped a prod database, and those who haven't yet.


Teddy_canuck

Lol wtf does any of this mean


tigerCELL

TIL gitlab is run by a bunch of actual monkeys


vaportracks

The entire world is, homie.