I work at a medium-sized startup, where I own an internal self-serve analytics service with a reasonable load. I am the sole owner of this service, which is thousands of lines of code. Since I am the only one working on it, I make all the architecture decisions, do the DevOps (setting up Step Functions and API Gateways, granting access, setting up staging infra), write feature code, raise PRs, review and deploy them myself, test on staging, and run sanity checks on production. However, I have often fumbled while releasing new features, most recently today, which resulted in downtime for many users throughout the day. Even though I come up with creative ideas and write good code, the optics of my work are getting bad because of the bad releases. The reasons for the bad releases are:

1. I did not do enough load-testing.
2. Since this service is constantly updating, I frequently fumble with git, like accidentally pushing testing code/hardcoding onto prod.
3. There are lots of flows in the service, so I end up missing one of them when testing.
4. Other notable issues, like bad queries from the analytics team.
I have always been the kind of guy who is creative and authentic and who believes in "move fast and break stuff". However, my company expects me to move fast without breaking anything. I was also told that I make small mistakes from time to time. (Most recently, I forgot to turn on a cron job on prod.) I do love my work, and I think the decisions I make and the code I write are of high quality, but these issues are affecting my optics. (I have trouble concentrating for long hours. I work in short bursts, if that's worth anything.) How do I overcome this situation and improve my optics and my confidence? Thanks.
Load test constantly. My policy is to (almost) never develop using "sample data". Instead, I take a very large example of real world data (say 95th percentile of what is actually used in the wild) and develop with that as my backing data. If operations are slow enough for me to be annoyed in development, clearly they will be too slow for the (many more) people who have to work with the project once complete.
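If you want numbers rather than just "annoyed or not", a few lines of code go a long way. Here is a minimal sketch using only the Python standard library. The names `run_load_test`, `percentile` and `fake_endpoint` are made up for illustration, and `fake_endpoint` stands in for whatever call hits your service with that large real-world dataset behind it.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed(call):
    """Run `call` once and return how long it took, in seconds."""
    start = time.perf_counter()
    call()
    return time.perf_counter() - start


def run_load_test(call, n_requests=200, concurrency=20):
    """Fire `n_requests` calls using `concurrency` threads; return sorted latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: timed(call), range(n_requests)))
    return sorted(latencies)


def percentile(sorted_values, pct):
    """Nearest-rank style percentile of an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("no measurements")
    index = min(len(sorted_values) - 1, int(len(sorted_values) * pct / 100))
    return sorted_values[index]


def fake_endpoint():
    """Hypothetical stand-in: replace with a real request against staging."""
    time.sleep(0.001)
```

Something like `percentile(run_load_test(fake_endpoint), 95)` then gives you a single number to compare before and after each change.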
> 2. Since this service is constantly updating, I frequently fumble with git. like accidentally pushing testing code/hardcoding onto prod.
Lock the `main` branch and only allow commits to it through PRs. Review your own PRs.
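Reviewing your own PRs gets easier if a script does the boring part of looking for leftovers. The following is only a sketch of that idea: it scans a unified diff for marker comments on added lines. The marker words and the function name `find_leftovers` are my assumptions, so it only helps if you adopt the habit of tagging temporary code with one of them.

```python
import re

# Hypothetical markers: tag any temporary or hardcoded line with one of these.
MARKERS = re.compile(r"\b(TODO_REMOVE|DO NOT MERGE|HARDCODED|DEBUG_ONLY)\b")


def find_leftovers(diff_text):
    """Return (diff_line_number, text) for added lines that carry a marker."""
    hits = []
    for line_no, line in enumerate(diff_text.splitlines(), start=1):
        # Added lines start with "+"; "+++" is the file header, not content.
        if line.startswith("+") and not line.startswith("+++"):
            if MARKERS.search(line):
                hits.append((line_no, line[1:].strip()))
    return hits
```

Feed it the output of `git diff main...HEAD` before opening the PR, and treat any hit as a reason not to merge yet.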
> 3. There are lots of flows in the service, so missing out on testing one of them.
Does making a change in one flow tend to adversely affect seemingly unrelated others? That might be an engineering shortcoming you should address. Beyond that, the answer is automated testing. Some stacks allow "recording" a flow and then automatically checking that the same flow still works on every PR. See point 2.
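Even without a recording tool, you can get most of the benefit by keeping a table of flows with known inputs and expected outputs and running all of them on every PR. A rough sketch follows. The two flows (`build_report`, `export_csv`) and the sample rows are invented for illustration, not taken from your service.

```python
def build_report(rows):
    """Hypothetical flow: summarise rows into a small report."""
    return {"count": len(rows), "total": sum(r["amount"] for r in rows)}


def export_csv(rows):
    """Hypothetical flow: render rows as CSV text."""
    lines = [f'{r["id"]},{r["amount"]}' for r in rows]
    return "\n".join(["id,amount", *lines])


ROWS = [{"id": 1, "amount": 10}, {"id": 2, "amount": 5}]

# Each entry: (name, flow function, input, output captured from a known-good run).
RECORDED_FLOWS = [
    ("report", build_report, ROWS, {"count": 2, "total": 15}),
    ("csv", export_csv, ROWS, "id,amount\n1,10\n2,5"),
]


def run_recorded_flows(flows=RECORDED_FLOWS):
    """Replay every flow and return a list of (name, expected, actual) mismatches."""
    failures = []
    for name, flow, inputs, expected in flows:
        actual = flow(inputs)
        if actual != expected:
            failures.append((name, expected, actual))
    return failures
```

The point of the table is that adding a flow to the service means adding one line here, so no flow gets forgotten when you are testing in a hurry.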
> 4. other notable issues like bad queries from analytics team
There are no bad queries, only insufficient validation, timeouts, and/or load balancing.
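To make the validation and timeout part concrete, here is a small sketch. It uses SQLite purely because it ships with Python; I don't know what database you actually run, and most databases have their own statement timeout setting that you would use instead. The names `run_guarded`, `validate` and `QueryRejected` and the limits are assumptions.

```python
import sqlite3
import time

MAX_ROWS = 1000          # assumed cap on rows returned to the caller
TIMEOUT_SECONDS = 2.0    # assumed per-query time budget


class QueryRejected(Exception):
    """Raised when a query fails validation or is aborted."""


def validate(sql):
    """Allow a single read-only statement; reject everything else."""
    stripped = sql.strip().rstrip(";")
    if not stripped.lower().startswith(("select", "with")):
        raise QueryRejected("only read-only queries are allowed")
    if ";" in stripped:
        raise QueryRejected("multiple statements are not allowed")
    return stripped


def run_guarded(conn, sql, timeout=TIMEOUT_SECONDS, max_rows=MAX_ROWS):
    """Validate `sql`, run it with a deadline, and return at most `max_rows` rows."""
    statement = validate(sql)
    deadline = time.monotonic() + timeout
    # SQLite calls the handler every 1000 VM steps; a non-zero return aborts the query.
    conn.set_progress_handler(lambda: 1 if time.monotonic() > deadline else 0, 1000)
    try:
        return conn.execute(statement).fetchmany(max_rows)
    except sqlite3.OperationalError as exc:
        raise QueryRejected(f"query aborted: {exc}") from exc
    finally:
        conn.set_progress_handler(None, 0)  # clear the handler again
```

With a wrapper like this in front of the database, a runaway query from the analytics team costs that one query instead of the whole service.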