> I always try to assume breach in my thought processes, but I recognize that this leads to overengineered solutions because sometimes the mitigation is not worth the cost.
I agree with this mindset; I do the same. But at the same time, yes, you do have to realize that sometimes it's not worth it. For instance, there are two broad types of attacker you might encounter: a capable nation-state, and a drive-by botnet using known exploits and weak passwords to grab the low-hanging fruit. If you are patched and using strong passwords, you aren't going to be affected by the drive-by botnet. If you are patched and using MFA and strong credentials, a zero-day sat on by a nation-state is going to plow through anyway. At that point they have gotten into the outer ring as a user, and you are trying to protect against privilege escalation. The protections here that are actually going to work are strong process control and integrity checking (Windows), mandatory access control systems (SELinux), or just basic user siloing and not running things as privileged accounts (either OS). Most of that comes down to the OS design itself or the architecture of the process.
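To make the user-siloing point concrete, here is a minimal sketch of what "not running things as privileged accounts" can look like as a systemd unit on Linux. The service name and paths are hypothetical, not from anything above; the directives themselves are standard systemd sandboxing options:

```
# /etc/systemd/system/appsvc.service (illustrative example)
[Service]
# Run as a dedicated unprivileged account instead of root,
# so compromising the service is not an instant root shell
User=appsvc
Group=appsvc
# Block the process (and children) from gaining new privileges, e.g. via setuid
NoNewPrivileges=true
# Mount most of the filesystem read-only for this service
ProtectSystem=strict
# Hide user home directories from the service
ProtectHome=true
# Give the service its own /tmp, isolated from other processes
PrivateTmp=true
```

This layers with SELinux rather than replacing it: the MAC policy constrains what the compromised process can touch, while the unit keeps it from being root in the first place.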
So we go to privilege escalation exploits. Take this year: at time of writing it's March 2022, and I have been patching nothing but privilege escalation flaws on Linux machines (I don't admin Windows, so I don't know that landscape) all year. It's only been three months. There's no short supply of them being discovered, and many of them are mildly, moderately, or entirely mitigated just by using SELinux. Some of them go all the way past it, though, so sometimes it can be futile.
So the nation-state threat in almost any case will likely have the ability to jump right past all of that to root level with a zero-day. So what about in between? Well, if you are stockpiling or developing zero-days, the costs tend to add up quickly, or you just get locked out entirely because they get patched. Your skills also ramp up pretty quickly as an exploit hunter. So you either develop a strong foothold or you fall out of the criminal world entirely. I'm sure it's probably the most paranoia-driven and stressful "job" to have while you are striving not to completely fall apart and get locked out as defenses ramp up, or get locked up (not that trying not to get hacked isn't paranoia-driven enough).
I also want to emphasize: you REALLY don't want to get compromised AT ALL at this point. Patching is probably the best way to avoid that, and the most important step. The reason being, you can't necessarily prove that you have kicked the attacker out after you think you have, short of completely wiping the machine, and even then you have no idea whether they got as far as a firmware exploit (in the case of a nation-state), which is among the more terrifying classes of exploit being discovered and sought after.
But regardless, if you find out that you've been compromised and you're using a random password, you're going to change that password anyway if you are doing things right.
> I don't use SSH certificates at work because they really don't make sense for me when I am using a strong credential already (HSMs)
And that's a great point, too. HSMs are a great way to secure SSH as it is, and they use the same or similar cryptography as SSH certs, as long as they are well implemented.
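For reference, OpenSSH can use a key held on an HSM or smart card directly through a PKCS#11 module, no certificates required. A minimal sketch; the host pattern is made up, and the module path is an assumption that varies by device and distro:

```
# ~/.ssh/config (illustrative example)
# Use a private key that never leaves the HSM/smartcard;
# the provider path below depends on your device and distro
Host prod-*
    PKCS11Provider /usr/lib/x86_64-linux-gnu/opensc-pkcs11.so
```

The same module can also be loaded into the agent with `ssh-add -s <module path>`, so the PIN is entered once per session rather than per connection.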
What comes to mind for me, as a complicated environment where SSH certs don't help, is an inter-organizational setup where you have to make a connection work over multiple crazy hops. For instance: an end user's laptop has to connect to Citrix from home, then RDP into a local machine in organization A, then, over an existing IPsec tunnel, use OpenVPN software to VPN into organization B, then SSH into a server in organization B. Organization B just did things using OpenVPN and then SSH; the rest had to be tacked on due to the client's environment. Real-world example. The best option in this case was for organization B to use YubiKeys in OTP mode, typing AES-encrypted one-time passwords as keyboard input through the multiple connections. Organization B had no control over organization A's infrastructure, and no ability to tell them to stop doing anything the way they were doing it, but had to consider the security implications of those systems anyway because the "client" was working in this environment. Then there was the issue of training the users: explaining SSH certs OR keys to them would have been impossible. Telling them to hit a button was hard enough.
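A rough sketch of what the server side of that YubiKey OTP setup might look like with `pam_yubico`. To be clear, this is my reconstruction, not organization B's actual config; the API client id, key, and mapping file path are all placeholders:

```
# /etc/pam.d/sshd (illustrative example)
# Validate the Yubico OTP before the rest of the auth stack runs.
# id= and key= come from your Yubico API credentials (placeholders here);
# authfile maps local usernames to allowed YubiKey public IDs
auth required pam_yubico.so id=12345 key=<api-key> authfile=/etc/yubikey_mappings mode=client

# /etc/ssh/sshd_config — let PAM drive an interactive prompt
KbdInteractiveAuthentication yes
UsePAM yes
```

The appeal in that multi-hop scenario is exactly that the OTP is just keystrokes: it survives Citrix, RDP, and VPN layers unchanged, with nothing to install on machines organization B doesn't control.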
I've heard much crazier stories from the military involving piping encrypted sessions over satellite and jumping them over cable connections, etc. (including patching live Super Bowl feeds over serial connections for officers, which are always fun stories, especially the copyright issues involving the government in the '80s and the fudged justifications). But when multiple organizations, multiple connections, or international links are involved, there are just some things you can't control every single detail of. This is only going to get more complicated as remote work gets adopted more widely, so these old stories of network insanity are extremely useful to sysadmins dealing with application-level connectivity now.
Long story short, sometimes that thing you think is engineered terribly has a reason for it. Usually it involves stupid logistical nightmares, weird requirements, or bureaucratic/legal hoop-jumping. It's only going to get worse, too.