Esperanto is still frustratingly complex with regard to phonemes for an international language. I think most speakers of many European languages don't realize just how complex their phonologies are on average. Slavic languages probably take the cake there with stuff like 5-consonant clusters that can even include sequences of plosives and affricates, but then you also have Germanic languages (and French!) with their insanely large vowel inventories. Compared to that, Esperanto is relatively simple, but when you look outside of Europe, having 3-consonant clusters or phonemic contrast between plosives and affricates at the same place of articulation (e.g. "t" vs "t͡s") is very unhelpful.
That said, it's still a massive improvement on English phonologically. Even if you only consider the simpler American varieties, the three-way æ/ɐ/ɑ distinction alone (as in bat vs but vs bar) is a huge WTF for anyone coming from a typical 5-vowel system. And then you have consonants like θ and ð that don't have clear 1:1 counterparts in most other languages, often not even as allophones of something else that you could point at.
Still, if you want to see what a more modern take on the concept might look like, I believe Globasa (https://www.globasa.net/eng) is the most active project along those lines. Of course, realistically, the likelihood of it actually being adopted as the universal language is effectively nil, but then that's also the case for Esperanto.
The problem of consonant clusters is a bit overstated IMO. I once saw some English speaker complaining about how difficult it is to pronounce Russian, like take a phrase "на встрече" — there are 4 consonants at the start of that second word, wtf! — even though that cluster is exactly the same as the one in the English phrase "of strength"; this cluster even undergoes essentially the same simplification/reduction in both languages.
But I agree that overly large phonemic inventory is a problem. On the other hand, it seems that languages either have a complex consonant system but a simple vowel system; or a simple consonant system but a complex vowel system; I haven't yet seen a language where both systems are simple (Japanese vowels have tonality, so it's not a simple system IMHO), probably because the words in such a language would have to be quite long.
Learned phonotactics matters, though, and many languages distinguish between what's allowed on word or morpheme boundaries vs what's allowed within a single syllable - so even if phonemically it's the same cluster, it can still be difficult to learn to pronounce it correctly in the second case.
There are quite a few languages where both vowel and consonant systems are simple - just look at Polynesian languages such as Māori. The latter has a 5-vowel system, and "long vowels" are phonemically vowel sequences that span moras. But, yes, it does mean that you end up with long words such as "whakararurarutia". That said, it's a rather extreme case, and one can still construct fairly simple but rich consonant systems in practice, because it's basically combinatorics - adding just one more bit of information doubles the domain space! So e.g. if you start with a strict CV syllable structure and allow C(l/r)V, that's about 3x as many contrasting syllables. Make it C(l/r/w/y)V(C) like in Globasa, and even with considerable restrictions on clustering stops etc. this is enough for most words to be 3 syllables or fewer, and for most function words to be 1 syllable.
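To put rough numbers on the combinatorics, here is a quick counting sketch in Python. The inventory (15 consonants, 5 vowels) is made up for illustration and is not Globasa's or any real language's phonology, and the count ignores all phonotactic restrictions, so treat the results as upper bounds rather than real syllable counts.

```python
from itertools import product

# Hypothetical inventory, for illustration only (not a real language's phonology).
CONSONANTS = list("ptkbdgmnszfhlrw")  # 15 consonants
VOWELS = list("aeiou")                # 5 vowels


def count_syllables(onsets, medials, vowels, codas):
    """Count distinct onset + medial + vowel + coda strings.

    An empty string in `medials` or `codas` marks that slot as optional.
    No phonotactic restrictions are applied, so this is an upper bound.
    """
    return len({"".join(parts) for parts in product(onsets, medials, vowels, codas)})


# Strict CV: one consonant followed by one vowel.
cv = count_syllables(CONSONANTS, [""], VOWELS, [""])

# C(l/r)V: optional l or r between the consonant and the vowel.
clrv = count_syllables(CONSONANTS, ["", "l", "r"], VOWELS, [""])

# C(l/r/w/y)V(C): four optional medials plus an optional final consonant.
full = count_syllables(CONSONANTS, ["", "l", "r", "w", "y"], VOWELS, [""] + CONSONANTS)

print(cv, clrv, full)  # 75 225 6000
```

Under these assumptions the optional l/r medial triples the syllable count (75 to 225), and the C(l/r/w/y)V(C) template allows up to 6000 distinct syllables. Real phonotactic restrictions would cut that down considerably, but it still leaves plenty of room for short words.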
Yes, sure, there are a lot of ways in which Esperanto falls short of the ideal of a perfectly neutral, easy-to-learn, easy-to-use means of communication.
Now, the real success of Esperanto is that it has had an active international community for over a century, one that produces its own cultural artifacts using Esperanto as its means of communication. All of that without an army backing it at any point, which is probably a unique feat in human history. Also, to be clear, it was not meant to be a universal language, but an international one.
Personally, I love that projects like Globasa come to life. On a pragmatic level, large-scale adoption is unlikely, but that is the case for any human endeavor. Let's make sure that the low likelihood of grandiose results never keeps beautiful dreams from being pursued.