The claim that you can just throw 200k tokens at the model to get the best answer for smaller datasets goes against my experience. I commonly find that as my prompt gets larger, the output becomes less consistent and the instruction following gets worse. Does anyone else experience this, or is there a well-known way to avoid it? It seems to happen at well under 25k tokens.