I assume there's a certain danger in letting it consume data in real time. It wouldn't be hard to trick the web crawler into ingesting undesirable content, and people would quickly start asking it questions like "why is the metro down today?" or "do I need to worry about the hurricane forecast for tomorrow?", which it would struggle with. Not to mention how much AI-generated data is now found across the internet.
It’s a fun test, actually. A beta version of one of my libraries was online before 2021, and when I ask ChatGPT how to use it, the answers are bad, but it clearly knows some correct things. I want to know whether our current documentation for the full release is good enough to close the gap…
With a paid OpenAI account you can fine-tune their models on additional data, so if you're a business trying to offer AI-based chat support or something similar, this seems achievable.
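For reference, the rough flow with the current OpenAI Python SDK is: prepare a chat-format JSONL file of example conversations, upload it, then start a fine-tuning job. A minimal sketch, assuming a support-bot use case (the Q&A pairs, file names, and system prompt here are made up for illustration; check OpenAI's fine-tuning docs for currently fine-tunable model names):

```python
import json
import os

def build_training_file(examples, path):
    """Write chat-format fine-tuning examples to a JSONL file.

    Each example is a (question, answer) pair; the fine-tuning endpoint
    expects one {"messages": [...]} object per line.
    """
    with open(path, "w", encoding="utf-8") as f:
        for question, answer in examples:
            record = {
                "messages": [
                    {"role": "system", "content": "You are a support assistant for our product."},
                    {"role": "user", "content": question},
                    {"role": "assistant", "content": answer},
                ]
            }
            f.write(json.dumps(record) + "\n")

# Hypothetical docs-derived Q&A pairs -- substitute your real support data.
examples = [
    ("How do I install the library?", "Run `pip install mylib` (requires Python 3.9+)."),
    ("Where is the config file?", "It lives at ~/.mylib/config.toml by default."),
]
build_training_file(examples, "train.jsonl")

# The upload and job-creation step needs an API key and the `openai` package.
if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI

    client = OpenAI()
    uploaded = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
    job = client.fine_tuning.jobs.create(
        training_file=uploaded.id,
        model="gpt-3.5-turbo",  # check the docs for currently fine-tunable models
    )
    print(job.id)
```

The base model still answers from its pretraining, so a handful of examples only nudges style and recall; for a real support bot you'd want hundreds of pairs, or retrieval over your docs instead of (or alongside) fine-tuning.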
I believe 2021 was the tipping point after which most new text content online became AI generated, so to avoid training their LLM on other LLMs' output, they restrict the training data cutoff to 2021.