I've been working on a similar product. Users can select between Streamlit/Shiny: https://editor.ploomber.io/ - so not necessarily for BI (although you can use it for that), but more broadly focused on data apps.
who is a great technologist with a lot of hands-on experience. If it made sense to leverage papermill, he would have done so and focused on something else.
I'd say how much accuracy is good enough depends heavily on your use case. For something that still has to be reviewed by a human, I think even 0.7 is great; if you're planning to automate processes end-to-end, I'd aim for higher than 0.95.
author here: I'm working on a follow-up post where I benchmark pre-processing techniques (to reduce the token count). Turns out, removing all HTML works well (much cheaper and doesn't impact accuracy). So far, I've only tried gpt-4o and the mini version, but trying other models would be interesting!
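For anyone who wants to try the "remove all HTML" step themselves, here is a minimal sketch using only the Python standard library. It is not the exact code from the benchmark; the names `strip_html` and `_TextExtractor` are made up for illustration, and dropping `<script>`/`<style>` contents is my own assumption about what "removing all HTML" should cover.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html: str) -> str:
    """Return the text of an HTML document with tags removed.

    Hypothetical helper: whitespace is collapsed to single spaces so the
    result is as compact as possible before it is sent to a model.
    """
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style></head>"
        "<body><h1>Title</h1><p>Hello &amp; <b>world</b></p>"
        "<script>var x = 1;</script></body></html>"
    )
    text = strip_html(page)
    print(text)
    print(f"{len(page)} chars before, {len(text)} chars after")
```

Character count is only a rough proxy for token count, so measure with the tokenizer of the model you actually use. Also note that text fragments are joined with spaces, so a word split by inline tags (e.g. `wor<b>ld</b>`) will come out as two words.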