My startup just got its SOC 2 Type 2 compliance, and the process felt a lot harder than it should have been. I took the opportunity to write up everything I wish someone had told me about getting SOC 2 Type 2 as a small startup, from timeline to costs to pitfalls, so that other folks wouldn't fall into the same traps I did. Hope it's helpful!


Yep, that's also a huge issue with LLMs in production. Our product automatically detects jailbreak and prompt-injection attempts so you can see when people are trying them, but hallucinations are the biggest unsolved problem imo.


Yeah, that's definitely another way to solve the issue. Of course, that can add a ton of operational complexity and means that you are responsible for fixing or upgrading the model if and when any security issues or other problems come up. And you can't use OpenAI, obviously.

Do you have a favorite infra for hosting models?


People have been complaining about AI models surreptitiously changing underneath them for a while now, and we found evidence of it happening in the wild. We built an LLM monitoring and testing tool called Libretto, and this week we saw GPT-4o start to behave significantly differently on one of our prompts. This is a write-up of how we detected the change and what it means for building on top of LLMs that can change at any moment.
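
To give a flavor of the kind of check involved, here's a minimal sketch of drift detection over a frozen test set. This is illustrative only, not Libretto's actual implementation; callModel stands in for whatever chat-completion client you use, and the 0.05 threshold is arbitrary:

    // Illustrative sketch: re-run a frozen test set against the model on a
    // schedule and flag when accuracy shifts. callModel is a placeholder for
    // whatever chat-completion client you use.
    type CallModel = (model: string, prompt: string) => Promise<string>;

    interface TestCase {
      prompt: string;
      expected: string;
    }

    async function passRate(callModel: CallModel, model: string, cases: TestCase[]): Promise<number> {
      let passed = 0;
      for (const c of cases) {
        const output = await callModel(model, c.prompt);
        if (output.trim() === c.expected) passed++;
      }
      return passed / cases.length;
    }

    // Compare today's pass rate against a stored baseline; a large drop is a
    // hint that the hosted model changed underneath you.
    async function checkForDrift(callModel: CallModel, cases: TestCase[], baseline: number): Promise<void> {
      const today = await passRate(callModel, "gpt-4o", cases);
      if (baseline - today > 0.05) {
        console.warn(`Pass rate fell from ${baseline.toFixed(2)} to ${today.toFixed(2)} -- model may have changed`);
      }
    }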


Hey there, I'm the founder & CEO of Libretto, which is building tools to automate prompt engineering, and we have a new post about some experiments we ran to see whether the performance of few-shot examples translates across LLMs.

We took a prompt from Big Bench and created a few dozen variants of our prompt with different sets of few-shot examples, with the aim of checking whether the best-performing examples for one model would also be the best-performing examples for another. Most of the time, the answer was no, even when comparing different versions of the same model.

The annoying conclusion here is that we probably have to optimize few-shot examples on a model-by-model basis, and that we have to re-do that work whenever a new model version is released. If you want more detail, along with some pretty scatterplots, check out the post!
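
For a rough idea of the shape of the experiment, here's a simplified sketch; it's not our actual harness, and runPrompt and the exact-match scoring are stand-ins for your model client and whatever grading you actually use:

    // Simplified sketch: score every few-shot variant on every model, then
    // check whether the winning variant is the same across models. runPrompt
    // is a placeholder for your model client; exact match stands in for
    // whatever grading you actually use.
    type RunPrompt = (model: string, prompt: string) => Promise<string>;

    interface Variant {
      name: string;
      fewShotBlock: string; // the few-shot examples, already formatted as text
    }

    interface TestInput {
      input: string;
      expected: string;
    }

    async function bestVariantPerModel(
      runPrompt: RunPrompt,
      models: string[],
      variants: Variant[],
      tests: TestInput[],
    ): Promise<Map<string, string>> {
      const best = new Map<string, string>();
      for (const model of models) {
        let topScore = -1;
        for (const v of variants) {
          let correct = 0;
          for (const t of tests) {
            const out = await runPrompt(model, `${v.fewShotBlock}\n\n${t.input}`);
            if (out.trim() === t.expected) correct++;
          }
          if (correct > topScore) {
            topScore = correct;
            best.set(model, v.name);
          }
        }
      }
      return best; // if the winners differ by model, the examples didn't transfer
    }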


Hey there, I'm the founder of a company called Libretto, which is building tools to automate prompt engineering, and I wanted to share this blog post we just put out about empirical testing of few-shot examples.

We took a prompt from Big Bench and created a few dozen variants of our prompt with different few-shot examples, and we found a 19-percentage-point difference between the worst and best sets of few-shot examples. Funnily enough, the worst-performing set was the one whose examples all happened to have one-word answers, and the LLM seemed to learn that replying with a one-word answer was more important than actually being accurate. Sigh.

Moral of the story: which few-shot examples you choose matters, sometimes by a lot!
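
If you want to picture the failure mode, here's a toy illustration (not the actual Big Bench prompt) of how the exemplars end up in front of the model; when every exemplar answer is a single word, the model tends to imitate the brevity rather than focus on correctness:

    // Toy illustration (not the actual Big Bench prompt): few-shot exemplars
    // are concatenated ahead of the real question, so the model imitates
    // their style as well as their content.
    interface Example {
      question: string;
      answer: string;
    }

    function buildPrompt(examples: Example[], question: string): string {
      const shots = examples
        .map((e) => `Q: ${e.question}\nA: ${e.answer}`)
        .join("\n\n");
      return `${shots}\n\nQ: ${question}\nA:`;
    }

    // A set where every exemplar answer happens to be one word -- analogous
    // to the set that scored worst in our tests.
    const oneWordExamples: Example[] = [
      { question: "What color is a clear daytime sky?", answer: "Blue" },
      { question: "What is the capital of France?", answer: "Paris" },
    ];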


imaginary.dev is a project I built to allow web developers to use GPT to easily add AI features to their existing web user interfaces. All a developer does is declare a function prototype in TypeScript with a good comment saying what the function should do, and then they can call the function from other TypeScript and JavaScript code, even though they've never implemented the function in question. It looks something like:

    /**
     * This function takes in a blog post text and returns at least 5 good titles for the blog post.
     * The titles should be snappy and interesting and entice people to click on the blog post.
     *
     * @param blogPostText - string with the blog post text
     * @returns an array of at least 5 good, enticing titles for the blog post.
     *
     * @imaginary
     */
    declare function titleForBlogPost(blogPostText: string): Promise<Array<string>>;

Under the covers, we've written a TypeScript and Babel plugin that replaces these "imaginary function" declarations with runtime calls to GPT asking what the theoretical function would return for a particular set of inputs. So it's not using GPT to write code (like Copilot or Ghostwriter); it's using GPT to act as the runtime. This gives you freedom to implement things that you could never do in traditional programming: classification, extraction of structured information out of human language, translation, spell checking, creative generation, etc.
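
Calling it is then just like calling any other async function (hypothetical usage; draftText is whatever blog post text you have on hand):

    // Hypothetical usage: the plugin supplies the implementation, and the
    // call goes out to GPT at runtime.
    const titles = await titleForBlogPost(draftText);
    console.log(titles); // an array of at least 5 suggested titles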

Here's a screencast where I show off adding intelligent features to a simple blog post web app: https://www.loom.com/share/b367f4863fe843998270121131ae04d9

Let me know what you think. Is this useful? Is this something you think you'd enjoy using? Is this a good direction to take web development? Happy to hear any and all feedback!


We've recently released a new library that implements a concept called "Imaginary Programming" for TypeScript. The idea is that you can incorporate Large Language Models (LLMs) like GPT into more traditional webapps by simply defining a function prototype with a good comment but without an implementation. Our library then finds those functions in your code and replaces them with a runtime call to GPT, asking what such a function would return for particular arguments if it existed.

It's sort of a mind-bending paradigm, but I think it's really delightful to use from a programmer's perspective. Just write out the function name, the arguments that the function takes in, the JSON-compatible type it returns, and a descriptive comment saying what the function should do, and voila. You have yourself a function you can call, and it actually works.

To be clear, this is not the same thing as GitHub Copilot. Copilot is great at writing code for you, but Imaginary Programming doesn't write code. Rather, Imaginary Programming uses GPT as the runtime for your function. This means you can solve problems that you never would be able to code normally, answering questions like "How angry is this customer email?", "What are all the proper names in this email?", or "Come up with a good name for this song playlist." Copilot lets you code the same things faster; Imaginary Programming lets you do things that were previously impossible to code in TypeScript/JavaScript.
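
For instance, the angry-email question above might be written as an imaginary function something like this (an illustrative sketch; the function name and the 1-5 scale are made up, and the comment wording is up to you):

    /**
     * Rates how angry the customer sounds in the given email, from 1 (calm)
     * to 5 (furious).
     *
     * @param emailText - the full text of the customer email
     * @returns a number from 1 to 5
     *
     * @imaginary
     */
    declare function rateCustomerAnger(emailText: string): Promise<number>;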

I'm interested in seeing what developers make of this, so I'd love for folks to try it out, either by installing the library or using our online playground: https://playground.imaginary.dev/ . Any and all feedback is very much appreciated!


Three other (somewhat related) reasons I’ve heard stated by FB folks:

1. Makes it easier to identify talent out in the world that they want to go after.

2. Raises the technical reputation of Facebook, making it easier to get candidates to say yes.

3. Makes it easier to ramp new employees up on Facebook’s stack if they are already familiar with a lot of it from open source.

