
There are many others that are better.

1/ The Annotated Transformer Attention is All You Need http://nlp.seas.harvard.edu/annotated-transformer/

2/ Transformers from Scratch https://e2eml.school/transformers.html

3/ Andrej Karpathy has a really good series of intros: https://karpathy.ai/zero-to-hero.html Let's build GPT: from scratch, in code, spelled out. https://www.youtube.com/watch?v=kCc8FmEb1nY GPT with Andrej Karpathy: Part 1 https://medium.com/@kdwa2404/gpt-with-andrej-karpathy-part-1...

4/ 3Blue1Brown: But what is a GPT? Visual intro to transformers | Chapter 5, Deep Learning https://www.youtube.com/watch?v=wjZofJX0v4M Attention in transformers, visually explained | Chapter 6, Deep Learning https://www.youtube.com/watch?v=eMlx5fFNoYc Full 3Blue1Brown Neural Networks playlist https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_6700...


I’m not a data engineer but work in an adjacent role. Is there anyone here who could dumb the use case down? Maybe an example of a problem this solves. I am struggling to understand the value proposition here.

We use openwhisper for transcription which accepts a list of "words to look out for" which we populate with a short list of the names of all the people and companies most likely to be mentioned in the text, and then we do a spell checking pass at the end using Gemini with a much longer list, telling it to look out for anything that might be a misspelling.

It's not perfect, but it's taken it from being an issue that made all our transcripts look terrible, to an issue I no longer think about.

I imagine just using the second spellchecking pass with Gemini would be almost as effective.
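
A minimal sketch of that two-pass setup (the names, prompt wording, and spell-check instruction here are all hypothetical; the first function plays the role of the "words to look out for" list fed to the transcriber, the second builds what you'd send to Gemini):

```python
# Hypothetical two-pass correction for transcripts: bias the transcriber
# with a short glossary, then spell-check against a longer name list.
KNOWN_NAMES = ["Acme Oy", "Jane Doe"]  # example entries

def build_initial_prompt(names):
    # A Whisper-style "initial prompt": preceding context that nudges the
    # decoder toward these spellings.
    return "Glossary: " + ", ".join(names)

def build_spellcheck_prompt(transcript, long_name_list):
    # Second pass: ask an LLM to fix likely misspellings and nothing else.
    return (
        "Fix any likely misspellings of the following names in the "
        "transcript below, changing nothing else:\n"
        + "\n".join(long_name_list)
        + "\n---\n"
        + transcript
    )
```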


The article doesn't really give helpful advice here, but please don't vibe this.

Create evals from previous issues and current tests. Use DSPy on prompts. Create hypotheses for the value of different context packs, and run an eval matrix to see what actually works and what doesn't. Instrument your agents with Otel and stratify failure cases to understand where your agents are breaking.
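
A toy harness for the eval-matrix idea (everything here is a stand-in: `run_agent` is a stub you'd replace with a real agent invocation, and the prompt/context-pack names are hypothetical):

```python
from itertools import product

def run_agent(prompt, context_pack, case):
    # Stand-in for a real agent call; returns pass/fail for one eval case.
    return context_pack in case and prompt in case

def eval_matrix(prompts, context_packs, cases):
    # Run every prompt x context-pack combination over the eval set and
    # record the pass rate, so you can see which variants actually work.
    results = {}
    for p, cp in product(prompts, context_packs):
        passed = sum(run_agent(p, cp, c) for c in cases)
        results[(p, cp)] = passed / len(cases)
    return results
```

Stratifying the failing cases per cell (rather than just the rate) is what tells you where agents break.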


And congratulations to any of today's lucky ten thousand who are just learning of the Principal-Agent Problem.

https://en.wikipedia.org/wiki/Principal%E2%80%93agent_proble...


This remains one of the best explanations on the topic: https://fabiensanglard.net/floating_point_visually_explained... I saw this when I had just started using HN, and posts like this inspired me to stick with it: https://news.ycombinator.com/item?id=29368529

I was recently in the market for one of these! I ended up going with https://github.com/dbohdan/recur due to the nice stdout and stdin handling. Though this has stdout/stderr pattern matching for failures which is nice too!

Is anybody making smart glasses that are just a display? For me, the rest of the feature set verges on being anti-features. I'd much rather have a very rudimentary display that my phone or another device could send relatively low-bandwidth data to over Bluetooth or some other protocol, and build from there.

Having a camera or a mic on the glasses themselves seems like something I'd mostly want to avoid for privacy, and having a speaker just seems like gilding the lily when we already have a variety of headphones to choose from.


Years ago, I often struggled to choose between Amazon products with high ratings from a few reviews and those with slightly lower ratings but a large volume of reviews. I used the Laplace Rule of Succession to code a browser extension to calculate Laplacian scores for products, helping to make better decisions by balancing high ratings with low review counts. https://greasyfork.org/en/scripts/443773-amazon-ranking-lapl...
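
The rule of succession score might be computed like this (a sketch; converting an average star rating into an implied count of "positive" reviews is one simple modeling choice, not necessarily what the extension does):

```python
def laplace_score(avg_rating, num_reviews, max_rating=5):
    # Treat the average rating as a fraction of "successes" over the
    # reviews, then apply Laplace's rule of succession:
    # (successes + 1) / (trials + 2).
    successes = (avg_rating / max_rating) * num_reviews
    return (successes + 1) / (num_reviews + 2)
```

This naturally pulls sparsely-reviewed items toward 50%, so a 4.6-star product with 1000 reviews outranks a 5.0-star product with 3 reviews.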

Periodic reminder to disable npm install scripts.

    npm config set ignore-scripts true [--global]
It's easy to do both at the project level and globally, and these days there are quite few legit packages that won't work without them. For those that won't, you can create a separate installation script in your project that cds into that folder and runs their install script.

I know this isn't a silver-bullet solution to supply chain attacks, but so far it has been effective against many attacks through npm.

https://docs.npmjs.com/cli/v8/commands/npm-config


I am implementing OAuth right now, along with OIDC. I must say that for such a simple concept, getting to the facts that help you actually implement it is insanely hard. I have no idea why, but everywhere I look, the material only scratches the surface and you get no tangible information you can use to actually implement it in code. I ended up mostly browsing the specs, and Grok was insanely helpful for explaining the meaning of various things where information was lacking or buried deep in documentation/specifications. I would say this was the first time I actually appreciated these new "AIs", which I otherwise don't use at all.
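
For the record, the tangible core of the authorization-code flow is tiny once extracted from the specs. A minimal sketch (endpoint URLs and credentials are placeholders; a real implementation also needs PKCE, state verification, and token/ID-token validation):

```python
from urllib.parse import urlencode

def build_authorize_url(authorize_endpoint, client_id, redirect_uri, state,
                        scope="openid"):
    # Step 1: redirect the user's browser to the authorization endpoint.
    return authorize_endpoint + "?" + urlencode({
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,  # CSRF protection: verify this on the callback
    })

def build_token_request(code, client_id, client_secret, redirect_uri):
    # Step 2: back-channel POST body exchanging the code for tokens.
    return {
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": redirect_uri,
        "client_id": client_id,
        "client_secret": client_secret,
    }
```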

It's been many years without social networks for me. At gatherings, I'm often the only one without a phone in my hand, and it feels strange. Eventually, the "phoners" make eye contact and chat a bit, usually about something they all saw on a screen. But it never lasts. They always go back to the screen. It seems silence and quiet time make them uncomfortable. Even in a formal business meeting, screens are open, and attention is lost.

Will decentralized social networks fix this plague? I don't think so. The only thing that works is disconnecting. Just a few weeks into it, you'll realize you have so much free time. Time for hobbies, time for loved ones, time for finding peace and joy, time for creating and sharing. You will regret the thousands of hours wasted on that useless addiction. A few months in, you'll hear the birds singing again. You'll notice the evening skies. You'll find comfort and joy. The time you get back will help you build incredible things.


Julian Hyde (Apache Calcite, Google) gave a crisp presentation on this and how SQL could express 'measures' to bridge the gap: https://communityovercode.org/wp-content/uploads/2023/10/mon...

> A semantic layer, also known as a metrics layer, lies between business users and the database, and lets those users compose queries in the concepts that they understand. It also governs access to the data, manages data transformations, and can tune the database by defining materializations.

There's also now a paper: https://arxiv.org/pdf/2406.00251


So is this like Valetudo[0] but for mowers? Very cool! I wonder how much overlap / shared code there is between robot vacuums and robot mowers.

[0]: https://valetudo.cloud/


Recently tried out the new GEPA algorithm for prompt evolution with great results. I think using LLMs to write their own prompt and analyze their trajectories is pretty neat once appropriate guardrails are in place

https://arxiv.org/abs/2507.19457

https://observablehq.com/@tomlarkworthy/gepa

I guess GEPA is still a preprint and predates this survey, but I recommend taking a look due to its simplicity.
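
To give a flavor of the idea, here's a toy evolutionary prompt loop (the real GEPA uses LLM-written reflections over execution trajectories and Pareto-based candidate selection; `evaluate` and `mutate` here are stand-in stubs):

```python
import random

def evaluate(prompt, evalset):
    # Fraction of eval cases the prompt "passes" (stubbed as a substring
    # check; in practice this would run the agent and score its output).
    return sum(case in prompt for case in evalset) / len(evalset)

def mutate(prompt, feedback):
    # In GEPA an LLM rewrites the prompt using trajectory feedback;
    # here we just append the feedback text.
    return prompt + " " + feedback

def evolve(seed_prompt, evalset, generations=5):
    best, best_score = seed_prompt, evaluate(seed_prompt, evalset)
    for _ in range(generations):
        candidate = mutate(best, random.choice(evalset))
        score = evaluate(candidate, evalset)
        if score > best_score:  # keep only strict improvements
            best, best_score = candidate, score
    return best, best_score
```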


I work at Google on these systems every day (caveat: these are my own words, not my employer's). So I can simultaneously tell you that it's smart people really thinking about every facet of the problem, and that I can't tell you much more than that.

However I can share this written by my colleagues! You'll find great explanations about accelerator architectures and the considerations made to make things fast.

https://jax-ml.github.io/scaling-book/

In particular your questions are around inference which is the focus of this chapter https://jax-ml.github.io/scaling-book/inference/

Edit: Another great resource to look at is the unsloth guides. These folks are incredibly good at getting deep into various models and finding optimizations, and they're very good at writing it up. Here's the Gemma 3n guide, and you'll find others as well.

https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-...


Model cards, for the people interested in the guts: https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7...

In my mind, I’m comparing the model architecture they describe to what the leading open-weights models (Deepseek, Qwen, GLM, Kimi) have been doing. Honestly, it just seems “ok” at a technical level:

- both models use standard Grouped-Query Attention (64 query heads, 8 KV heads). The card talks about how they’ve used an older optimization from GPT3, which is alternating between banded window (sparse, 128 tokens) and fully dense attention patterns. It uses RoPE extended with YaRN (for a 131K context window). So they haven’t been taking advantage of the special-sauce Multi-head Latent Attention from Deepseek, or any of the other similar improvements over GQA.

- both models are standard MoE transformers. The 120B model (116.8B total, 5.1B active) uses 128 experts with Top-4 routing. They're using some kind of gated SwiGLU activation, which the card describes as "unconventional" because of its clamping and the residual connections that implies. Again, they're not using any of Deepseek's "shared experts" (for general patterns) + "routed experts" (for specialization) architectural improvements, Qwen's load-balancing strategies, etc.

- the most interesting thing IMO is probably their quantization solution. They quantized >90% of the model parameters to the MXFP4 format (4.25 bits/parameter) to let the 120B model fit on a single 80GB GPU, which is pretty cool. But we've also got Unsloth with their famous 1.58-bit quants :)
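
Rough arithmetic for why that fits (back-of-envelope only; it ignores the KV cache, activations, and the unquantized <10% of weights):

```python
# ~4.25 bits/param (4-bit MXFP4 values plus shared block scales) over
# 116.8B total parameters vs. an 80 GB GPU.
total_params = 116.8e9
bits_per_param = 4.25
weight_bytes = total_params * bits_per_param / 8
print(weight_bytes / 1e9)  # roughly 62 GB of weights, under the 80 GB budget
```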

All this to say, it seems like even though the training they did for their agentic behavior and reasoning is undoubtedly very good, they’re keeping their actual technical advancements “in their pocket”.


I've been using Open WebUI and have been blown away, it's a better ChatGPT interface than ChatGPT!

https://github.com/open-webui/open-webui

Curious how this compares to that, which has a ton of features and runs great


Happy long term user, great project. Here is a list of Open Source Apps, I use to replace Google stuff:

  Aurora Store - Anonymized frontend for Playstore
  F-Droid - Open Source App Store
  Obtainium - App Store for other sources (e.g. github)
  Organic Maps - Open Source navigation (not as good as proprietary ones though)
  SherpaTTS - Text to speech for Organic Maps
  PDF Doc Scanner - Little Trickster, Open Source document scanner
  Binary Eye - Barcode reader
  K9 Mail / FairMail - Mail client
  LocalSend - Cross Platform File Transfer
  Syncthing Fork - Catfriend1 Syncthing fork to sync files
  VLC Media Player - media player
  KOReader - ebook reader
  Voice - Paul Woitaschek, local audiobook player
  AudioBookShelf - Remote audiobook player
  Immich - image backup
  Fossify File Manager - file manager
  Substreamer / DSub - Audio streamer for navidrome self hosted server
  OpenCamera - Open Source camera app
I wish I had this list from the start... Hope it helps someone :-)

Defining async is hard. And I'm writing this as one of the many people who designed async in JavaScript.

I don't quite agree with the definition in this post: just because it's async doesn't mean that it's correct. You can get all sorts of user-land race conditions with async code, whether it uses `async`/`await` (in languages that need/support it) or not.
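
For instance, a classic check-then-act race survives `async`/`await` intact (illustrated here with a minimal asyncio sketch for brevity; the same interleaving happens with JS promises):

```python
import asyncio

# A user-land race: two tasks both read `balance`, await, then write back,
# losing one deposit even though everything is async/await.
balance = 0

async def deposit(amount):
    global balance
    current = balance           # read
    await asyncio.sleep(0)      # yield to the event loop mid-update
    balance = current + amount  # write back a stale value

async def main():
    global balance
    balance = 0
    await asyncio.gather(deposit(10), deposit(10))
    return balance

print(asyncio.run(main()))  # 10, not 20: one update was lost
```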

My latest formulation (and I think that it still needs work) is that async means that the code is explicitly structured for concurrency.

I wrote some more about the topic recently: https://yoric.github.io/post/quite-a-few-words-about-async/ .


I love how these stories always start with “I just wanted to scratch my own itch” and end with “...and now I’m running a company with a payroll bigger than my old day job.” It’s inspiring, but also a little bit intimidating. Makes you wonder how many potential seven-figure ideas are just sitting in people’s “maybe someday” folders. The real lesson here? Ship something, even if it’s ugly. You can’t optimize what doesn’t exist.

They don't link to the Form S-1 prospectus from their announcement, but it's publicly available at https://www.sec.gov/Archives/edgar/data/1579878/000162828025...

Their highlighted metrics page: $821M LTM revenue, 46% YoY revenue growth, 18% non-GAAP operating margin, 91% gross margin.

It's an incredible success story, and the engineering they did upfront (primarily led by co-founder Evan Wallace) that set the stage for their success is the stuff of legends. https://madebyevan.com/figma/ has links to numerous blog posts breaking it down, but here are some choice quotes:

> [Evan] developed the hybrid C++/JavaScript architecture for Figma's editor that made it possible to build a best-in-class design tool in the browser. The document representation and canvas area is in C++ while the UI around the canvas is in JavaScript (the team eventually settled on TypeScript + React for this). This let us heavily optimize the document representation to reduce memory usage and improve editing speed while still using modern UI technologies for fast iteration on our UI. C++ development was done using Xcode (not in the browser) to provide a better debugging environment.

> Even though the contents of Figma documents are similar to what HTML can display, Figma actually does all of its own document rendering for cross-browser consistency and performance. Figma uses WebGL for rendering which bypasses most of the browser's HTML rendering pipeline and lets the app work closely with the graphics card. The rendering engine handles curve rendering, images, blurs, masking, blending, and opacity groups, and optimizes for high visual fidelity.

> [Evan] developed Figma's multiplayer syncing protocol, worked on the initial version of the multiplayer live collaboration service (a kind of specialized real-time database), and added multiplayer syncing support to Figma's existing editing application. The initial version was written in TypeScript but [he] later ported it to Rust for improved performance and stability.

It's a great reminder that it's not premature optimization if your UI's fluidity is your distinctive feature and your calling card! And the business acumen to turn this into such a wildly successful product, in the context of competitors with kitchen-sink feature lists, can't be overstated either. I have an incredible amount of respect for this team, and they should inspire all of us to tackle ambitious projects.


As far as I'm aware, this is the largest Normalizing Flow that exists, and I think they undersold their work by not mentioning this...

Their ImageNet model (4_1024_8_8_0.05[0]) is ~820M while AFHQ is ~472M. Prior to that there is DenseFlow[1] and MaCow[2], which are both <200M parameters. For more comparison, that makes DenseFlow and MaCow smaller than iDDPM[3] (270M params) and ADM[4] (553M for 256 unconditional). And now, it isn't uncommon for modern diffusion models to have several billion parameters![5] (from this we get some numbers on ImageNet-256, which allows a direct comparison, making TarFlow closer to MaskDiT/2 and much smaller than SimpleDiffusion and VDM++, both of which are in billions. But note that this is 128 vs 256!)

Essentially, the argument here is that you can scale (Composable) Normalizing Flows just as well as diffusion models. There are a lot of extra benefits you get in the latent space too, but that's a much longer discussion. Honestly, the TarFlow method is simple and there are probably a lot of improvements to be made. But don't take that as a knock on this paper! I actually really appreciated it, and it shows what it set out to show. The real point is just that no one has trained flows at this scale before, and this really needs to be highlighted.

The tldr: people have really just overlooked different model architectures

[0] Used a third party reproduction so might be different but their AFHQ-256 model matches at 472M params https://github.com/encoreus/GS-Jacobi_for_TarFlow

[1] https://arxiv.org/abs/2106.04627

[2] https://arxiv.org/abs/1902.04208

[3] https://arxiv.org/abs/2102.09672

[4] https://arxiv.org/abs/2105.05233

[5] https://arxiv.org/abs/2401.11605

[Side note] Hey, if the TarFlow team is hiring, I'd love to work with you guys


With transcribing a talk by Andrej, you already picked the most challenging case possible, speed-wise. His natural talking speed is already >=1.5x that of a normal human. He's one of the people for whom you absolutely have to set your YouTube speed back down to 1x to follow what's going on.

In the idea of making more of an OpenAI minute, don't send it any silence.

E.g.

    ffmpeg -i video-audio.m4a \
      -af "silenceremove=start_periods=1:start_duration=0:start_threshold=-50dB:\
                         stop_periods=-1:stop_duration=0.02:stop_threshold=-50dB,\
                         apad=pad_dur=0.02" \
      -c:a aac -b:a 128k output_minpause.m4a -y
will cut the talk down from 39m31s to 31m34s, by replacing any silence (with a -50dB threshold) longer than 20ms by a 20ms pause. And to keep with the spirit of your post, I measured only that the input file got shorter, I didn't look at all at the quality of the transcription by feeding it the shorter version.

You need some technical specs on the website. How many DOF does it have? Does it have joint angle sensing? If so, what's the resolution? What's the interface to the servos? What's the payload capacity? Does it have integrated motor controllers? How long is it, and what does the dexterous workspace look like?

As a roboticist, what I'd vote for, in order, is:

- more degrees of freedom

- interchangeable tools, either an actual tool changer (unlikely at the price point) or a fixed bolt pattern with electronic passthroughs

- better joint sensing, e.g. absolute encoders, joint torque sensing

- fingertip force sensing


Nice. There's also a good VS Code plugin for doing this: https://marketplace.visualstudio.com/items?itemName=csholmq....

And of course, markdowntools (multiple conversion tools): https://www.markdowntools.com/


I don't believe the document does a great job in explaining what is otherwise a very simple idea (assuming I understood it well):

1. It creates a bitmap where each bit is a pixel in the image, if from frame 0 to frame 1 a given pixel changed, the corresponding bit is 1, otherwise it is 0.

2. All the 1s are added to the bloom filter, hashing their offsets. Now the bloom filter will be positive for all such indexes plus a percentage of false positive indexes.

3. We query the bloom filter to see all the indexes that are positive, and for all such pixels we store the raw pixel data of what changed. So we can reconstruct the next frame easily.

You can think of this as storing the delta between two frames as the x,y,r,g,b of all the pixels that changed, but compressing the x,y part a lot at the cost of storing a bit more r,g,b than needed.

I have the feeling that since the pixels that change from frame 0 to frame 1 are often similar (in their location) to those that will change from frame 1 to frame 2, there is the possibility of further compressing that as well, by setting the right flags in the next frame and storing verbatim only the offsets that changed relative to the previous delta, or similar.
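
The three steps above can be sketched as a toy codec (assuming my reading is right; the filter size, hash count, and one-value-per-pixel representation are arbitrary choices here):

```python
import hashlib

FILTER_BITS = 1 << 12  # Bloom filter size (toy value)
NUM_HASHES = 3

def hashes(offset):
    # k independent hash positions for a pixel offset.
    for i in range(NUM_HASHES):
        h = hashlib.blake2b(f"{i}:{offset}".encode(), digest_size=4)
        yield int.from_bytes(h.digest(), "big") % FILTER_BITS

def encode(prev, curr):
    # Step 1+2: add every changed pixel's offset to the Bloom filter.
    bloom = bytearray(FILTER_BITS // 8)
    for off, (a, b) in enumerate(zip(prev, curr)):
        if a != b:
            for h in hashes(off):
                bloom[h // 8] |= 1 << (h % 8)
    # Step 3: store a value for every offset the filter reports positive
    # (the true changes plus false positives), in offset order.
    values = [curr[off] for off in range(len(curr))
              if all(bloom[h // 8] >> (h % 8) & 1 for h in hashes(off))]
    return bloom, values

def decode(prev, bloom, values):
    # Reconstruction: overwrite exactly the filter-positive offsets.
    out, it = list(prev), iter(values)
    for off in range(len(prev)):
        if all(bloom[h // 8] >> (h % 8) & 1 for h in hashes(off)):
            out[off] = next(it)
    return out
```

Since encoder and decoder query the filter identically, false positives only cost extra stored values, never a wrong reconstruction.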


I've been having really good results from Jules, which is Google's gemini agent coding platform[1]. In the beta you only get 5 tasks a day, but so far I have found it to be much more capable than regular API Gemini.

[1]https://jules.google/


This solves a major problem that I built an npm package called "pgstrap"[1] for. It generates a "database structure" directory so that my database schema is available to LLMs (it also makes code review easier because you can see the changes to various tables). So I have a SQL file for each table in my database, neatly organized into directories for each schema. Rails has a similar idea with schema.rb

I'm not sure whether it's better to have your editor database-aware or to have your codebase keep the appropriate context committed. On one hand, less generated code/artifacts makes for a cleaner codebase. On the other hand, not everyone uses VS Code or will know how to use this integration. Database browser GUIs have never really had a single winner. That said, VS Code does have enough dominance to potentially make itself "the standard way to view a database in development".

[1] https://github.com/seveibar/pgstrap


You can try it on Android right now:

Download the Edge Gallery apk from github: https://github.com/google-ai-edge/gallery/releases/tag/1.0.0

Download one of the .task files from huggingface: https://huggingface.co/collections/google/gemma-3n-preview-6...

Import the .task file in Edge Gallery with the + bottom right.

You can take pictures right from the app. The model is indeed pretty fast.

