2snakes's comments

Pooh. Winnie the Pooh. <3

It's possible the parent comment was thinking of the Soviet version, Винни-Пух, Vinni Pukh. The drawing style is different than Disney's, but also really cute.

I read one characterization: LLMs don't produce new information (except from the point of view of the user who is learning); they reorganize old information.


Custodians of human knowledge.


That’s only true if you tokenize words rather than characters. Character tokenization generates new content outside the training vocabulary.


All major tokenisers have explicit support for encoding arbitrary byte sequences. There's usually a consecutive range of tokens reserved for 0x00 to 0xFF, and you can encode any novel UTF-8 words or structures with it. Including emoji and characters that weren't a part of the model's initial training, if you show it some examples.
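To make the byte-range idea concrete, here is a minimal sketch. It assumes a hypothetical layout where token IDs 0 to 255 map directly to bytes 0x00 to 0xFF; real tokenisers put this range at different offsets, and `BYTE_TOKEN_OFFSET`, `encode_bytes` and `decode_bytes` are names made up for illustration.

```python
# Assumption for illustration: token IDs 0..255 are reserved for raw bytes.
# Real tokenisers place the byte range elsewhere in the vocabulary.
BYTE_TOKEN_OFFSET = 0


def encode_bytes(text: str) -> list[int]:
    """Encode any string as a sequence of single-byte token IDs via UTF-8."""
    return [BYTE_TOKEN_OFFSET + b for b in text.encode("utf-8")]


def decode_bytes(token_ids: list[int]) -> str:
    """Turn single-byte token IDs back into text."""
    return bytes(t - BYTE_TOKEN_OFFSET for t in token_ids).decode("utf-8")


# Cyrillic and emoji round-trip even though no "word" token exists for them.
ids = encode_bytes("Винни-Пух 🐻")
assert all(0 <= t <= 255 for t in ids)
assert decode_bytes(ids) == "Винни-Пух 🐻"
```

The point is only that 256 byte tokens are enough to spell any UTF-8 string, so nothing is strictly "out of vocabulary" at the byte level.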


Pretty sure that we’re talking apples and oranges. Yes to the arbitrary byte sequences used by tokenizers, but that is not the topic of discussion. The question is whether the tokenizer will come up with words not in the training vocabulary. Word tokenizers don’t, but character tokenizers do.

Source: Generative Deep Learning by David Foster, 2nd edition, published in 2023. From “Tokenization” on page 134.

“If you use word tokens: …. will never be able to predict words outside of the training vocabulary.”

"If you use character tokens: The model may generate sequences of characters that form words outside the training vocabulary."
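A toy illustration of the two quotes, as a sketch only: the corpus and the helper names `expressible_with_words` and `expressible_with_chars` are invented here and are not from the book. It only checks which outputs are representable under each vocabulary, not what a trained model would actually generate.

```python
# Toy training corpus (made up for illustration).
corpus = "the cat sat on the mat"

# Word-level vocabulary: every distinct whitespace-separated word.
word_vocab = set(corpus.split())
# Character-level vocabulary: every distinct character, including the space.
char_vocab = set(corpus)


def expressible_with_words(word: str) -> bool:
    """A word-token model can only emit words it has a token for."""
    return word in word_vocab


def expressible_with_chars(word: str) -> bool:
    """A character-token model can emit any word spelled from seen characters."""
    return all(ch in char_vocab for ch in word)


# "moat" never appears in the corpus, but m, o, a and t all do.
assert not expressible_with_words("moat")
assert expressible_with_chars("moat")
# "dog" is out of reach for both, because "d" and "g" were never seen.
assert not expressible_with_chars("dog")
```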


Those tokens won't come up during training, but LLMs are capable of in-context learning. If you give the model some examples of how to create new words or characters in this manner as part of the prompt, it will be able to use those tokens at inference time. Show it some examples of how to compose a Thai or Chinese sentence out of byte tokens, give it a description of the hypothetical Unicode range of a custom alphabet, and a sufficiently strong LLM will be able to output the right bytes even though those codepoints don't technically exist.

And like I said, single-byte tokens very much are a part of word tokenisers, or to be precise, their token selection. "Word tokeniser" is a misnomer in any case - they are word piece tokenisers. English is simple enough that word pieces can be entire words. With languages where you have numerous suffixes, prefixes, and even in-fixes as a part of one "word" (as defined by "one or more characters preceded or followed by a space" - because the truth is more complicated than that), you have not so much "word tokenisers" as "subword tokenisers". A character tokeniser is just a special case of a subword tokeniser where the length of each subword is exactly 1.
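The "special case" claim can be sketched with a toy greedy longest-match tokeniser. This is a simplification in the spirit of word-piece matching, not BPE and not any particular library's algorithm; `tokenize` and both vocabularies are hypothetical.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation of text into pieces from vocab."""
    max_len = max(len(piece) for piece in vocab)
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in vocab:
                pieces.append(candidate)
                i += size
                break
        else:
            # Nothing matched: fall back to emitting the single character.
            pieces.append(text[i])
            i += 1
    return pieces


# Subword vocabulary: multi-character pieces plus single characters.
subword_vocab = {"token", "is", "er", "s", "t", "o", "k", "e", "n", "i", "r"}
assert tokenize("tokenisers", subword_vocab) == ["token", "is", "er", "s"]

# Character vocabulary: the same algorithm with every piece of length 1.
char_vocab = set("tokenisers")
assert tokenize("tokenisers", char_vocab) == list("tokenisers")
```

Same code path in both cases; only the maximum piece length in the vocabulary differs.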


Why stop there? Just have it spit out the state of the bits on the hardware. English seems like a serious shackle for an LLM.


Kind of, but character-based tokens make it a lot harder and more expensive to learn semantics.




-Your own- yes, but Cloudflare is extremely easy.


Hayy ibn Yaqdhan. Nature vs. nurture and the relative nature of intelligence, iirc.


Disgusting!


Real yoga is possible.


That has what to do with this person saying a death is sad?


Bodies and minds can be transcended beyond suffering.


Mochi / Henry Meds. Mochi is the cheapest.


I just went through the quiz at Mochi and it said I was eligible for their nutrition program but not medication. The FAQ says your BMI has to be over 30 or 27 if you have some other health condition.


Take my advice at your own risk, but nobody is checking your math.

I was 10 pounds or so from qualifying, so I fudged my numbers a bit. Didn't make sense to force myself to gain weight so I could lose weight.

Places like OrderlyMeds don't even require a telehealth visit, just the questionnaire and a photo.


One thing LLMs can do besides generating code is explain complex code. So that is inherently an upskilling feature.


Our company never adds comments, because the code speaks for itself. And with genAI I can have these comments added in very little time, helping me get an overview of what happens.

But as for the "why are we doing this in the first place" question, business documentation is usually outside the source code and therefore out of reach of any genAI, for now.

As for what senior devs should do when coding:

> They're constantly:

> Refactoring the generated code into smaller, focused modules

> Adding edge case handling the AI missed

> Strengthening type definitions and interfaces

> Questioning architectural decisions

> Adding comprehensive error handling

Ain't nobody got time for that! The one girl and other guy that could do this, because they know the codebase, have no time to do it. Everyone else works by doing just enough, which is nearly what TDD dictates. And we have PR code review to scrape up quality to barely get maintainable code. And never overcomplicate things, since writing code that works is good enough. And by the time you want to refactor a module three years later, you would want to use another data flow style or library to do the work altogether.


Yes, there was also a thing known as the Great Media Debate in the 90's iirc. MOOCs have a less than 10% completion rate. Media by itself is not the answer. What makes a big difference is teaching techniques and things like superhuman spaced repetition and system adaptation. See how Math Academy does it: https://www.mathacademy.com/pedagogy


I have yet to take a MOOC that is marginally engaging and teaches something new.

My only guess is that they are there for the certs.


Psychology of hunger behaviors.

