This entry made me smile. First, the title with its language caught my eye. Then, as I reached the end of the line and saw the domain name, my tired and sleepy face broke into a full-bodied grin.
fish is my most-favorite tongue-in-cheek technical blogger ever. He doesn't blog too often, but his posts are almost always very useful (and very witty!) gems.
(Hint: you can find out who he really is by reading back far enough.)
Linux, FreeBSD, Solaris:1 and Hackers' Lounge:2 were the first major time sinks of my digital life in the late 90s and early 2000s. ychat and ymsg were my intro to network protocols. I learned Makefiles, C, C++, and Python so I could build, understand, and modify programs like curfloo, curphoo, zinc, and gchat+/gyach. Although I moved on to IRC, I'll never forget when someone passed me "Smashing the Stack for Fun and Profit" (and the implicit link to Phrack) or the first time a bot I created logged in and took drink orders. Or the pride I felt the first time I wrote a patch for my client to handle a login change all by myself.
A lifetime later, I know I wouldn't be where I am as a programmer if not for the encouragement of friends (some of whom I still talk with and do business with) I made as a teenager on Yahoo Chat. A lot of people I know grew up on IRC, but Yahoo Chat was my first love. Goodbye, I'll miss you!
I had a really similar experience. As a bored teenager in the late 90s, I would sometimes log into the Yahoo chat rooms using the computers at my high school. The chat client was a Java applet, and it was atrociously slow and prone to crashing. I had started teaching myself to program a few years earlier, so creating a chat client seemed like a fun project.
There was a guy who hung out in the Programming:1 chat room whose chat handle was 127001 (he was able to create that handle before Yahoo became stricter about the chat handles one could create) and who had reverse-engineered the protocol and posted a guide online. I wonder what happened to him. Thanks, loopy!
Anyway, I created a Yahoo chat client in Win32/C as I was learning the API and the language. I was the only user for quite a while, but eventually I posted it using the free web space EarthLink gave me for being a dial-up subscriber. It got pretty popular through word of mouth. Later, when I installed Linux for the first time, I took that code and turned it into an ncurses-based client. It was a lot of fun, and somehow I ended up turning the fun I had hacking on little projects like that into a living.
> or the first time a bot I created logged in and took drink orders.
My first experience with programming was writing bots in the mIRC scripting language. After first getting things to work the feeling was truly empowering. I have to say though, my next attempt in programming was in C and I found it extraordinarily difficult to get started in. Having said that, I was 14 at the time.
I think the fact that my first programming experience was with a scripting language really affected how I think about programming now. Ever since then I've found it easier to learn scripting languages than systems languages. Maybe that doesn't make sense to some, but that's just me.
At Microsoft, I once wrote code for a public facing login system where you had to make sure usernames didn't contain some banned words. I was surprised by how many words on the list were new to me and how, err, creative people get.
I always wondered who at Microsoft was tasked with keeping that list 'up to date'.
These lists can be very locale-insensitive. A few years back my friend was trying to get a Gmail ID with his name. To his surprise he wasn't able to get any ID, no matter how long and obscure a suffix he tried. Then it hit me that it was because of his name, Kshitij.
Systems should take terms from other languages and locales as white lists in these cases.
The word filter on the PS3 chat is my favourite example of this. It was almost impossible to chat with friends during a gaming session when, for example, "Akuma" (a Street Fighter character and a perfectly safe word in Japanese as well) gets censored into "a---a", and several Finnish words get completely mutilated: "pelata" (to play; "pela" is something offensive, I guess?), "mutta" ('but'; apparently it considers "mutt" a bad word), anything containing "ass" ("ssa" is the inessive suffix, and Finnish has vowel harmony. Fun.), and lots of others.
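The over-matching described above is what you get when a filter replaces banned strings anywhere they occur, rather than only as whole words. Here is a minimal sketch of the difference. The banned list is a guess inferred from the examples in this comment, and both function names are made up for illustration; nothing here is the PS3's actual implementation.

```python
import re

# Hypothetical banned list, inferred from the examples above (an assumption).
BANNED = ["kum", "mutt", "ass"]


def censor_substrings(text, banned):
    """Dash out every occurrence of a banned string, even inside longer words."""
    for word in banned:
        text = re.sub(re.escape(word), "-" * len(word), text, flags=re.IGNORECASE)
    return text


def censor_whole_words(text, banned):
    """Dash out banned strings only when they appear as standalone words."""
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in banned) + r")\b"
    return re.sub(pattern, lambda m: "-" * len(m.group()), text, flags=re.IGNORECASE)


print(censor_substrings("Akuma", BANNED))    # A---a
print(censor_substrings("maassa", BANNED))   # ma---a ("maassa" = "in the country")
print(censor_whole_words("Akuma mutta maassa", BANNED))  # unchanged
```

Whole-word matching avoids mangling innocent words, at the cost of being trivially bypassed by gluing the banned word onto something else, so neither variant is a real fix.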
Analogous to that, in Germany they don't bleep out any swearwords on television, which is why American celebrities appearing on TV here swear verdammt oft. For example, this Eminem interview
When I watch Misfits (UK show) they say some things that I never hear on American shows. It only happens on shows that are after the watershed[1] though. Before then I don't know which parts are censored.
It always amuses me a little how Americans seem to be stricter than Europeans about media censorship.
Since this has been downvoted, allow me to explain. The parent to my previous comment implies that Europeans are more tolerant of swearing than Americans, but they (or at least their governments) verge on paranoia when it comes to any mention of Hitler or the Third Reich.
No, there's nothing against mentioning Hitler or the Third Reich. As is well known on the Web, full movies about Hitler and the Third Reich have been made even in Germany, and they weren't banned, nor were their producers arrested.
The town of Cumming GA is subject to some interesting objections. Entering the address, one system refused with the comment "My, what colorful language you have."
It's not just swear-words: A credit card processor I used to work with wouldn't accept a payment from Mr. Echo because they regarded it as an attempt to hack them.
It's not online-only either. I had university classes with a guy whose family name sounds like a swear word, and since to attend exams you had to preregister in a big dead-tree book left unattended, teachers usually assumed it was a joke and didn't account for him.
At my father's server farm, they had a Scottish colleague called "Ronald McDonald". That regularly led to hotlines hanging up when he tried to order replacement parts...
Oh man, when I worked there in Localisation for Office, there were crazy long lists for every language. It was so much fun to read through them with foreign friends!
Not multilingual, but open source, is the lovely collection of lists included with Dan's Guardian (In configs\lists\phraselists in the source, available from http://sourceforge.net/projects/dansguardian/ )
Maybe someone should ask Dan if we can stick them in plain text somewhere easily accessible for future projects to reference. I know we don't really need to ask, but it's polite.
We could then let the moderators of certain subreddits, or 4chan, have commit access to add new, creative terms of profanity!
I think this would be a humorous list to have around, if only to find out what people are being offended by these days. I have a feeling that, for every term, phrase or word you care to quote, there'll be someone who's offended by it.
Which brings me to the point that there has been a fair bit of debate over the use of profanity filters (a few good links at http://stackoverflow.com/questions/273516/how-do-you-impleme...) and how effective they are. One of the references in the link above is for a 14yo circumventing a profanity filter (based on a white-list) with the phrase "I want to stick my long-necked Giraffe up your fluffy white bunny."
That's brilliant! Are you sure you wouldn't consider providing an updated 'top 100' list as a service to anyone who felt they could use them in some hitherto-unknown way?
Also, hope it's not a sore point, but are Google still being unreasonable about the citations on your site? If so, is there anyone from Google reading this that can have a look into this for him? It seems a bit unfair (to say the least) that a dictionary is penalized for citing sources, surely that's just responsible editorial! (Link with some info if you're interested http://onlineslangdictionary.com/pages/google-panda-penalty/)
> Are you sure you wouldn't consider providing an updated 'top 100' list as a service to anyone who felt they could use them in some hitherto-unknown way?
I'd absolutely love to. But with Google penalizing the site for the majority of the past 2 years, I've become extra-sensitive about content on my site being available anywhere else on the web. In another world, I'd be ecstatic any time I came across material sourced from the site. But as it stands, I've given some thought to filing my first DMCA requests - thus becoming part of that chilling effect that gave chillingeffects.org its name.
I have put the data to some good use. http://www.offensivest.com/books/ ranks English works in the Project Gutenberg corpus by vulgarity. The site desperately needs some TLC: at the least, tweaks to the methodology and a page explaining what that methodology is.
I have more ideas for using the data, but I spend 90% of my time trying to get rid of the penalties.
So...
> Also, hope it's not a sore point, but are Google still being unreasonable about the citations on your site?
I'm sorry to hear that. I hope someone with some influence realizes the silliness (not to belittle the situation) of this whole affair. I presume sites like the Urban Dictionary (http://www.urbandictionary.com/) get away scot-free by not providing any source links at all!
You have my promise, at least, that I won't reproduce any of your work (until you deem it good to go) except in the form of drunken pub factoids :-)
I love, for example, that the complete works of William Shakespeare is currently number 4 on the most vulgar books list!
It's certainly interesting as a list of sex slang but the ranking doesn't make much sense. A lot of these are ridiculous phrases and somehow "come down" as a simple euphemism is rated worse than a bunch of variants of the F word. (or "come down" as in drugs losing effect but that's even more baffling to be one of the most offensive words)
Please, no. Dumb people are already dumb enough without another list to refer to. My city recently renamed a street because somebody who worked at a company on that street found "Morning Glory" (a common enough flower) in Urban Dictionary.
I've always considered "money shot" to be some gambling term until one of the topics here on HN had me look it up in Urban Dictionary. I think UD is a bit like a medical textbook - sometimes it's better for you not to look into it too much, if you're not a professional.
It really would be the best repo. I can see the commit logs being the stuff of legends!
I don't see much use in profanity filters on the net these days, but it is definitely useful for businesses working with external teams just to sanity check content before publishing :)
Yeah, but then again I love occasionally coming across the rant of some angry dev in source comments (see http://www.vidarholen.net/contents/wordcount/ for an analysis of profanity in the Linux kernel source).
Oh yes, in source & commits it's fair game! I'm talking more about editorial content that might have been outsourced. E.g. how-to articles for a company product or articles written in-house that are localised by an external vendor. These things need to be quite clean!
It's possible that they used a Bloom filter to compactly represent the list of banned words. This would allow them to share the filter without explicitly sending the list of banned words, so you'd never see a "foul-mouthed" network packet. But it could also return false positives for random strings, like perhaps "sffcei".
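For anyone unfamiliar with the idea, here is a minimal Bloom filter sketch. To be clear, the comment above is speculation, and nothing here reflects how Yahoo actually stored its list. The class name, the `size_bits` and `num_hashes` parameters, and the salted SHA-256 hashing scheme are all assumptions picked for illustration.

```python
import hashlib


class BloomFilter:
    """Compact set membership: no false negatives, but possible false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        data = item.lower().encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # "Present" only if every derived bit is set.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


banned = BloomFilter()
for word in ["darn", "heck"]:  # placeholder words, not a real list
    banned.add(word)

print("darn" in banned)  # True: added words are always found
```

Only the bit array needs to be shipped to the client, so the words themselves never appear on the wire. The trade-off is that an unrelated string can happen to hit bits that are all set, which would show up as a harmless word being rejected for no visible reason. The smaller the filter relative to the list, the more often that happens; in the degenerate case of a one-bit filter, every string matches once anything has been added.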
I know that "88" is used as a surrogate for "Heil Hitler" in neo-Nazi circles (H is the 8th letter of the alphabet), but "Realist" doesn't ring any bells.
As long as you don't combine it with other words you should be fine. In my example, Realist referred to "racial realism". I chose a relatively subtle example because many of them I find repulsive to even type, but a less subtle username I've seen about the place is "ChuckSpears". That's the kind of thing I'm talking about.
ChuckSpears is a direct reference to a racist insult, "Spear Chucker". They don't generally block names like this but I often wonder where the line is.
In human-moderated systems like XBox Live I've seen some relatively 'sophisticated' offensive names get called out and banned by the moderators.
I was wondering how username-blacklist systems worked, and how they deal with names which are obviously offensive to anyone who's dealt with the likes of Stormfront et al. before, but could theoretically be chosen by a totally innocent unfortunate.
I see what you're getting at and agree it could be offensive but it could also be Charles Spears. At the same time someone could name an account CharlesSpears and probably no one would find it offensive.
It is interesting to think about where the line should be drawn. Especially in human moderated cases.
Yeah, I figured out what the insult was, I just have no idea what group it is applied to. If you need to know about it beforehand, it's not a very strong insult.
I can't answer your question, but in this example, it's the context, not that name, that is offensive. I doubt it's even linguistically possible to make a reliable classifier.
If you're going to spam the system, why not just use Win32 APIs and control their client like a zombie instead of trying to play the cat and mouse game of reverse engineering their protocol?
You can send instructions to manipulate the client and have it do what you want almost as easily as you could if you knew their protocol. And that's without all the cat and mouse headaches.
I used to write bots for Yahoo's (and a few others') chat rooms back in the mid 90s. Nothing malicious, I was just a bored teenager, so we're talking pretty dumb stuff like spamming naughty words in public chat rooms, etc. Anyhow, my point was this: the bots I wrote did just what you described. I used Win32 APIs to control the chat clients. As you said, it took only a few seconds to bang out the code, and the bot could work against multiple different chat clients with very little additional work.
The only time I've ever bothered to write my own chat protocol client was for IRC in the late 90s (at which point I'd left college and was working full time, so I had turned my programming skills to good), and the only reason I bothered to write my own IRC client was that I was utterly fed up with the quality of Windows clients.
Back in the old days, Win32 APIs were so insecure that you could have all sorts of fun with them. I'm not sure how things stand these days though; my development these days is almost exclusively Linux and UNIX based.
I think he's asking how it got so obfuscated in the first place. Seemingly, the spammers could just pilot a standard up to date client with a bot, rather than try and figure out the protocol.
I'm sure it's more complicated than that however, like client side rate limiting or something.
I bet they would put the rate limit in the client, since they did that with the banned word list, but rate limiting would make more sense at the server. Maybe they did both.
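As a rough illustration of what server-side rate limiting could look like, here is a token-bucket sketch. This is a generic technique, not anything known about Yahoo's servers; the class name, parameters, and the injectable `clock` argument are assumptions made so the example is easy to follow and test.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if one more message may be sent right now."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One bucket per connected user: 1 message per second, bursts of up to 2.
bucket = TokenBucket(rate_per_sec=1, capacity=2)
print([bucket.allow() for _ in range(3)])
```

Keeping the bucket on the server means a modified or scripted client cannot simply skip the check, which is the point being made above about where the limit belongs.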
Now you could fire up a new VM for each client or use a botnet to do your bidding. Oh, how the Internet has advanced.
You can basically read anything from controls, and trigger callbacks at will, as if you had actually clicked a button or written some text. This means that you can write "expect-like" software -- just start up the program, and have another program read input from its text fields and issue commands to it.
I have actually done a lot of this, running old sourceless Win32 and Win16 programs in the background on virtual machines on the server and building new web-based interfaces on top of them.
Actually it ranges from simple event spoofing (user clicked here, user dragged there) to injecting a DLL + spawning a thread under your control.
Event spoofing is pretty limited, while having a thread under your control gives you full power, as you have full access to the process's memory and can call any function you want.
That's what I meant to use it for but the API is used at the root of applications to draw windows, handle mouse click events, accept keyboard input, create icons in the system tray and anything else that would involve Windows UI.
In the same way applications use the win API to create their UI, others could use it to manipulate and control the interface of other programs. It's powerful.
What a charming and informative website! So refreshing to see an original design instead of another Wordpress, etc. site. Check out this fascinating note on how grep manages to be so fast by being secretly slow:
I had to work on implementing a fairly well-known advertiser's banned-words list in our system recently. Their system is incredibly stupid. For example, they'll ban the word "Dimethltryptamine", but "Dimethyltryptamine" (the correct spelling) is allowed. You're not allowed to use the word "bulldog" under the Pets category. You can't use the word "dragon" under Jobs. It seems like they just have a form that visitors to the site can use to link ads containing "offensive" words and the system will automatically add it to a list.
Awesome! Network sniffing and analyzing is one of my favorite hobbies since being introduced to it as a security analyst for an MSSP and then writing Wireshark plugins at Sandia. If anyone is looking for a fun Saturday, I highly recommend http://forensicscontest.com/.
Thanks for pointing this out, I had really never noticed that before. I feel slightly silly now -- obviously the nyud.net was just being passed in the query string...
Perhaps they did filter messages on the server as well. The client side word-list could be used for display purposes only, so if the user typed a censored word, the client could update the display with that word redacted and without waiting for a round-trip to the server.
They didn't -- the list was sent from the server, because "this list might need to be updated dynamically, in case someone on the Internet managed to think up a new word for sex."
The list sent was supposed to be used by the client to filter. What I think the OP is suggesting is that they should have just filtered server side not client side.
I suppose British people never really used the chat to talk about paedophiles. At least they looked for actual words rather than any occurrence in the string.