Domain Name Analysis (datagenetics.com)
45 points by pcopley on June 8, 2012 | 26 comments


When coming up with a new business name, sure, it's probably possible to find a suitable name in the .net space, but these days, why bother? Unless the name is unique, you won't find the same name free in the .com space, which is where everyone would probably look first. Better to simply research/brainstorm further and find a name you can acquire/repurchase in the .com arena and bypass all the confusion/customer education.

On that note, I recently launched a new domain search tool called Lean Domain Search [1] which makes finding available .com's infinitely easier than it's ever been. It pairs your search term with 2,500 other keywords commonly found in domain names and instantly shows you which are still available, returning on average 1,200 available domain names per search.
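
A rough sketch of the pairing idea described above, not Lean Domain Search's actual implementation: the keyword list is a made-up stand-in for the 2,500 keywords, the function name is hypothetical, and no availability lookup is performed.

```python
def candidate_domains(term, keywords):
    """Pair a search term with each keyword, as both prefix and suffix."""
    term = term.lower().replace(" ", "")
    candidates = []
    for keyword in keywords:
        for name in (term + keyword, keyword + term):
            domain = name + ".com"
            if domain not in candidates:  # skip duplicates, keep order
                candidates.append(domain)
    return candidates


# Hypothetical keywords; a real tool would use a far larger list.
SAMPLE_KEYWORDS = ["hub", "lab", "my", "get"]

for domain in candidate_domains("coffee", SAMPLE_KEYWORDS):
    print(domain)
```

Each candidate would then still need to be checked against a registry or WHOIS source to see whether it is actually free.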

Given the abundance of great .com's still out there, there is no good reason not to use a .com for your site over any of the other TLDs, especially since, as the author points out, for most normal people websites === .com.

[1] http://www.leandomainsearch.com


I don't know much about the business of domain names, so this may be a dumb question, but... if this service got popular, could you use some of the common (and/or recent) search terms to inform speculative domain purchases? If so, are you concerned that would make your target audience less likely to want to use a 3rd party service like this (b/c they don't want their potential domains snatched up)?


Of course I could do that. As a web app developer, I could also sell details about who my users are, what their passwords are (if I didn't salt+hash them), what actions they've taken on my sites, etc. Both would ruin my businesses and be highly unethical, which is why I would never do either.


"Oops, we ran into a problem processing this search term. We've been notified and will fix the issue within the week.

If you have any questions, please contact matt@leandomainsearch.com or say hi @mhmazur."

I tried two words: "power tube"


Will check it out -- thanks.


There is a lot of good competition in this space.

We also have quite a number of algorithms (we used the whole of Wikipedia's n-grams, among others).

http://nametoolkit.com/


They misdefined "TLD" as the "Amazon" in "Amazon.com" when it actually refers to the "com".

Such glaring ignorance makes it hard to trust the author's domain expertise. Pun intended. Reading the rest of the article confirmed my instincts.


Where do you see flaws in the analysis?


I am the article author.

Thank you to the numerous people who took the time to email me and correct me about a definition. In this article, I refer to the entire root of a domain name, e.g. Amazon.com, as the TLD. I made a mistake; it is just the .com component of this name that is the TLD. I hope this error didn’t mask the enjoyment of the article for you. I appreciate all the feedback I receive.


You're welcome.

Your graph of the frequency of each domain name length is misleading, as it is too easy to interpret it to mean that the most popular lengths are 10 or 11 characters, when in fact shorter names are more popular but in limited supply, since shorter means fewer combinations. You discuss saturation later, but it would be more informative to combine the two pieces of information. For example, on the same graph you could plot a ceiling line representing the total available combinations for each length. It would be obvious that the frequency bars for lengths less than 10 are shorter only because they're bumping up against the ceiling.
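
A minimal sketch of how that ceiling could be computed, assuming the usual rule that a name may use the letters a-z, the digits 0-9, and the hyphen, with no hyphen at either end. Other registry restrictions are ignored, and the function name is hypothetical.

```python
import string

# Assumption: 36 alphanumeric characters, plus the hyphen for interior positions.
ALNUM = len(string.ascii_lowercase + string.digits)  # 36
ALL_CHARS = ALNUM + 1                                # 37, including "-"


def max_names(length):
    """Upper bound on the number of distinct names of a given length."""
    if length < 1:
        raise ValueError("length must be at least 1")
    if length == 1:
        return ALNUM
    # First and last characters cannot be hyphens; interior ones can.
    return ALNUM * ALL_CHARS ** (length - 2) * ALNUM


for n in range(1, 12):
    print(f"{n:>2} characters: {max_names(n):,} possible names")
```

Under these assumptions the ceiling grows by a factor of 37 for every extra character, so plotting it alongside the frequency bars would probably need a log scale.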

Secondly, a lot of experts disagree with you on the importance of having a .com domain name, and many successful sites are on different domains.

Thirdly, what actual utility do the couplet/triplet and start/end character data and graphs provide?

If I were looking for an expert to select a domain name, I would choose someone who understands what matters, not someone buried in inconsequential minutiae.


Tough Crowd, Tough Crowd :)

The graphs aren't misleading (IMHO); they show the distribution of lengths. However you slice it, if you have to type the domain name, you have to push that number of keys. It's far more important to know that than the ratio of the length normalized by the possible combinations of characters at that length. (I did try looking at that as well, but the graphs were almost meaningless: even at log scales, the increase in the number of combinations of characters dwarfs the number of names, and once you get past 10 characters, the percentages become so small that comparing them is meaningless.)

"a lot of experts disagree with you on the importance of having a .com domain name". Well, yes, they can! So go with the wisdom of the masses. It's a free market, and as a company you can select your own domain suffix. Yet most go for .COM because, right or wrong, that's what people expect. (I will agree that, in the end, it's not as important: as I state in the article, many people now find web sites by typing keywords into search engines, and the exact domain is not important.) Ask yourself this question, though: if you are company XYZZY, and there is an XYZZY.COM domain out there that's not yours, would you be just as happy with XYZZY.NET and not worry about it (listening to those experts)? I think not. You want to preserve your brand, avoid confusion, and make it as easy as possible for the masses to find your site (the non-experts who make up the majority of your consumers). It's the tastes of the fish, not the tastes of the fisherman, after all :)

This article was written for fun. The couplet/triplet was generated out of interest and to see common combinations of letters. I find it fascinating, and I'll be happy to explain some business utility of it if you want to send me a personal email.

I'm not trying to sell my services as an expert domain name seller; it's not what I do. I make a living as someone who mines data and helps find 'inconsequential minutiae' in data to leverage (when dealing with hundreds of millions of users, moving the needle just a fraction of a percent can make a difference to a bottom line).

But anyway, the article was created as a trivia/fun article. I'm sorry you don't find it interesting/relevant. (Though again, wisdom of the crowd: since someone posted it here this morning, my inbox/twitter has been alive with comments/retweets about how fun and interesting an article it is - to date, it's been one of the most promiscuous articles I've written.)


Here's another analysis I did a few years ago... how many dictionary words are taken???

Might give you some ideas: http://blog.hotnamelist.com/2009/02/are-all-good-com-domains...


"It’s interesting to note that the distribution differs from the the traditional pattern used in the English lanuage: E,T,A,I,O,N,S,H,R,D,L … Some of this can be explained by the fact that domain names are not just for the consumption of English speaking people. Even though other regions have their own domains, since .com has become the lingua franca, many businesses simply default to .com (For those interested, there is an interesting article on Wikipedia about the differing relative frequencies of letters in other languages)."

That may be part of it, but the author doesn’t acknowledge the likelihood that the letter I is used more frequently because of Apple’s product naming influence, imitation by other companies prepending the letter to their products and services, and the fact that ‘I’ is a strong, powerful pronoun.


The "i" does make a difference, sure, but not as big a one as you might think. You can only put the "i" in front of so many words, and if you look at the initial letter charts, it's not massively dominant there.

Far more important, for instance, are substrings like "FREE" (which can apply to all things, not just computer-related ones), which has a couple of "E"s, or anything containing the "%ING%" substring (which is a very common letter combination in the English language).


I won’t cite any sources, but ETAOINSHRDLU is well-established as a fairly accurate English letter frequency. The point in the article is the frequency found in domain names has a "higher" frequency of I’s and a relatively lower number of T’s, despite ETAOINSHRDLU, but doesn’t really explore why. (-ING endings are already taken into consideration with ETAOINSHRDLU.)

Also, I have to disagree with you; there are thousands of companies that have capitalized on the Apple product ecosystem (iSkin, iLounge, iPodResQ, etc.) and in the commonly associated abbreviation of “Internet” to i. I would say there are many more prefixes with ‘i’ than ‘e’ or any other letter.


Please do cite sources; it's what adds weight to your comments and differentiates them from speculation.

Yes ETAOIN SHRDLU CMFWYP VBGKQJ XZ is an accurate distribution of letters in common English text BUT this is a distribution of letters in written English. As it turns out, however, written English is full of very common little glue words, like THE, OF, AND, A, TO, IN, IS, YOU, THAT, IT ...

One third of all printed English material is made up of the top 25 words, and the most common 100 words account for almost half. Domain names are not typically sentences, and are often just one or two words. Instead, the frequency of occurrence of letters in distinct words should be used. For all distinct words in the English dictionary, this distribution is a little different: ESIARN TOLCDU PMGHBY FVKWZX QJ
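
A toy illustration of the difference between the two ways of counting. The sample sentence is made up and far too small to say anything about English; a real comparison would use a large corpus for the running text and a dictionary word list for the distinct words.

```python
from collections import Counter


def letter_ranking(words):
    """Return the letters in the given words, most frequent first."""
    counts = Counter(ch for word in words for ch in word.lower() if ch.isalpha())
    # Sort by descending count, then alphabetically so ties are stable.
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return "".join(ordered).upper()


# Hypothetical sample text with a repeated glue word ("the").
sample = "the cat and the dog and the bird sat in the sun".split()

running = letter_ranking(sample)        # every occurrence counts
distinct = letter_ranking(set(sample))  # each word counts once

print("running text:  ", running)
print("distinct words:", distinct)
```

In this sample the repeated "the" pushes T, H and E up the running-text ranking, and they fall back once each word is counted only once.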

Already "T" is much further down the list.

Interestingly, this distribution varies by length of word. By the time we get to words of length 13, for instance, the order has changed to IENTS ... (So already the letter "I" is the most common letter for longer words, without any influence from Apple).

You can read about a full analysis of the distribution of letters and see a complete table of letter frequency against word length here: http://www.datagenetics.com/blog/april12012/index.html

There may be "thousands of companies" that have added an "I" to their company name (though it would help your argument if you quoted sources). But even so, this is dwarfed by the 102 million names. Even tens of thousands of new "I" companies is a fraction of a percent change against this denominator.
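
The arithmetic behind that claim, as a quick sketch. The 102 million figure is the one quoted above; the counts of "I" names are hypothetical values chosen to span "tens of thousands".

```python
TOTAL_NAMES = 102_000_000  # figure quoted above


def share_percent(count, total=TOTAL_NAMES):
    """Percentage of all names that a given count represents."""
    return 100.0 * count / total


# Hypothetical counts of "I"-prefixed names.
for hypothetical in (10_000, 50_000, 90_000):
    print(f"{hypothetical:>6,} names -> {share_percent(hypothetical):.3f}% of all names")
```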

There are millions of companies/organisations that have domain names, and not all are tech related.

I'm happy to continue debating and will gladly run any queries you suggest using the entire domain name database and the English language database to generate numbers.


I don't know if it is fair to say that a period is part of the domain name. It just separates out the subdomain. news.ycombinator is the subdomain 'news' on the ycombinator name, whereas dashes have no actual information pertaining to them.


I'll not agree or disagree :)

I simply processed the file provided by the good folks at Verisign, and used however they classified things.


Random fun fact: Sanford Wallace once crashed AOL's mail server by sending mail that was allegedly from "howmuchwoodcouldawoodchuckchuckifawoodchuckcouldchuckwood.com".


Having access to a database of domain names, I decided to run some more analysis on the .com and .net databases.

Anyone know of such a database that is also public?


If you sign a legal agreement with Verisign and have a legitimate need, they will grant access. Good Luck.



Heh, I pioneered this space - http://blog.yafla.com/Interesting_Facts_About_Domain_Names

I accidentally discovered that any chimp could sign up to receive the database, did that basic analysis, and have watched as it rinses and repeats every six months or so.


Cool!


I'd love to compare the domain metrics with the English language as a whole, or with a popular encyclopedia


A couple months old, but there's some really interesting data analysis in here.



