Parsing the Infamous Japanese Postal CSV (dampfkraft.com)
168 points by polm23 on Nov 8, 2020 | 40 comments



Props to the author for putting in the hard yards here. I used to work with a major geocoding service, and Japanese addresses were a nightmare, both in the original Japanese and when translated to English. In many places in Japan (Kyushu, Okinawa and Hokkaido are particularly bad), the kanji readings of names are completely impenetrable even to native Japanese speakers:

https://soranews24.com/2016/12/01/w-t-f-japan-top-5-most-ins...

And then you get to deal with this kind of extra weirdness:

https://en.wikipedia.org/wiki/Japanese_addressing_system#Spe...


A couple of "favorite" exceptions not on Wikipedia:

House numbers are usually of the form 1-2-3. Places with just two numbers are common; they haven't caught up with the "new" law from 1962 yet.

Numbers are often written without suffixes, but in most places with three numbers the suffixes are 丁目-番-号. In one city - Wakasa in Fukui - the last two are reversed. I have never been able to find any mention of this at all, let alone why it would be the case.

Some cities are laid out in a clock pattern with slices named after animals of the Chinese Zodiac.

Neighborhoods are typically divided into blocks with numbers (or the number is omitted if there's just one block). But as addresses change over time, sometimes just one block goes away, so there are many places where block 1 no longer exists or there are gaps in the numbering.

On Yahoo Answers I once saw a person who was confused after their local post office adamantly told them one of their two or three numbers had a hyphen in it (so it was like 1-(2-3)) because it represented two joined lots, but that shouldn't be possible.

Mail is always complicated... I worked at a mail store in high school in the States and remember shipping some weird stuff.


> Places with just two numbers are common

Hoho, I've got a fun one. My house is TOWN 632-3 but not 丁目 番 号 anything. There is a 6丁目32番3号 in our city which is _not_ my house. This led one night to me driving across town to pick up a lost friend.

Our house was in a development built soon after the war, which for ages was surrounded by nothing but rice paddies. We've had a couple of online shops assume we're missing portions of our address. Overall though I like having only two numbers. Makes me feel like I own a whole town block.


Sounds like you have a chiban address: https://en.wikipedia.org/wiki/Japanese_addressing_system


Oh, indeed you are 100% right. Thank you for that link, it explains so much.


Oh, that sounds awful. I have a C in my address and sometimes it's fine and sometimes it's a whole thing explaining that it's not 4, it's C.

That said, two-number addresses are pretty common, so I'm surprised you run into trouble with that. I live near Tokyo Tower, and even in the heart of Tokyo Minato-ku is only 99.XX% three-digit addresses. Azabu Nagasaka and Azabu Mamiana are tiny neighborhoods that are both still two digits (and Mamiana's name is another weird story...).


Wow this is something else. I thought we had it bad in Scotland, but this sounds worse.

In Scotland there are quite a few places where the council use a different tenement flat numbering scheme from the post office. Edinburgh council for example https://www.edinburgh.gov.uk/downloads/file/24357/statutory-....

> It is recognised that Edinburgh has a unique character which also translates into the flat numbering systems used. Edinburgh has two main flat numbering systems in operation; the traditional tenement numbering system e.g. GF1,1F1 and the modern flat numbering conventions e.g. Flat 1, Flat 2. Where development takes place within properties with the traditional tenement numbering, this numbering system will be retained. New development will be allocated the modern flat numbering convention. Properties in common stairs must be allocated a main street number. Numbers are then allocated internally to each flat for example, Flat 1, Flat 2. For the traditional tenement numbering system, flats are allocated numbers in the form 1F1, 1F2, etc. 1F1 should be interpreted as 1st Floor, Flat 1. The rotation of the internal numbers follows the rotation of the staircase, with the highest number being located at the door furthest from the last riser on the stair.

But the post office (by default) uses a simple numbering scheme, which is the de facto standard even though the "legal" address is the one set by the council. Absolute nightmare.


I suspect similar situations arise, on a smaller scale, all over the place. I live in a house (29 Acacia Road) divided into three flats, known as Flat 1 29 Acacia Road, Flat 2 29 Acacia Road, and Flat 3 29 Acacia Road to the council, but known as 29 Acacia Road, 29A Acacia Road, and 29B Acacia Road to the Royal Mail. And as FLAT GND FLR 29 ACACIA ROAD, FLAT 1ST FLR 29 ACACIA ROAD, FLAT 2ND FLR 29 ACACIA ROAD to the Valuation Office Agency. Until very recently, they were known as something else to the Land Registry!


Argh, I live in a house that's been converted into two flats. One system calls us (to borrow your example) 29 and 29A, another calls us 29A and 29B, and assorted outliers call mine Downstairs, Ground Floor, Grannyflat ..

It must be worse for upstairs, who are simultaneously 29A and 29B. I'm only sometimes 29A but never 29B.

We also have three postcodes, but we're not sure which belongs to who.

Luckily the two flats / six addresses all share the same letterbox.


To add to that, the "simple" numbering scheme is usually denoted by e.g. "9/8" - Building 9, Flat 8 (which depending on the building layout might be 9 3f2 or 9 2f4). Infuriatingly, many postcode systems will autofill the "9/8" format into an immutable address field, and then reject the address because it has a forward slash in it...


Lothian Valuation Joint Board seem to have two official postcodes for my place. I've been sent post from them where either has been used.


The author of your article:

https://soranews24.com/2016/12/01/w-t-f-japan-top-5-most-ins...

seems to be unaware of the existence of jukujikun:

https://en.wikipedia.org/wiki/Jukujikun

They don't seem to understand how kanji work in general (kanji are symbols, not phonemes; they're more like numerals or emoji than letters). It's as if the article were saying:

"If 1 is pronounced 'one', why is 11 pronounced 'eleven'? My dictionary doesn't list 'ele' OR 'ven' as a valid pronunciation of 1!"

While jukujikun are really common in names (including place names and person names), they're not specific to names. For instance, 啄木鳥 (kera) meaning "woodpecker" is more kanji than syllables, and 今日 (kyou) meaning "today" is also jukujikun.

Jukujikun need to be learned like any other word (you're not going to be able to read it correctly the first time you see it, but that goes for any other kanji reading you've never seen before), but they would not surprise a native Japanese speaker ("today" is not a rare word in Japanese).


Pretty sure Japan's addressing system is also responsible for several kanji in Unicode where nobody is quite sure how they got there.


Some of the ghost characters do appear to originate from place names, though that doesn't really have anything to do with the addressing system or any postal authority. Of the ones I checked more closely, like 膤, they tend to refer to things that wouldn't necessarily show in an address, like a hill or something.

https://www.wdic.org/w/CUL/%E8%86%A4 https://www.dampfkraft.com/ghost-characters.html


>> An example of another comment is 一円. Normally this would mean "one yen", but it also means "the area surrounding", and is a note in the CSV that should be removed from neighborhood names, except for exactly one neighborhood in Shiga where that's actually the name (〒522-0317).

Reminds me of the town called Street.

https://www.google.co.uk/maps?q=Street,+Somerset

There's more great examples of Japanese addressing quirks in Falsehoods Programmers Believe About Addresses.

https://www.mjt.me.uk/posts/falsehoods-programmers-believe-a...
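
For anyone wondering what the 一円 rule quoted above looks like in practice, here is a minimal sketch. It assumes the note shows up as a suffix on the neighborhood field and that the one real neighborhood is stored as exactly 一円; the function name and the sample value 例町 are made up for illustration and are not taken from Posuto or the CSV.

```python
# Hypothetical helper, not Posuto's actual code.
ICHIEN = "一円"

def strip_ichien(neighborhood: str) -> str:
    """Drop a trailing 一円 note, unless 一円 is the whole name."""
    if neighborhood == ICHIEN:
        # Assumption: the real neighborhood in Shiga (〒522-0317)
        # appears as exactly this string, so it is kept as-is.
        return neighborhood
    if neighborhood.endswith(ICHIEN):
        return neighborhood[: -len(ICHIEN)]
    return neighborhood

# 例町 is a made-up placeholder name, not a row from the CSV.
for sample in ["一円", "例町一円", "例町"]:
    print(sample, "->", strip_ichien(sample))
```

Checking for the exact-match case first is what keeps the single exception from being reduced to an empty string.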


円 is also the kanji for "round" and "circle"


Jesus Christ. What a nightmare! Glad to see you've done all the work. By the way, why did you choose to have invalid identifiers in the project README example and then say the example will work if you rename them? Surely it would be better if you just renamed them.

It is a bit whimsical, though, so I understand marking Tokyo Tower by its Unicode emoji (what a thing to be in Unicode) for amusement.


I originally wrote that bit of code for a code screenshot, and actually thought it would work since I heard Python had added support for Unicode variable names. Shortly afterwards I realized it was invalid, but people seemed to like it so I kept it around. I probably should change it so people have something to copy/paste though...


Python does support Unicode variable names, but only for characters which represent written languages: https://python-3-for-scientists.readthedocs.io/en/latest/pyt...

Specifics here: https://www.python.org/dev/peps/pep-3131/#specification-of-l...

So no emojis alas.
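
A quick way to check this yourself is `str.isidentifier()`, which applies the same identifier rules the parser uses. The wrapper name below is just for illustration.

```python
def can_be_identifier(name: str) -> bool:
    """Return True if `name` is a legal Python identifier per PEP 3131."""
    return name.isidentifier()

print(can_be_identifier("tokyo_tower"))  # True
print(can_be_identifier("東京"))          # True: kanji count as letters
print(can_be_identifier("🗼"))            # False: emoji are symbols, not letters
```

Note that `isidentifier()` does not check for reserved keywords; `keyword.iskeyword()` covers that separately.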


I recently bought a house here (in Japan), and our new address is

[something] 3-45-6

But apparently, the same is true for all the other people that bought their house on the piece of land where two old houses were torn down to make place for 6 new ones (parcels and houses).

I cannot wrap my head around the fact that they have a whole new building and land record, but it doesn’t have a unique address.

Uber eats is seriously confused about our address, and is guaranteed to send someone to the other end of the street instead.


I remember that a few years ago I tried to parse this monster of a CSV file to extract as many area names as possible. I was trying to create a new Japanese town name list for OpenTTD, since the existing one only has ~600 names.

Since I was just scraping for names, I just ignored all the lines that are hard to parse. Kudos to the author for actually parsing it properly.


This sounds like the flip side of Postel's law.

"Be conservative in what you do, be liberal in what you accept from others"

https://en.wikipedia.org/wiki/Robustness_principle

And here we have a hero cutting through Japanese red tape like Ikiru


My address also always completes to "XXX-borough (except the following buildings)." I hope one day more sites migrate to Posuto on the backend to clean up the mess.


Fun fact: a number of years back I wrote a bash one-liner that attempted to parse this beast. Sadly it wasn't 100% accurate, but it was pretty darn close!

This thing truly is a gem.



This has a lot of similarities to the infamous ULS database published by the FCC, though this actually looks a little bit easier to work with. The ULS database doesn't even follow a CSV-like format, because despite all of its variations, one of the two newline forms is ultimately what separates records in CSV, and with the ULS, you can't make that assumption (unescaped newline characters are valid values). Nothing less than a customized parser will suffice.



My impression in Japan was that people use phone numbers to identify destinations. Translating phone-numbers into GPS-coordinates seemed to be common.


I had no idea what you were talking about, but looking it up it does look like it's possible to use phone numbers in navigation systems, primarily numbers of businesses. I was able to find several accounts of the feature using the keywords カーナビ 電話番号検索 - using just ナビ gets a lot of hits for directories of phone numbers (mainly used for identifying spam callers), and using GPS instead of ナビ gets hits for tracking cell phones.

I have never used this and didn't realize it was possible, but I don't drive and rarely take taxis. I can see why this would be popular, especially with tourists - the language isn't a barrier to conveying a number. It also avoids issues with address formatting.


I've never heard of that. Is it a recent thing (past five years) or a really old thing?


It's still common to enter a phone number as the destination in car navigation systems because their Japanese input is not good. A phone number is the easiest option, but it sometimes points to the wrong position if you use older maps. We don't use it in smartphone map apps.


~2 years ago. Navigation of rental cars worked best with phone numbers. Some cab drivers also preferred phone numbers. Most likely some service to lookup phone-number->GPS became popular - I haven't seen this outside Japan so far.


Off topic, but I got curious about the Kopyleft mention at the bottom of the article. I was familiar with the notion of Copyleft, but not with the variant with a "K", so I googled it and found this article: https://en.wikipedia.org/wiki/Wikipedia:Anti-Wikipedianism

> One of the main sources of Anti-Wikipedianism is the radical far copyLeft (also referred to as Kopyleft, with the Communist K, i.e. Das Kapital). They expect Wikipedia to forbid content from being used for commercial purposes.

But the article says "Do as you like." So I am a bit confused. Can we really do "as we like", or can we use this content for commercial purposes?


The Kopyleft thing started in the 60s with the Principia Discordia. Same with the "do as you like." It has nothing to do with Wikipedia.


This sounds like a good task for ragel http://www.colm.net/open-source/ragel/


Why?


Ragel (or, more generally, finite state machines) excels at parsing streams with all sorts of nuances while still being rigid and bailing on invalid input. Considering that ken_all.csv isn't really standard CSV anymore, it's better to write a tailored parser for it than to approach it as a typical CSV file.
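
To give a rough idea of what "rigid and bailing on invalid input" means, here is a hand-written state machine in Python rather than Ragel. It only splits a field like "XXX-borough (except the following buildings)" into a name and a note; the function name, the states, and the rule that only one non-nested note is allowed are all assumptions for illustration, not the actual grammar of ken_all.csv.

```python
from enum import Enum, auto

class State(Enum):
    NAME = auto()  # reading the neighborhood name
    NOTE = auto()  # inside the parenthesized note
    DONE = auto()  # note closed, nothing else may follow

OPEN, CLOSE = "(（", ")）"  # accept ASCII and full-width parentheses

def split_note(field: str) -> tuple[str, str]:
    """Split 'name(note)' into (name, note), rejecting malformed input."""
    state = State.NAME
    name, note = [], []
    for ch in field:
        if state is State.NAME:
            if ch in OPEN:
                state = State.NOTE
            elif ch in CLOSE:
                raise ValueError("closing parenthesis before an opening one")
            else:
                name.append(ch)
        elif state is State.NOTE:
            if ch in CLOSE:
                state = State.DONE
            elif ch in OPEN:
                raise ValueError("nested parenthesis")
            else:
                note.append(ch)
        else:  # State.DONE
            raise ValueError("text after the closing parenthesis")
    if state is State.NOTE:
        raise ValueError("unclosed parenthesis")
    return "".join(name).strip(), "".join(note)

print(split_note("XXX-borough (except the following buildings)"))
```

A generator like Ragel would produce the transition table from a grammar instead of having it written out by hand, but the idea of explicit states and hard failures on unexpected input is the same.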


I thought you meant using Ragel would be better than what Posuto (https://github.com/polm/posuto) is currently doing.


Nah, didn't imply that. Just that ragel would be suitable to generate the parser for this kind of data.


coming from a non-tech and Japanese background, can someone ELI5?



