Friday, August 29, 2008

Turning Japanese: EUC-JP, UTF-8, and percent-encoding

In case I forget how to do this, and as an example of how easy it is to get sucked into a black hole of programming micro-details, I spent an hour or more trying to figure out how to handle Japanese characters.

I'm building a database of publications linked to taxonomic names, and I'm interested in linking to electronic versions of those publications. CrossRef and JSTOR provide a lot of references, as does BHL (once they get an OpenURL resolver in place), but there are numerous other sources to be harvested. One is CiNii, the Japanese National Institute of Informatics Scholarly and Academic Information Navigator, which has an OpenURL resolver. For example, I can query CiNii for an article using this URL:
http://ci.nii.ac.jp/openurl/query?ctx_ver=Z39.88-2004&url_ver=Z39.88-2004&ctx_enc=info%3aofi%2fenc%3aUTF-8&rft.date=2003&rft.volume=58&rft.spage=1&rft.epage=6&rft.jtitle=Entomological%20Review%20of%20Japan.
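
As a rough illustration of how such a query URL is put together, here is a minimal Python sketch (Python rather than the PHP I actually use). The parameter names and values are copied from the URL above; the function name cinii_openurl is hypothetical, and I'm assuming the resolver doesn't care whether the hex digits in the percent-escapes are upper or lower case.

```python
from urllib.parse import urlencode, quote

# Base URL and parameter names are taken from the example URL above.
CINII_OPENURL = "http://ci.nii.ac.jp/openurl/query"

def cinii_openurl(jtitle, year, volume, spage, epage):
    """Build a CiNii OpenURL query for one article (hypothetical helper)."""
    params = [
        ("ctx_ver", "Z39.88-2004"),
        ("url_ver", "Z39.88-2004"),
        ("ctx_enc", "info:ofi/enc:UTF-8"),
        ("rft.date", year),
        ("rft.volume", volume),
        ("rft.spage", spage),
        ("rft.epage", epage),
        ("rft.jtitle", jtitle),
    ]
    # quote_via=quote writes spaces as %20 rather than "+";
    # safe="" makes sure ":" and "/" are percent-encoded too.
    return CINII_OPENURL + "?" + urlencode(params, quote_via=quote, safe="")

url = cinii_openurl("Entomological Review of Japan", 2003, 58, 1, 6)
print(url)
```

Building the query from a list of pairs keeps the parameters in a fixed order and leaves all the escaping to the library.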

If I want to harvest bibliographic metadata, I can parse the resulting HTML. I could follow the links to formats such as BibTeX, but there's enough information in the link itself. For example, there's a link to the BibTeX format that looks like this:

http://ci.nii.ac.jp/openurl/servlet/createData?type=bib
&ca=@article
&au=%B7%A6%CC%DA+%B4%B4%C9%D7
&title=%A5%AB%A5%DF%A5%AD%A5%EA%A5%E0%A5%B7%B2%CAPidonia%C2%B0%A4%CE%BF%B7%B0%A1%C2%B0%A4%CB%A4%C4%A4%A4%A4%C6
&jtitle=%BA%AB%EA%B5%D5%DC%C9%BE%CF%C0+%3D+The+entomological+review+of+Japan
&year=20030430
&vol=00058
&num=00001
&spage=1-6
&id=10011061577
&lang=jp
&issn=02869810
&publish=%C6%FC%CB%DC%B9%C3%C3%EE%B3%D8%B2%F1
&perm_link=http%3A%2F%2Fci.nii.ac.jp%2Fnaid%2F10011061577%2F
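
Before worrying about character encodings, the link can be split into its fields. The following Python sketch (the helper name raw_fields is hypothetical, and the link is a shortened copy of the one above) keeps each value as raw bytes rather than guessing an encoding, which makes it easy to see that the author field is not valid UTF-8.

```python
from urllib.parse import urlsplit, unquote_to_bytes

# A shortened copy of the link above (some fields omitted for brevity).
link = ("http://ci.nii.ac.jp/openurl/servlet/createData?type=bib"
        "&ca=@article"
        "&au=%B7%A6%CC%DA+%B4%B4%C9%D7"
        "&year=20030430&vol=00058&num=00001&spage=1-6"
        "&id=10011061577&issn=02869810")

def raw_fields(url):
    """Return {name: bytes} for each query field, without assuming an encoding."""
    fields = {}
    for pair in urlsplit(url).query.split("&"):
        name, _, value = pair.partition("=")
        # "+" stands for a space in form-style query strings.
        fields[name] = unquote_to_bytes(value.replace("+", " "))
    return fields

fields = raw_fields(link)
print(fields["id"])        # b'10011061577'
print(fields["au"].hex())  # b7a6ccda20b4b4c9d7

try:
    fields["au"].decode("utf-8")
except UnicodeDecodeError:
    print("au is not UTF-8")
```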

Note the percent-encoded fields, such as %B7%A6%CC%DA+%B4%B4%C9%D7. This string represents the author's name, 窪木 幹夫. It took me a little while to figure out how to convert %B7%A6%CC%DA+%B4%B4%C9%D7 to 窪木 幹夫. Eventually I discovered this table, which shows that there are a number of ways to represent Japanese characters, including JIS, SJIS, and EUC-JP. Since the last two bytes of the string, C9 D7, are the EUC-JP code for 夫, the string is EUC-JP encoded. What I want is UTF-8. After some fussing, it turns out that all I need to do (in PHP) is:

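// Undo the percent-encoding; the result is a raw EUC-JP byte string
$decoded_str = rawurldecode($str);
// Pure ASCII values (years, page numbers) need no conversion;
// anything else is assumed to be EUC-JP
if (mb_detect_encoding($decoded_str) != 'ASCII')
{
    $decoded_str = mb_convert_encoding($decoded_str, 'UTF-8', 'EUC-JP');
}

rawurldecode decodes the percent-encoding to EUC-JP, then mb_convert_encoding gives me UTF-8. Note that rawurldecode leaves the + between the two parts of the name untouched; urldecode would also turn it into a space.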
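
For comparison, here is a minimal Python sketch of the same two steps. The function name decode_cinii_field is hypothetical, and the sketch assumes (as the PHP does) that every non-ASCII field from CiNii is EUC-JP. Python strings are already Unicode, so no separate ASCII check is needed; encoding to UTF-8 only matters when the text is written out.

```python
from urllib.parse import unquote_plus

def decode_cinii_field(value, source_encoding="euc_jp"):
    """Percent-decode one field and return it as a Unicode string."""
    # unquote_plus turns "+" into a space and interprets the %XX bytes
    # in the given encoding; errors="strict" raises if they are not valid.
    return unquote_plus(value, encoding=source_encoding, errors="strict")

author = decode_cinii_field("%B7%A6%CC%DA+%B4%B4%C9%D7")
print(author)                  # 窪木 幹夫
print(author.encode("utf-8"))  # the same name as UTF-8 bytes

# Sanity check on the encoding guess: 夫 is C9 D7 in EUC-JP.
print("夫".encode("euc_jp").hex())  # c9d7
```

Pure ASCII fields such as 1-6 pass through unchanged, because ASCII is a subset of EUC-JP.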
As an example, here is the above reference displayed by the bioGUID OpenURL resolver. A small victory, but it is nice to display the Japanese title. The English title of this article is "A New Subgenus of the Genus Pidonia MULSANT (Coleoptera: Cerambycidae)". It's perhaps the major triumph of Linnean taxonomy that even though I can't read a word of Japanese, I know the paper is about Pidonia.
