Front-end: support punycoded diaspora IDs

comradesenya · September 6, 2017, 2:26pm

There is a character encoding called Punycode which is used for non-Latin domains (e.g. Cyrillic, or any other Unicode domain). Nothing prevents people from creating pods on such domains. In fact, some pods already use Unicode domain names (see: https://свидетеливилли.рф/).

It works totally fine, but our UI renders the domain in its fallback form, xn--..., instead of its Unicode representation. So it would be great to fix the UI to show these domains in their pretty Unicode representation.
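To illustrate the rendering side, here is a minimal sketch in Python using the standard library's idna codec (which follows the older IDNA 2003 rules). The function name display_domain is hypothetical and not part of diaspora; the real frontend would need an equivalent in JavaScript.

```python
def display_domain(ascii_domain: str) -> str:
    """Turn the xn-- labels of an ASCII domain into Unicode for display.

    Assumption: the stored domain is plain ASCII, as it would be in DNS.
    Labels without the xn-- prefix pass through unchanged.
    """
    return ascii_domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    print(display_domain("xn--bcher-kva.example"))  # bücher.example
    print(display_domain("pod.example.com"))        # pod.example.com
```

Only the displayed string changes here; the stored value stays in its ASCII form.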

(image: example)

But it raises a few questions.

If we support punycoded domains, then why don’t we support punycoded usernames? If we have to convert punycode to a pretty look anyway, then why not convert usernames at the same time? IMO this feature makes diaspora more friendly for people who write in non-Latin scripts, and I think many people would love to have non-Latin diaspora IDs.

We currently support hyphens as part of user IDs, though our sign-up page doesn’t allow entering them. So this is purely a frontend-level issue.
We have to do the same conversion on user input, not only at rendering. If a user sees микола@діаспора.укр as a diaspora ID, they will type it into the search input and use it as a mention ID. So we have to make our frontend punycode the ID before posting it to the backend.
Nobody has every keyboard layout available, so if someone has a Unicode diaspora ID, it will be hard to type for people who don’t have these characters available (e.g. when typing a mention). So possibly we need to make it configurable, e.g. have a little button somewhere on the current page which disables punycode rendering and shows the fallback instead.
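The input-side conversion described above could look roughly like this. It is only a sketch in Python using the standard library's idna codec (IDNA 2003 rules); encode_id and decode_id are hypothetical names, and encoding the username with the same xn-- scheme as the domain is an assumption, since nothing in diaspora defines that yet.

```python
def encode_id(diaspora_id: str) -> str:
    """Convert a user-entered ID like user@domain to its ASCII form."""
    user, sep, domain = diaspora_id.rpartition("@")
    if not (sep and user and domain):
        raise ValueError(f"not a diaspora ID: {diaspora_id!r}")
    # Assumption: usernames reuse the per-label xn-- encoding of domains.
    ascii_user = user.encode("idna").decode("ascii")
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"{ascii_user}@{ascii_domain}"


def decode_id(ascii_id: str) -> str:
    """Convert the ASCII form back to Unicode for rendering."""
    user, _, domain = ascii_id.rpartition("@")
    return f"{user.encode('ascii').decode('idna')}@{domain.encode('ascii').decode('idna')}"


if __name__ == "__main__":
    encoded = encode_id("микола@діаспора.укр")
    print(encoded)             # ASCII only, every label starts with xn--
    print(decode_id(encoded))  # back to the Unicode form
```

With a pair like this, search and mention inputs would accept either form, and the backend would only ever see the ASCII one.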

jhass · September 7, 2017, 6:28am

Punycode normalization over the entire ID on render and user input would be awesome. However, the federation and database should probably enforce the encoded variant.

I don’t think we need a toggle. First, most users will copy and paste most of the time anyway. Second, users who feel strongly enough about their native language to use its script in their username usually don’t post content in another language, IME. So I think it’s a really rare case that you would manually search for such an ID while it’s not in your main language either.

Little nitpick: Latin-1 totally has characters that would get punycoded, such as all the French and German ones, for example. DNS is restricted to 8-bit encodings in theory and 7-bit ASCII in practice.

denschub · September 8, 2017, 1:33am

I mostly agree with you here, but when implementing, UTR 36 and UTR 39 should be taken into consideration. I don’t want to allow vectors for people registering pod.example.com and pod.exampӀe.com (monospace font for added visual differences) or similar stuff just to trick people into misidentifying profiles…
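To make the spoofing concern concrete, here is a rough mixed-script check in Python. It is a crude heuristic, not an implementation of UTR 36 or UTR 39: it guesses a character's script from the first word of its Unicode name, and both function names are hypothetical.

```python
import unicodedata


def scripts_in_label(label: str) -> set[str]:
    """Collect a rough script tag for every letter in one domain label."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # Heuristic: Unicode names start with the script name,
            # e.g. "LATIN SMALL LETTER P" or "CYRILLIC LETTER PALOCHKA".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts


def mixed_script_labels(domain: str) -> list[str]:
    """Return the labels that mix letters from more than one script."""
    return [lab for lab in domain.split(".") if len(scripts_in_label(lab)) > 1]


if __name__ == "__main__":
    spoof = "pod.examp\u04c0e.com"  # U+04C0 is CYRILLIC LETTER PALOCHKA
    print(mixed_script_labels(spoof))              # flags the middle label
    print(mixed_script_labels("pod.example.com"))  # []
```

A check like this flags a single foreign letter inside an otherwise Latin label, but it cannot catch a label written entirely in one lookalike script.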

spixi · September 19, 2017, 8:54pm

Hi @denschub,

The GNU library libidn already considers UTR 36 and UTR 39. Just check it out with the command line tool idn or idn -d (for decoding). And there is already a ruby-gem for it called idn-ruby.

For example, ligatures in words like ĳsselmeer are normalized to ijsselmeer. But there are also some issues left, e.g. https://ѕсоре/ is decoded to https://ѕсоре/ (Cyrillic letters), which is confusable with https://scope/ (Latin letters). I noticed that even Firefox 53.0.3 has the same issue.
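Both behaviours can be reproduced without libidn. The sketch below uses Python's standard idna codec, which applies the IDNA 2003 nameprep step (including NFKC normalization); it is meant as an illustration of the same effect, not as a description of what libidn does internally.

```python
import unicodedata

ligature = "\u0133sselmeer"  # starts with U+0133 LATIN SMALL LIGATURE IJ
lookalike = "\u0455\u0441\u043e\u0440\u0435"  # all Cyrillic, resembles "scope"

# NFKC folds the ligature into the two plain letters "ij".
print(unicodedata.normalize("NFKC", ligature))

# The idna codec normalizes first, so the result is a plain ASCII label.
print(ligature.encode("idna"))

# The all-Cyrillic label is valid: it gets an xn-- form and decodes
# straight back to the confusable Cyrillic spelling.
encoded = lookalike.encode("idna")
print(encoded, encoded.decode("idna") == lookalike)
```

Normalization removes the ligature, but a label written entirely in Cyrillic survives untouched, so it needs a separate confusables check.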

Sincerely,

spixi