EBCDIC Is Incompatible with GDPR (2021)

Semaphor 2 days ago

Thought that sounded familiar. Needs (2021)

https://news.ycombinator.com/item?id=38009963 2023, 467 comments

https://news.ycombinator.com/item?id=28986735 at the time in 2021, 267 comments

thyristan 2 days ago

The article is from 2021, the original court decision it is writing about is from 2019. Old news.

The machine-readable parts of government-issued passports do not adhere to that ruling either, nor do many government IT systems in Europe. The fallout from that ruling has been underwhelming so far.

cjs_ac 2 days ago

It might be old news, but it's a good reminder that reality has a surprising degree of complexity, and that software engineers have a responsibility to properly represent and handle that complexity.
pepa65 10 hours ago

Makes you wonder if that is one of the reasons they like to spell people's names in all uppercase. (Under the rules of some languages, the accents etc. "fall away".)
JdeBP 2 days ago

And the clickbait title isn't actually the conclusion, as the page cited by this one as its source even explains.

robertlagrant 2 days ago

> But, a decade after the seminal Falsehoods Programmers Believe About Names essay - we shouldn't tolerate these sorts of flaws.

It's nothing to do with that. It's the decades of work by the Unicode Consortium that makes this possible.

oaiey 2 days ago

It is funny how my mind tricks me into searching for privacy problems, but the reality here is a violation of the right to correct your data (which is a privacy problem, but not something you would immediately think of).

jwr 2 days ago

The GDPR is often misunderstood and/or misrepresented — it's mostly about giving you agency over your personal data.
Which (I happen to think) is a very good idea.

SpaceL10n a day ago

For those who don't know...

EBCDIC can be pronounced as ebb-sid-ick in conversation

pepa65 10 hours ago

We always said Ebkdik, but that was in Dutch conversation (the first K is barely audible further into the conversation, so more like Ebdik...).
ddmf 15 hours ago

I pronounce it ebb-ka-dick kinda similar to liebe dich.

nuc1e0n 2 days ago

Yes, but not because EBCDIC is not ASCII-based. All 8-bit character encodings are incompatible with GDPR, because they cannot represent everyone's names. There's an extended EBCDIC that supports the full Unicode range that is GDPR compliant. That said, UTF-8 is still a better choice now.

lifthrasiir 2 days ago

> There's an extended EBCDIC that supports the full Unicode range that is GDPR compliant.
If anyone is interested, that's UTF-EBCDIC [1]. In reality even IBM itself didn't use that encoding though.
[1] https://www.unicode.org/reports/tr16/tr16-8.html
- le-mark a day ago
  
  I think the parent is referring to IBM's DBCS (double-byte character set), which is more like a weird UTF-16.
  - lifthrasiir a day ago
    
    AFAIK they are not a single unified character set (e.g. IBM Korean DBCS-PC is an extension to EUC-KR), so none of them can support the full Unicode range.
rlpb 2 days ago

Here in the UK, there is no "legal name" AIUI - just what I am known as that then ends up in my official government documents such as my passport and my driving licence. If I make up a symbol to represent myself, then where are we with GDPR compliance if I demand that symbol be on my official documents?
It seems reasonable to limit compliance to some specific character set. Which character set should it be? Just one that can accurately encode all "official" languages in the region?
- jeroenhd 2 days ago
  
  There's a reason the court bothered in this case. The bank (ING if I recall) in question has been promising to fix these issues for years because someone decided they could "just" migrate to a new system and all the legacy problems would be a thing of the past. Deadlines were missed, repeatedly, and the whole process has been a disaster outside of the name thing.
  Furthermore, this happened in Belgium, a country with at least three official languages and with enough friction between different groups that there is a legal requirement for law enforcement to talk to you in your own language (i.e. Dutch in the Francophone area and French in the Dutch-speaking area).
  Also, I think GDPRhub has the most apt take of the whole situation:
  > A correctly functioning banking institution may be expected to have computing systems that meet current standards, including the right to correct spelling of people's names.
  Honestly, it's ridiculous that a bank can even operate in a country without being able to store common names. The banking system isn't from the 70s either, it was deployed in the mid nineties, two years after UTF-8 came out, and six years after UCS-2 came out.
  If I start a bank in the UK and my system can't render the letter "f", I expect someone to speak up and declare how ridiculous that is. This is no different.
  - Symbiote a day ago
    
    > If I start a bank in the UK and my system can't render the letter "f",
    I wonder how many British banks can support a name like Llŷr, which is borne by several notable living people:
    https://en.wikipedia.org/wiki/Ll%C5%B7r_(given_name)
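A quick way to see the problem for a name like this is to try a round trip through a few codecs. This is only a sketch: `cp500` is one of the EBCDIC code pages that ships with Python and is used here as a stand-in, since the thread does not say which code page the bank actually ran. The helper name `representable` and the sample names are made up for illustration.

```python
# Sketch: can a name survive an encode/decode round trip in a given codec?
# "cp500" is an EBCDIC code page bundled with Python; it is an assumed
# stand-in, not necessarily what any real bank uses.

def representable(name: str, codec: str) -> bool:
    """Return True if `name` can be stored losslessly in `codec`."""
    try:
        return name.encode(codec).decode(codec) == name
    except UnicodeEncodeError:
        return False

# Escapes are used so the source stays ASCII: \u0177 is y-circumflex,
# \u00eb is e-diaeresis.
NAMES = ["Ll\u0177r", "Zo\u00eb", "Smith"]
CODECS = ["cp500", "latin-1", "utf-8"]

table = {name: {codec: representable(name, codec) for codec in CODECS}
         for name in NAMES}

for name, row in table.items():
    print(name, row)
```

Under these assumptions the 8-bit code pages handle the Latin-1 repertoire but reject U+0177, while UTF-8 stores all three names.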
- nuc1e0n 2 days ago
  
  What the 'reasonable' character set for names should be is more a philosophical question than a technical one. I think the unicode set of characters fully contains it. But would someone declaring their name is a unicode emoji be 'reasonable'? I think it would not be. Perhaps only certain script ranges within unicode?
  - zokier 2 days ago
    
    Plenty of countries have relatively strict laws about legal names.
    https://en.wikipedia.org/wiki/Naming_law
    Furthermore there is a standardized subset of Unicode codepoints which is intended to encompass all the legal names in Europe:
    https://en.wikipedia.org/wiki/DIN_91379
    > normative subset of Unicode Latin characters, sequences of base characters and diacritic signs, and special characters for use in names of persons, legal entities, products, addresses etc
  - Piskvorrr 2 days ago
    
    Once you start exhaustively naming "these are the Only Blessed Ranges," you'll be bitten by the usual brouhaha "email address ends with .[a-z]{2,3}". We all know how it went, and ".[a-z]{2,4}" didn't cut it, either, not even in 2000.
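To make the allowlist failure concrete, here is a small sketch of the two patterns quoted above, with the dot escaped and an end anchor added (both are my additions, not part of the quoted patterns). The addresses are hypothetical; `.info` and `.museum` are real top-level domains.

```python
import re

# The two "blessed range" patterns from the comment, anchored at the end
# of the address. The escaping and the "$" anchor are assumptions about
# how such a check would typically have been written.
TLD_2_3 = re.compile(r"\.[a-z]{2,3}$")
TLD_2_4 = re.compile(r"\.[a-z]{2,4}$")

# Hypothetical addresses on real TLDs of increasing length.
ADDRESSES = ["alice@example.com", "bob@example.info", "carol@example.museum"]

def accepted(pattern: re.Pattern, address: str) -> bool:
    """True if the address ends in a TLD the pattern allows."""
    return pattern.search(address) is not None

results = {a: (accepted(TLD_2_3, a), accepted(TLD_2_4, a)) for a in ADDRESSES}

for address, (short, longer) in results.items():
    print(f"{address:24} {{2,3}}: {short!s:5}  {{2,4}}: {longer}")
```

Each widening of the range fixes one case and still rejects the next valid address, which is the same trap a hard-coded list of permitted name characters would set.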
    
    nuc1e0n a day ago
    
    To add to the complexity, not all Chinese characters in use for names are representable in Unicode. Perhaps at some point legal institutions must just define the list of characters that people can have as part of their name as listed on documentation. This reminds me of that 'what programmers believe about names' article from a while back.
    
    pepa65 10 hours ago
    
    If so, I think they would just need to be added to Unicode. Do you have an estimate how many are missing?
    
    bmn__ 8 hours ago
    
    As an interested bystander, I estimate it to be on the order of 10⁵. Email Ken Lunde for better insights.
    Note that GP claimed "not representable" (not "not represented"). Based on what I know, that claim feels quite wrong.
    
    bmn__ a day ago
    
    > not all Chinese characters in use for names are representable in unicode
    Why? How do you come to this conclusion?
    
    SAI_Peregrinus a day ago
    
    Han unification[1] prevents the representation of all Chinese characters. There are multiple languages that use Chinese characters, but they don't all use the same characters. Unicode decided to only use Han Chinese characters, so names using other sorts of Chinese characters can't be written with Unicode. The Han "equivalent" characters can be used, but that looks weird.
    Think of it as though Unicode decided that the letter "m" wasn't needed to write English text, since you can just write "rn" and it'll be close enough. Someone named "James" might want to have their name spelled correctly instead of "Jarnes", but that wouldn't be possible. Han unification did essentially this.
    [1] https://en.wikipedia.org/wiki/Han_unification
    
    bmn__ a day ago
    
    I feel it's unlikely that this is the explanation for what GGP had in mind. I postulate that name characters usually have no variants, thus do not undergo unification, or where there are variants, they are already encoded as Z variants, so the contention is also moot.
    Prove me wrong with a counter-example.
    
    SAI_Peregrinus 10 hours ago
    
    https://soranews24.com/2014/02/13/japanese-woman-celebrates-...
    First search result.
    
    bmn__ 8 hours ago
    
    𫟈 is U+2B7C8 "CJK Unified Ideograph-2B7C8". 𛁻 is U+1B07B "Hentaigana Letter To-5".
    Both characters fall into the first category I mentioned, no variants.
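For anyone who wants to check the two code points themselves, a minimal sketch using Python's standard `unicodedata` module follows. The names it prints depend on the Unicode version bundled with the interpreter, so the lookup falls back to a placeholder string instead of raising; `describe` is a helper name made up for this example.

```python
import unicodedata

# The two characters cited above, written as escapes so the source is ASCII.
CHARS = ["\U0002B7C8", "\U0001B07B"]

def describe(ch: str) -> dict:
    """Collect the code point, database name, and encoded sizes of one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        # Falls back to a placeholder if this Python's Unicode database
        # predates the character.
        "name": unicodedata.name(ch, "<not in this Unicode database>"),
        "utf8_bytes": len(ch.encode("utf-8")),
        # Characters outside the BMP need a surrogate pair in UTF-16.
        "utf16_units": len(ch.encode("utf-16-le")) // 2,
    }

info = [describe(ch) for ch in CHARS]

for row in info:
    print(row)
```

Both characters sit outside the Basic Multilingual Plane, so a system limited to UCS-2 or to any 8-bit code page cannot store them even though Unicode has assigned them.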
- Piskvorrr 2 days ago
  
  The Artist Formerly Known As Prince has entered the chat.
  (the name was a unique symbol)
  Yes, all abstractions leak, there will always be edge cases. Doesn't mean "JUST USE ASCII DUH" (the lowercase extension is for the wimps); a whole spectrum exists between these extremes.
  - msla 2 days ago
    
    Right. Unicode is the current best-effort, good-faith way to include everyone.
    It isn't perfect, and there's always a way to subvert good faith if you want to make a point or just be an asshole. The Unicode Consortium is working on the first, and the second can be handled by the majestic indifference of bureaucracy.