* Peter da Silva <peter@xxxxxxx.xxx> [2005-10-17 20:15]:
> > Plus null bytes can then be part of the data, so most
> > charset-oblivious software breaks.
>
> I thought breaking 8-bit-only software was a good thing.

I said charset-*oblivious*. A lot of software passes around strings
without ever processing them. It would be pretty pointless to force
that sort of code to deal with encoding issues; just make sure null
termination continues to work and the software will happily handle
Unicode as well as it handles anything else. A mailer OTOH, in its
interaction with the terminal and with the editor used to compose
mail, does need to understand charsets, at least insofar as it needs
to know how to convert between them.

> > Not worth it, considering that 99.99% of text processing is
> > either gluing strings together without looking inside, or
> > processing them character-by-character.
>
> Processing them character by character in UCS-4 is so much
> easier than doing it in UTF-8. So is gluing them together.

Walking through the string is marginally more difficult with UTF-8
than with single-byte charsets, granted. That said, I've written the
code to do it straight from the spec, twice, and got it right on the
first try both times. But where you get the idea that gluing strings
together is difficult, I have no idea. What's difficult about
strcat?! UTF-8 never produces a null byte for any codepoint other
than NUL, so concatenation works exactly as it always has.

> > Blindly indexing into a string without having scanned it
> > previously is so rare it doesn't merit consideration.
>
> Blindly indexing into a file without having scanned it
> previously is so common that you don't even remark on it
> happening.
>
> A file, remember, is a string.

I don't see how that is relevant. I can't think of any common use
case where you'd seek blindly into a text file, though there are
plenty where you'd seek blindly within binary files.

But I'm not talking about heaps of bytes. Text is not the same thing
as a sequence of bytes, even though it's stored in one. A heap of
bytes can be interpreted as text only if you know what encoding it's
in. If you see 0x41 in a heap of bytes, that's only "capital A"
because ASCII says so. That doesn't mean every heap of bytes should
be interpreted as if it were a bunch of UTF-8 codepoints. It's not;
it's a heap of bytes. But text is a heap of characters, and text
should be encoded into a heap of bytes by way of the UTF-8 encoding.

Now if you're processing text files, you don't index blindly anyway;
you scan for newlines (cf. character-by-character processing). And
that works in UTF-8 as well as it always has with ASCII: the bytes
of a multibyte character all have the high bit set, so they never
collide with the codepoints below 128, and you can just scan for
0x0A.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
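
PS: since I claim the walking code is easy to write straight from
the spec, here is roughly what such a walker looks like. This is an
untested from-memory sketch, not production code, and utf8_next is
just an illustrative name; real code would also reject overlong
forms and out-of-range values.

    #include <stdio.h>

    /* Sketch of a UTF-8 walker written straight from the spec.
     * Error handling is minimal: a stray trail byte decodes to
     * U+FFFD, and a truncated sequence just yields whatever bits
     * were accumulated so far. */
    static const char *utf8_next(const char *s, unsigned long *cp)
    {
        const unsigned char *p = (const unsigned char *)s;
        int trail;

        if      (*p < 0x80) { *cp = *p;        trail = 0; } /* plain ASCII      */
        else if (*p < 0xC0) { *cp = 0xFFFD;    trail = 0; } /* stray trail byte */
        else if (*p < 0xE0) { *cp = *p & 0x1F; trail = 1; } /* 2-byte sequence  */
        else if (*p < 0xF0) { *cp = *p & 0x0F; trail = 2; } /* 3-byte sequence  */
        else                { *cp = *p & 0x07; trail = 3; } /* 4-byte sequence  */

        p++;
        while (trail-- > 0 && (*p & 0xC0) == 0x80)
            *cp = (*cp << 6) | (*p++ & 0x3F); /* 6 payload bits per trail byte */
        return (const char *)p;
    }

    int main(void)
    {
        const char *s = "na\xC3\xAFve"; /* "naive" with a diaeresis, as UTF-8 */
        unsigned long cp;
        while (*s) {
            s = utf8_next(s, &cp);
            printf("U+%04lX\n", cp);
        }
        return 0;
    }

Character-by-character processing is then just that loop with your
own code in place of the printf.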
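
PPS: and to make the newline point concrete: since every byte of a
multibyte UTF-8 sequence has the high bit set, a byte-wise scan for
0x0A can never fire in the middle of a character. Counting lines in
UTF-8 is therefore byte-for-byte the same code as in ASCII;
count_lines is again just an illustrative name.

    #include <stddef.h>

    /* Counts lines by scanning bytes for 0x0A. Safe on UTF-8 data
     * because lead and trail bytes are all >= 0x80 and so can never
     * be mistaken for a newline. */
    size_t count_lines(const char *buf, size_t len)
    {
        size_t i, lines = 0;
        for (i = 0; i < len; i++)
            if (buf[i] == '\n')
                lines++;
        return lines;
    }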