Re: Significant whitespace (was Re: Blogging sucks)


From: A. Pagaltzis
Subject: Re: Significant whitespace (was Re: Blogging sucks)
Date: 19:47 on 17 Oct 2005
* Peter da Silva <peter@xxxxxxx.xxx> [2005-10-17 20:15]:
> > Plus null bytes can then be part of the data, so most
> > charset-oblivious software breaks.
> 
> I thought breaking 8-bit-only software was a good thing.

I said charset-*oblivious*. A lot of software passes around
strings without ever processing them. It would be pretty
pointless to force that sort of code to deal with encoding
issues; just make sure null termination continues to work and the
software will happily work with Unicode as well as it does
otherwise.
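
To illustrate (a minimal C sketch; str_dup is just a made-up
name for the kind of routine I mean):

    #include <stdlib.h>
    #include <string.h>

    /* Charset-oblivious: never looks inside the string, only
     * for the terminating null. Handles UTF-8 for free, since
     * UTF-8 never uses a 0x00 byte for anything but NUL. */
    char *str_dup(const char *s)
    {
        size_t len = strlen(s) + 1;
        char *copy = malloc(len);
        if (copy != NULL)
            memcpy(copy, s, len);
        return copy;
    }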

A mailer OTOH, in its interaction with the terminal and with the
editor used to compose mail, needs to understand charsets, at
least insofar as it needs to know how to convert between them.

> > Not worth it, considering that 99.99% of text processing is
> > either gluing strings together without looking inside, or
> > processing them character-by-character.
> 
> Processing them character by character in UCS-4 is so much
> easier than doing it in UTF-8. So is gluing them together.

Walking through the string is marginally more difficult with
UTF-8 than with single-byte charsets, granted. That said, I’ve
written the code to do it straight from spec, twice, and got it
right on the first try both times.
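
A from-spec decoder is about this much code (a sketch against
RFC 3629; it rejects stray continuation bytes, but doesn’t
police overlong forms or surrogates):

    #include <stddef.h>

    /* Decode one UTF-8 sequence at s; store the codepoint in
     * *cp, return the number of bytes consumed, 0 if malformed. */
    size_t utf8_decode(const unsigned char *s, unsigned long *cp)
    {
        size_t len, i;

        if      (s[0] < 0x80) { *cp = s[0]; return 1; }
        else if (s[0] < 0xC0) return 0; /* stray continuation */
        else if (s[0] < 0xE0) { *cp = s[0] & 0x1F; len = 2; }
        else if (s[0] < 0xF0) { *cp = s[0] & 0x0F; len = 3; }
        else if (s[0] < 0xF8) { *cp = s[0] & 0x07; len = 4; }
        else return 0;

        for (i = 1; i < len; i++) {
            if ((s[i] & 0xC0) != 0x80) /* must be 10xxxxxx */
                return 0;
            *cp = (*cp << 6) | (s[i] & 0x3F);
        }
        return len;
    }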

But where you get the idea that gluing them together is
difficult, I have no idea. What’s difficult about strcat?!
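
Since every UTF-8 sequence is self-contained, putting two valid
UTF-8 strings end to end yields a valid UTF-8 string, so plain
strcat does the job (toy example; the buffer size is checked by
eyeball only):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char buf[32] = "alpha beta: ";
        strcat(buf, "\xCE\xB1\xCE\xB2"); /* UTF-8: U+03B1 U+03B2 */
        puts(buf); /* still valid UTF-8, no charset logic at all */
        return 0;
    }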

> > Blindly indexing into a string without having scanned it
> > previously is so rare it doesn’t merit consideration.
> 
> Blindly indexing into a file without having scanned it
> previously is so common that you don't even remark on it
> happening.
> 
> A file, remember, is a string.

I don’t see how that is relevant. I can’t think of any common use
case where you’d be seeking blindly into a text file, though I
can think of plenty where you’d be seeking blindly within a
binary file.

But I’m not talking about heaps of bytes. Text is not the
same thing as a sequence of bytes, even though it’s stored in
one. A heap of bytes can be interpreted as text only if you know
what encoding it’s in. If you see 0x41 in a heap of bytes, that’s
only “capital A” because ASCII says so.

That doesn’t mean every heap of bytes should be interpreted as if
it were a bunch of UTF-8-encoded codepoints. It’s not; it’s a
heap of bytes. But text is a heap of characters, and text should
be encoded into a heap of bytes by way of UTF-8.
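
And the encoding direction is as mechanical as the decoding one
(again a from-spec sketch, covering codepoints up to U+10FFFF):

    #include <stddef.h>

    /* Encode codepoint cp as UTF-8 into buf (>= 4 bytes);
     * returns the number of bytes written, 0 if out of range. */
    size_t utf8_encode(unsigned long cp, unsigned char *buf)
    {
        if (cp < 0x80) {
            buf[0] = cp;
            return 1;
        }
        if (cp < 0x800) {
            buf[0] = 0xC0 |  (cp >>  6);
            buf[1] = 0x80 |  (cp        & 0x3F);
            return 2;
        }
        if (cp < 0x10000) {
            buf[0] = 0xE0 |  (cp >> 12);
            buf[1] = 0x80 | ((cp >>  6) & 0x3F);
            buf[2] = 0x80 |  (cp        & 0x3F);
            return 3;
        }
        if (cp < 0x110000) {
            buf[0] = 0xF0 |  (cp >> 18);
            buf[1] = 0x80 | ((cp >> 12) & 0x3F);
            buf[2] = 0x80 | ((cp >>  6) & 0x3F);
            buf[3] = 0x80 |  (cp        & 0x3F);
            return 4;
        }
        return 0; /* beyond U+10FFFF */
    }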

Now if you’re processing text files, you don’t index blindly
anyway, you scan for newlines (cf. character-by-character
processing). And that works in UTF-8 as well as it always has
with ASCII: every byte of a multibyte sequence is 0x80 or above,
so it can never collide with the codepoints below 128, and you
can just scan for 0x0A.
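
In code terms (a sketch; it works byte-wise precisely because no
byte of a multibyte UTF-8 sequence can equal 0x0A):

    #include <string.h>

    /* Call fn once per line of buf. Safe on UTF-8 input: all
     * lead and continuation bytes are >= 0x80, so memchr can
     * never false-positive on '\n' (0x0A) inside a character. */
    void for_each_line(const char *buf, size_t len,
                       void (*fn)(const char *line, size_t n))
    {
        const char *p = buf, *end = buf + len, *nl;

        while (p < end && (nl = memchr(p, '\n', end - p))) {
            fn(p, nl - p);
            p = nl + 1;
        }
        if (p < end)
            fn(p, end - p); /* last line, no trailing newline */
    }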

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
