freenode/#lisp - IRC Chatlog
14:22:19
pjb
Murii: also, if your language supports unicode, it may not be the best choice to make strings vectors.
14:23:01
jackdaniel
technically string is a set of symbols from a specified alphabet, no vector required
14:23:18
pjb
Since the properties of unicode strings are so strange, vectors of glyphs, vectors of code-points, or vectors of characters are all inconvenient. Some argue that even having characters is inconvenient, when you use unicode.
15:07:19
TMA
hajovonta: if you restrict yourself to English the intuition is fine. but for more complex languages you soon run full speed into a wall with it
15:09:48
Shinmera
Often times a unicode string will be taken as a vector of code points. Unicode allows you to compose characters through multiple individual code points, where the order does not necessarily matter. Thus, as a representation of a "character string" different vectors represent the same thing, but are no longer identical.
15:10:13
TMA
because the properties of sequences no longer hold -- if you concatenate sequences you would probably assume that the resulting length will be the sum of lengths
15:12:01
hajovonta
we have several strange characters in our language (Hungarian) like á, é, í, ú, and ö, ő, ü, ű
15:12:35
hajovonta
but, when I write "árvíz" and concatenate "tükör" then "árvíztükör" is a sequence of 10 characters
15:13:16
Shinmera
Unicode has code points for just the ticks, so you could write a+tick, as two code points, which would be a string of length 2 in most implementations.
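The point about "a+tick" can be tried directly. The following is a minimal sketch, assuming an implementation whose characters are Unicode code points (SBCL and CCL are examples); the function name `compare-a-acute` is made up for this illustration.

```lisp
;; Assumption: CHAR-CODE-LIMIT is larger than #x301, i.e. the implementation
;; supports Unicode characters. COMPARE-A-ACUTE is a hypothetical name.
(defun compare-a-acute ()
  (let ((precomposed (string (code-char #xE1)))     ; U+00E1, a with acute
        (decomposed (coerce (list #\a (code-char #x301)) 'string))) ; a + U+0301
    ;; Both strings render as the same letter, yet they differ as sequences.
    (list (length precomposed)
          (length decomposed)
          (string= precomposed decomposed))))

;; (compare-a-acute) returns (1 2 NIL) under the assumption above.
```

Comparing the two as "the same text" needs Unicode normalization, which the Common Lisp standard does not provide.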
15:13:44
TMA
hajovonta: say that you concatenate 'hajov' and 'onta' in a CV syllabic script ... the first has 3 characters (ha-jo-v) the second has 3 (o-n-ta) ; but the concatenation has 5 not 6 (ha-jo-vo-n-ta)
15:15:22
hajovonta
but it doesn't matter how many code points make up a character. We can just count the characters, can't we?
15:17:58
TMA
hajovonta: why yes. sometimes. at other times what constitutes a character is hard to tell. ... IIRC, in Spanish ll is a single character though two codepoints. ch is a single character in Czech
15:19:35
TMA
hajovonta: so when you see the string "chill", you do not know how many characters you have
15:20:35
dlowe
generally, when you want to know the "length" of the string, you want to know a) how many bytes does it take up or b) how many pixels will it take to render this string, both of which have satisfactory answers.
15:21:28
dlowe
A perverse mind might want to know it in order to find the maximum valid index of a character.
15:23:03
hajovonta
15:23:02 - jackdaniel: technically string is a sequence of symbols from a specified alphabet, no vector required
15:24:04
hajovonta
based on jackdaniel's definition, a string is a sequence of symbols from a specified alphabet
15:24:08
jackdaniel
characters are symbols of the "text" alphabet, and that's what languages usually implement
15:24:16
pjb
hajovonta: unicode strings are decomposed into glyphs, each of which is defined by a sequence of code-points of variable length.
15:24:29
hajovonta
I agree that the number of characters can be different when using different alphabets
15:24:46
pjb
hajovonta: the notion of a string as a sequence of characters would imply that a character is an object of variable length.
15:25:26
pjb
hajovonta: no implementation implements characters this way, because it makes for very complex objects, compared to the usual single byte for C char, or 32-bit word for unicode code-point.
15:26:21
jackdaniel
(what I meant is that one glyph is an element of the alphabet, not its code-points, so it is irrelevant to the definition)
15:26:29
pjb
There's also the problem of normalization, such as the various unicode representation of á, or the problem of ligatures.
15:26:58
pjb
jackdaniel: ok, you're an implementor. I dare you to implement ecl characters as variable-length sequences of code points…
15:29:00
pjb
hajovonta: by the way, even without going full unicode, just with ascii, you have the distinction between characters or ASCII control codes.
15:29:14
jackdaniel
yes, and when you say (elt *string* 18) it won't take the 18th code-point, but the 18th character
15:29:34
pjb
hajovonta: usually implementations make strings vectors of characters, with virtual characters corresponding to ascii control codes, which have no meaning.
15:30:39
jackdaniel
right, the only important thing is that the sequence is finite, alphabet may have infinite number of possible symbols
15:30:43
pjb
hajovonta: so far, even with unicode, it's finite (and way bigger than 2^21), but just let the user combine the code-points without limiting the number of combinations, and you get an infinite number of characters!
15:31:28
pjb
hajovonta: in any case, your question is irrelevant: it could indeed (and should IMO) be done that way, but it is not done in practice by implementations!
15:33:20
dlowe
which makes it impossible to both support the notion of a unicode character and the CL spec.
15:36:05
dlowe
pjb: sure, but you'll need some way to access them without decomposing them on the fly.
15:37:25
hajovonta
but a character can be anything, like "djshfkjdhskh". In a hypothetical alphabet, this can be one character.
15:38:11
pjb
basically, IIRC, you can have up to ten combining code-points following a non-combining code-point.
15:39:32
pjb
This would have the advantage, that you could represent most common characters as fixnums.
16:30:45
flip214
pjb: fixnums are awfully large... even the 32bit character on 64bit machines hurts, if you need to store some larger text body
16:36:10
pjb
If you have large anything, you need to consider your own data structures and algorithms.
16:36:28
pjb
And indeed, the sequence-of-characters representation of a large body of text is not often the best one.
16:36:52
pjb
See for example, lisp source code: it's read and not represented as strings, but instead as sexps!
16:37:22
foom
The problem is that it's not the best representation for a small body of text either, except where "best" is defined as "works within existing standard".
16:37:29
pjb
If you had to read wikipedia, probably you'd start by storing words instead of characters… And then perhaps you'd even try to store relationships inferred from the sentences…
16:38:51
foom
What you really want is to store the text as utf8, and provide APIs to iterate over encoded bytes, codepoints, grapheme clusters, words, etc.
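Two of the "lengths" mentioned in this discussion, code points and encoded bytes, can be computed in portable Common Lisp. This is only a sketch: it assumes each Lisp character is exactly one Unicode code point (not true of every implementation), it ignores surrogate code points, and the names `utf8-octet-count` and `length-report` are invented here.

```lisp
;; Assumption: one Lisp character = one Unicode code point.
;; UTF8-OCTET-COUNT and LENGTH-REPORT are hypothetical helper names.
(defun utf8-octet-count (string)
  "Number of octets STRING would occupy when encoded as UTF-8."
  (loop for ch across string
        for code = (char-code ch)
        sum (cond ((< code #x80) 1)       ; ASCII range
                  ((< code #x800) 2)      ; up to U+07FF
                  ((< code #x10000) 3)    ; rest of the BMP
                  (t 4))))                ; supplementary planes

(defun length-report (string)
  "Return the code point count and UTF-8 octet count of STRING."
  (list :code-points (length string)
        :utf8-octets (utf8-octet-count string)))

;; For a followed by U+0301 (combining acute accent):
;; (length-report (coerce (list #\a (code-char #x301)) 'string))
;; returns (:CODE-POINTS 2 :UTF8-OCTETS 3), although a reader sees one letter.
```

Counting grapheme clusters or words needs the Unicode segmentation rules, which the standard does not provide, so that part is left out of the sketch.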
21:10:01
pierpa
Peter Norvig just made a pdf of PAIP available for free. No more excuses for not studying it.
21:13:52
sjl
I already have it from the ACM archive, but really nice that it's free for everyone now
21:15:25
pierpa
He says: "The .txt version has a lot of errors; I got it from the default Save as other / ...Text menu item in Acrobat. An automated tool could rejoin the lines that end in hyphens, and perhaps find missing spaces, as in programmingpractices and anunfortunate. Other errors would require significant human labor to clean up."
21:16:51
Xach
pierpa: it never occurred to me to subscribe to the repo; why would it ever change? but that's awesome.
21:21:43
sjl
looks like it's partially OCR'ed from a scan, and the scanned words replaced with text in some font
21:29:11
pierpa
what about setting up a group of volunteers for fixing the .txt? split the file in small chunks, distribute the chunks to volunteers, etc...
21:43:18
pierpa
"Elsevier has reverted the copyright on the book to the author (me, Peter Norvig), so we are now free to do with it what we want. Robert Smith, @tarballs-are-good, is interested in putting in some work towards this end."
21:56:39
mishoo_
heh, found that exact PDF years ago (via thepiratebay, iirc). typographical quality isn't great :-/ but the content is gold
21:57:23
Xach
scymtym: just a feel. when it finishes i'll have a better idea of whether the feeling is correct.
22:03:42
sjl
the ACM's is a scan of the book. It's OCRed and searchable/copyable, but they didn't actually replace the image of the scan with text like the version in the repo
23:22:08
jasom
if # is a non-terminating macro character why does sbcl print :foo#bar as :|FOO#BAR| and slime highlight :foo#bar as different words?
23:24:02
jasom
pjb: my question is why sbcl puts spurious || around it and slime highlights it incorrectly. I agree it doesn't terminate the token
23:24:07
pjb
If it was terminating, say, like ', then in foo'bar the quote terminates the foo token, and then a further read will read 'bar ( (quote bar) ).
23:24:31
pjb
jasom: the printing is in part implementation dependent, and in part directed by the *print-…* variables.
23:25:03
jasom
pjb: I'm not saying sbcl is doing something wrong, but the fact that it chooses to escape tokens with # in them makes me wonder if I ought to do so in my code
23:27:21
Shinmera
jasom: 22.1.3.3 seems to imply that it's allowed to do this, even if there's no strict need to.
23:28:06
jasom
pjb: I know how print-escape works. I was merely expressing a concern that the sbcl devs know something I don't with regard to internal # in symbol names
23:28:53
pjb
jasom: theoretically, they would have to check in the read table whether a character is a terminating macro character or not. Instead if you systematically escape, you can print faster!
23:31:05
jasom
pjb: that makes no sense because it has to check all the alphabetic characters which are much more common
23:32:23
jasom
that would make sense because # is the only non-terminating macro character in the standard readtable
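The difference between a terminating and a non-terminating macro character can be checked at the REPL. A small sketch, run under the standard readtable via `with-standard-io-syntax`; the function name `token-demo` is made up, and keywords are used only to avoid interning symbols in the current package.

```lisp
;; TOKEN-DEMO is a hypothetical name. WITH-STANDARD-IO-SYNTAX binds
;; *READTABLE* to the standard readtable for the duration of the body.
(defun token-demo ()
  (with-standard-io-syntax
    (list
     ;; # inside a token does not end it: one symbol, whole string consumed.
     (multiple-value-list (read-from-string ":foo#bar"))
     ;; ' ends the token: only :FOO is read, reading stops at index 4.
     (multiple-value-list (read-from-string ":foo'bar"))
     ;; Second value of GET-MACRO-CHARACTER is the non-terminating flag,
     ;; normalized here to T or NIL.
     (if (nth-value 1 (get-macro-character #\#)) t nil)
     (if (nth-value 1 (get-macro-character #\')) t nil))))

;; (token-demo) returns ((:FOO#BAR 8) (:FOO 4) T NIL)
```

How the symbol is printed back (with or without vertical bars) is a separate matter from how it is read.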
23:52:39
sjl
The first reads as the symbol nil, the second reads as the list (quote nil). When evaluated they result in the same thing, because nil is special and evaluates to itself.
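Assuming the two forms being compared are nil and 'nil (the question itself is not part of this excerpt, but the messages that follow use those forms), the read/eval distinction can be sketched like this. The function name `nil-forms-demo` is invented for the example.

```lisp
;; NIL-FORMS-DEMO is a hypothetical name. Standard syntax is bound so that
;; the result does not depend on the caller's *PACKAGE* or *READTABLE*.
(defun nil-forms-demo ()
  (with-standard-io-syntax
    (let ((bare (read-from-string "nil"))      ; the symbol NIL
          (quoted (read-from-string "'nil")))  ; the list (QUOTE NIL)
      ;; Reading gives different objects; evaluating gives the same value.
      (list bare quoted (eval bare) (eval quoted)))))

;; (nil-forms-demo) returns (NIL (QUOTE NIL) NIL NIL)
```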
23:59:56
pjb
(defpackage "MY-NULL" (:export "NIL") (:use)) (defconstant my-null:nil 0) (let ((*package* (find-package "MY-NULL"))) (list (eval (read-from-string "'nil")) (eval (read-from-string "nil")))) --> (my-null:nil 0)
0:00:28
pjb
energizer: on the other hand: (let ((*package* (find-package "MY-NULL"))) (list (eval (read-from-string "'()")) (eval (read-from-string "()")))) --> (nil nil)
0:00:28
stacksmith
Nil is special: it's kind of like a keyword - a subtype of symbol that evaluates to itself. It is also considered a list with 0 items (listp nil) => t
0:00:45
pjb
energizer: but in this case, it depends on *readtable*, where the reader macros for ' and ( are defined.
0:01:07
pjb
energizer: you could change those reader macros to read something other than CL:QUOTE and CL:NIL.
0:03:54
stacksmith
form n. 1. any object meant to be evaluated. 2. a symbol, a compound form, or a self-evaluating object. 3. (for an operator, as in ``<<operator>> form'') a compound form having that operator as its first element. ``A quote form is a constant form.''
2:01:41
stylewarning
I’m the one who helped get copyright reverted, and it looks like Elsevier might have lost the source