libera/#sicl - IRC Chatlog
1:19:02
Bike
other (spun-off) parts are used in other projects, e.g. staple uses eclector-cst for marking up code.
1:19:48
Bike
and yes, cleavir has a million asds because the systems are independently loadable/not loadable.
1:29:23
Bike
for example someone might want to parse code into ASTs to do syntactic analysis, but isn't interested in the later IR stage.
7:12:51
beach
Here is an idea for finding a symbol with a particular name in the CL package: We compute the length of the name as pkhuong suggests. Then we do a "binary search" similar to what we do with fast generic dispatch. We branch so that around half of the different lengths of the names in the CL package are in one branch, and around half in the other.
7:12:57
beach
This is not the same as branching on half the max length, because there are very likely lengths that don't exist at all. This step will take around lb N steps where N is the number of different lengths.
7:13:07
beach
Then we can go from left to right in the candidate names, because we need to check every character in the input anyway. If there is a single character in the current position in the candidate names, we just check that it's the same as in the input.
7:13:08
beach
If there is a small number of possible letters, just do one or more IFs. If there is a large number of possibilities, then do a binary search on the character code of the input.
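The scheme sketched over the last few messages can be illustrated in Python (a sketch only; names are illustrative, and a real implementation would compile the decision tree into straight-line branches rather than interpret it at lookup time):

```python
from bisect import bisect_left

def make_matcher(names):
    """Build a lookup closure: binary search over the distinct name lengths,
    then left-to-right character dispatch within each length group."""
    by_len = {}
    for name in names:
        by_len.setdefault(len(name), []).append(name)
    lengths = sorted(by_len)  # only lengths that actually occur

    def lookup(s):
        i = bisect_left(lengths, len(s))          # ~lb N steps, N = number of distinct lengths
        if i == len(lengths) or lengths[i] != len(s):
            return None
        candidates = by_len[lengths[i]]
        for pos in range(len(s)):                 # every input character is checked anyway
            chars = sorted({c[pos] for c in candidates})
            if len(chars) == 1:                   # single possibility: one equality test
                if s[pos] != chars[0]:
                    return None
            else:                                 # several: binary search on the char code
                j = bisect_left(chars, s[pos])
                if j == len(chars) or chars[j] != s[pos]:
                    return None
            candidates = [c for c in candidates if c[pos] == s[pos]]
        return candidates[0]
    return lookup
```

For example, `make_matcher(["car", "cdr", "cons", "list"])` first splits on length 3 vs 4, then distinguishes `car` from `cdr` at position 1 and `cons` from `list` at position 0.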
7:14:05
beach
I suspect that the number of cases where there are a lot of possible characters is small.
7:15:53
beach
With binary search, you always get lb M tests, so for binary search to pay off, lb M must be smaller than M/2.
7:17:20
beach
The number of such cases can be determined by examining all CL symbols in each length category.
7:18:09
beach
Oh, and the sequential test can be made faster by making the tests for more likely symbols appear earlier.
7:23:30
beach
This smells like an intractable problem. But the domain is small, and the computation is done once and for all.
7:25:16
Mondenkind
the CL package has O(1024) symbols, so you need at least 10 tests for uniformly distributed symbols. As you say, you may do somewhat better on average by taking into account frequency, but it's still a branchfest that walks all over memory. And the branches are probably less predictable than for generic functions
7:26:12
Mondenkind
not to mention the second-order cost of branches (alias/evict good branches, mispredicts inhibit parallelisation with surrounding code) and memory accesses (alias/evict good memory)
7:26:24
beach
Sure, it is not cheap. The question is whether it is better than some other strategy like hashing.
7:29:11
Mondenkind
one thing which may be interesting is to store the character data directly inside of the hash table, up to some limit, to improve spatial locality
7:29:35
Mondenkind
so you get both the data and the symbol on the same cache line. Do that up to, say, 15 ascii characters
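Mondenkind's inline-key layout might be sketched like this (Python standing in for what would be a C struct; the class and field names are illustrative): each slot packs a length byte plus up to 15 ASCII characters of the key directly next to the value, so probing compares the name without chasing a pointer to separate string storage.

```python
import struct

SLOT_KEY = struct.Struct("B15s")  # 1 length byte + up to 15 ASCII chars = 16 bytes

def pack_key(name):
    """Pack a short name into a fixed 16-byte field stored inline in the slot."""
    data = name.encode("ascii")
    assert len(data) <= 15
    return SLOT_KEY.pack(len(data), data)

class InlineKeyTable:
    """Open-addressing table whose slots hold (packed_key, value) side by side,
    so the key comparison touches the same cache line as the value would."""
    def __init__(self, size=256):
        self.slots = [None] * size

    def _probe(self, name):
        key = pack_key(name)
        i = hash(name) % len(self.slots)
        while True:
            slot = self.slots[i]
            if slot is None or slot[0] == key:  # one memcmp-like compare, no pointer chase
                return i, key
            i = (i + 1) % len(self.slots)

    def insert(self, name, value):
        i, key = self._probe(name)
        self.slots[i] = (key, value)

    def lookup(self, name):
        i, key = self._probe(name)
        slot = self.slots[i]
        return slot[1] if slot is not None else None
```

Names longer than the 15-character limit would fall back to an out-of-line string, which this sketch omits.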
7:54:30
beach
Actually, a binary search must end with an equality test, so 1 + lb M must be smaller than M/2. So if my computation is right, that makes M > 8
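The crossover point beach computes can be checked numerically (a quick sketch; "lb" is log base 2):

```python
import math

# Binary search costs 1 + lb M tests (the +1 is the final equality test);
# a sequential scan costs about M/2 on average. Find the smallest M
# for which binary search wins.
smallest = next(m for m in range(2, 100) if 1 + math.log2(m) < m / 2)
print(smallest)
```

This prints 9, agreeing with M > 8.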
7:58:38
beach
So if we take symbols whose name has 6 characters (that's the category with the largest number of symbols, namely 91), there are 18 different first letters, which suggests a branch on the char-code. But then, it might be less advantageous to do a binary search for each branch.
8:00:01
Mondenkind
you could test multiple characters at once--both when binary and linearly searching--with SWAR tricks
8:01:27
Mondenkind
probably most advantageous with the latter, since it means you have to do fewer double-checks, but for the former, it allows for more precise cuts
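One illustration of the SWAR idea (a Python sketch; real code would do this on machine words in C or in generated assembly): load 8 characters of each string as a single 64-bit integer, so one XOR compares all 8 positions at once.

```python
def chunk8(s, offset=0):
    """Pack 8 ASCII characters into one little-endian 64-bit word."""
    return int.from_bytes(s[offset:offset + 8].encode("ascii"), "little")

def match8(candidate, inp, offset=0):
    """Compare 8 characters of both strings with a single XOR:
    the two words are equal iff the XOR is zero."""
    return (chunk8(candidate, offset) ^ chunk8(inp, offset)) == 0
```

For example, `MULTIPLE-VALUE-BIND` and `MULTIPLE-VALUE-SETQ` agree on their first 8 characters (one word test) and are told apart by the second word, rather than by up to 16 single-character tests.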