libera/#sicl - IRC Chatlog
9:22:18
moon-child
though I am partial to fuzxxl's take on risc vs cisc: that the former is no longer extant
9:23:03
hayley
Apparently the textbook also states that, on average, you tend to achieve one cycle per instruction. That seems like a low bar with superscalar execution, and slightly optimistic without.
9:24:36
hayley
Henry Baker once said efficient run-time checks were harder to optimize on "untagged RISC architectures", but given that the lecturers state you need a language like C or C++ or FORTRAN to burn that well, I think it is a moot point. (:
9:26:11
hayley
We got to hear a war story about writing assembler to use both the U and V pipes on the original Pentium though.
9:32:53
beach
If the CISC is very regular, like the VAX, it can be a bit easier to write a compiler and to write directly in assembler.
9:34:05
beach
Also, on older CISCs, a register operation and a memory operation took roughly the same time, so there was not that issue back then.
9:37:28
hayley
Baker once wrote "Today's untagged RISC architectures put the burden on optimizing compilers to generate efficient run-time array-bounds and pointer checks", which is the only statement I had ever read suggesting compiling to a RISC was harder.
9:38:39
hayley
According to the lecturers, you still can only use C or FORTRAN to write sufficiently fast numeric code, and implementations of those languages do not do bounds checking. By that logic, there isn't anything hard about compiling to a RISC.
9:38:57
beach
So it is not the run-time checks that are optimized, but the code in general, due to the fact that more run-time checks are necessary.
9:45:33
hayley
To generate fast code. My classmates and I sometimes speak of "burning" through especially fast code.
9:46:35
hayley
* hayley is still thinking about assignment work, and hasn't yet switched contexts apparently.
9:47:43
beach
So was "well" meant to be an adverb, and "that" meant to be a relative pronoun not followed by a noun phrase?
9:48:21
hayley
If fast code can only be written in the aforementioned languages, and implementations of those languages don't perform bounds checking, then surely optimizing out bounds checks is not a large concern while writing a compiler.
9:49:32
hayley
"well" was used as an adverb, and I don't think "that" was used as a relative pronoun. It was used like "that" in "I can't ride a bicycle that well".
9:54:50
beach
We need to teach you how to write in a sufficiently unambiguous way that people like me (who require very precise phrases to get it) can understand your writing.
9:54:55
hayley
It was a strange (but predictable) thing to hear from the lecturers, given that I found my regular expression compiler (in pure Common Lisp, no SIMD trickery yet) to scan at a throughput of about 4 cycles per character.
9:56:16
beach
And in #commonlisp, you never answered my question about comparison with other programs for regular expressions. For performance I mean. I have no idea what would be considered "normal".
9:57:31
yitzi
beach: Seems like SICL has two printing packages, Incless and the mostly empty sicl-printer. Can you tell me a little bit about what is going on there? Incless defines pprint, so that seems like where the pretty printer should go, but I suppose it could be put in its own package and sicl-printer could fuse the two.
9:59:19
beach
It would depend a bit on how sophisticated the pretty-printer architecture needs to be.
10:00:58
beach
yitzi: My immediate reaction would be to make it a module, either in the same repository or in a separate one. Either way, I think they are fairly independent.
10:01:23
hayley
The main performance improvements are that I convert regular expressions to a DFA and compile the DFA to a single function, whereas cl-ppcre executes a chain of closures generated from an NFA, and I specialise compiled functions to array types.
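The shape of the compile-the-DFA-to-one-function approach can be sketched in C (an invented example for the regex ab*c, not hayley's actual compiler): each DFA state becomes a label and each transition a goto, so matching is a tight loop with no chain of closure calls.

```c
#include <stddef.h>

/* Hypothetical compiled form of the DFA for "ab*c": one function,
   one label per state, one goto per transition.  Returns 1 as soon
   as a prefix of s matches, 0 otherwise. */
static int match_abc(const char *s, size_t n) {
    size_t i = 0;
s0: if (i >= n) return 0;
    if (s[i++] == 'a') goto s1;
    return 0;
s1: if (i >= n) return 0;
    switch (s[i++]) {
    case 'b': goto s1;       /* stay in the b* loop */
    case 'c': return 1;      /* accepting transition */
    default:  return 0;
    }
}
```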
10:02:09
beach
hayley: Sort of like the difference between the abandoned HIR compiler and the HIR evaluator.
10:02:15
hayley
Silly question, would there be a case for rewriting XP, like phoe did with Pitman's portable condition system, or is XP that bad?
10:03:27
beach
And I have the suspicion that it may need some facilities for client configuration, like our other libraries do.
10:07:07
hayley
I am sure my compiler could end up providing huge PROG* forms to the Common Lisp compiler if it tried to compile complex regular expressions.
10:11:26
hayley
So I've been looking into lightweight ways of cutting down code size (which have to be fast, else they might be slower than using the CL compiler for the same optimizations). One way is a modification to gilberth's derivative technique which makes "duplication" of submatch registers explicit, and avoids duplication where possible.
10:13:33
scymtym
hayley: regarding not being able to do anything about branching: pkhuong had this trick for runs of characters: https://github.com/pkhuong/string-case/blob/master/string-case.lisp#L458 (link is to generated code, generator and rationale are in the same file)
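A loose C sketch of the idea behind that trick (not the generated code at the link): load a run of known characters as one machine word, XOR with the expected constant, and branch once on the combined result instead of once per character. The literal "hello" and 8-bit characters are assumptions of this sketch.

```c
#include <stdint.h>
#include <string.h>

/* Test a 5-character run with a single compare: any mismatching
   byte leaves a nonzero bit in the XOR, so one branch suffices. */
static int run_is_hello(const char *s) {
    uint64_t w = 0, expect = 0;
    memcpy(&w, s, 5);            /* load 5 bytes of input */
    memcpy(&expect, "hello", 5); /* the expected run */
    return (w ^ expect) == 0;
}
```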
10:14:24
hayley
The resulting equations for computing derivatives then sort of resemble Linear Lisp, or that sort of linear logic programming. So I call the IR "linear tagged regular expressions".
10:15:59
hayley
scymtym: Nice! I was considering something like it, as the SIMD search code would do something similar, but wasn't sure if it was a good idea. Thanks for the confirmation.
10:17:06
scymtym
hayley: sure. i don't know how applicable the technique is to unknown CHARACTER subtypes and portable code, though
10:17:43
scymtym
but i guess you could (maybe do) always compile a specific version for the CHARACTER subtype at hand
10:18:34
hayley
Right, I can see that (logxor (char-code a) (char-code b)) is not quite the same as (logxor a b). And we want the latter really.
10:19:47
hayley
I probably will generate code for each possible vector type which could store characters, but I suspect most will be SIMPLE-STRINGs.
10:32:00
hayley
It seems LLVM knows this trick, but neither I nor LLVM can think of a way to test if a character is equal to one of two other characters.
10:37:03
hayley
LLVM emits a CMP instruction, then loads the Equal flag into a register with SETE, then another CMP and SETE, then takes the logical OR of both flags. This procedure at least avoids branches, though.
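There is one well-known special case worth noting (it is not a general answer to the two-character test): when the two target characters differ in exactly one bit, as 'a' (0x61) and 'A' (0x41) do, a single masked compare tests membership in the pair.

```c
/* 'a' ^ 'A' == 0x20, a single bit, so OR-ing that bit in maps both
   characters -- and only those two -- onto 'a'. */
static int is_a_or_A(unsigned char c) {
    return (c | 0x20) == 'a';
}
```

The same trick underlies ASCII case folding; for two arbitrary characters whose codes differ in more than one bit it does not apply.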
10:37:34
scymtym
there are two axes: simple vs. non-simple and the CHARACTER subtype. the latter determines how many characters can (at most) be tested in one step
10:39:48
scymtym
no, i don't think it is worth the effort. but you would probably have to handle different character widths anyway
10:41:15
scymtym
anyway, i'm not sure the technique is useful in your case. i just thought it might be of interest to you
13:26:01
beach
Bike: Wouldn't it be the case that most functions at the AST level would appear as the CALLEE-AST of a CALL-AST? And those functions can't escape then.
13:28:52
beach
I thought they would be in the paper, but I did a quick scan and couldn't find anything.
13:42:46
beach
For the others, I guess it is harder at the AST level to track what becomes of them, since there is not the flow aspect of HIR.
13:46:01
beach
Well, this is not something I am planning to implement right now anyway, so I'll just keep it in mind.
13:46:34
beach
It would be interesting to do some quick performance comparisons between code executed by the AST evaluator and code executed by the HIR evaluator.
13:50:41
beach
What I should do now is to think about code generation. It appears that Cluster needs to handle something that is a bit like object code, i.e., in which certain references are not fully resolved. I am obviously not going to do that in the form of symbol tables stored in a file.
13:52:15
beach
I first thought I could get away with using Cluster "labels", but those are at the granularity of an instruction, so the patching code would then need to know where the operands of an instruction are located, which becomes messy.
13:53:35
beach
So I will try to work out an in-memory data structure for representing "object code" and some protocols for managing such code.
13:56:01
beach
The data structure could be as simple as a code vector of (unsigned-byte 8) and a list of pairs (identity . offset).
13:57:57
beach
There may have to be some size information too at some point, but I am pretty sure that we do only full-word patches, at least initially.
14:03:54
beach
I guess the code will always be only partially resolved because of call sites that can change at run time.
14:04:38
beach
So there will be no separate "object code" representation from the "fully resolved" representation.
14:10:26
beach
We probably also need to represent both "references" and "definitions" just like other object code does. So the simple idea with pairs might become a tad more complex.
14:16:17
beach
I think there are only two kinds of things to patch: 1. The target address of an unconditional jump, and 2. A full-word immediate value to load into a register.
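The byte vector plus (identity . offset) pairs that beach describes could be sketched in C roughly as follows; the names and the patch-kind enum are illustrative, not SICL's, and the two kinds correspond to the two patch targets listed above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative in-memory "object code": a byte vector plus a list
   of unresolved references, each naming what it refers to and the
   offset of the full-word hole to patch. */
enum patch_kind { PATCH_JUMP_TARGET, PATCH_IMMEDIATE };

struct patch {
    void           *identity;  /* what the hole refers to */
    size_t          offset;    /* byte offset of the 64-bit hole */
    enum patch_kind kind;
};

struct object_code {
    uint8_t      *bytes;
    size_t        size;
    struct patch *patches;
    size_t        npatches;
};

/* Resolve patch i by writing a full 64-bit word at its offset. */
static void apply_patch(struct object_code *oc, size_t i, uint64_t value) {
    memcpy(oc->bytes + oc->patches[i].offset, &value, 8);
}
```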
14:23:45
beach
So I guess I will allow IMMEDIATE-OPERAND to take a value that is something other than an integer, and such an operand will match a descriptor with a 64-bit immediate operand. And I will allow the unconditional jump instruction to contain a label that is not in the stream of Cluster instructions.
14:28:44
beach
Alternatively (for the immediate operand), I will have subclasses of IMMEDIATE-OPERAND that include the size (8, 16, 32, or 64 bits) and create a 64-bit operand with a value of 0. Then I will maintain a pointer to that operand and have it tracked as part of the translation procedure.
14:38:05
beach
Oh, wait. We concluded that there is no unconditional jump with an immediate 64-bit address, didn't we?
14:49:51
beach
Maybe we concluded that we need to free up a scratch register and use a 64-bit immediate load and a jump via a register.
14:51:58
beach
I can't figure out what "Jump near, absolute indirect, RIP = 64-bit offset from register or memory" means.
15:13:02
beach
So there is a hint where you can do "jmp far qword [rel next]" and then "next: resq ..."
15:16:13
beach
I take it to be a "far" jump so it uses a segment register, but perhaps there is a way to use the same as the current one.
15:26:03
beach
Either the RIP-relative indirect jump or the far jump seems to work, but I am not sure how.
15:28:04
Bike
well, as far as i understand, the "near" jump is relative to RIP, but the offset can be at most 32 bits. simple and should be adequate for jumping around within a function.
15:28:29
Bike
the indirect jumps load the address from wherever you specify. i don't see why that couldn't be the bytes immediately following the instruction, i guess?
15:51:13
beach
OK, so it looks like FF/4 JMP r/m64 can work. It's an indirect jump. The address in memory where the target address is located is specified with a modrm byte, so that could be specified to be the full word following the instruction. But it is not clear to me what the full word should contain, i.e., whether it is an absolute address, or an offset from some PC.
15:53:15
beach
I have read the relevant sections in the Intel manuals an uncountable number of times, like for implementing Cluster. But if a week goes by, I forget everything and have to read and understand it again.
15:53:39
Bike
in x86-64 there's no segmentation, so it should just be an absolute address, yeah? it says "absolute" in the table, too
15:54:53
beach
But that's a different instruction. That's the far jump. What I was just referring to is a near jump.
16:04:45
beach
OK, I think I got it (with the near indirect jump). Use a RIP-relative addressing which is Mod=00 R/M = 101 and a 32-bit displacement. The displacement is such that it points to the 64-bit word immediately following the instruction.
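Concretely, that encoding is the familiar PLT-style pattern: FF 25 followed by disp32 = 0, then the 64-bit absolute target in the word right after the instruction (RIP already points past the 6-byte instruction when the displacement is applied). A small emitter sketch, with invented names:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Emit: FF /4 (JMP r/m64), ModRM = 0x25 (Mod=00, reg=/4, R/M=101,
   i.e. RIP-relative), disp32 = 0, then the 8-byte absolute target.
   The jump loads its target from the word following the instruction. */
static size_t emit_indirect_jump(uint8_t *out, uint64_t target) {
    out[0] = 0xFF;                /* opcode */
    out[1] = 0x25;                /* ModRM: RIP-relative, /4 */
    memset(out + 2, 0, 4);        /* disp32 = 0 */
    memcpy(out + 6, &target, 8);  /* absolute target address */
    return 14;                    /* total bytes emitted */
}
```

The 8-byte slot is also a convenient full-word patch site for the unresolved-reference scheme discussed earlier.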