libera/#sicl - IRC Chatlog
9:22:18
moon-child
though I am partial to fuzxxl's take on risc vs cisc: that the former is no longer extant
9:23:03
hayley
Apparently the textbook also states that, on average, you tend to achieve one cycle per instruction. That seems like a low bar with superscalar execution, and slightly optimistic without.
9:24:36
hayley
Henry Baker once said efficient run-time checks were harder to optimize on "untagged RISC architectures", but given that the lecturers state you need a language like C or C++ or FORTRAN to burn that well, I think it is a moot point. (:
9:26:11
hayley
We got to hear a war story about writing assembler to use both the U and V pipes on the original Pentium though.
9:32:53
beach
If the CISC is very regular, like the VAX, it can be a bit easier to write a compiler and to write directly in assembler.
9:34:05
beach
Also, on older CISCs, a register operation and a memory operation took roughly the same time, so there was not that issue back then.
9:37:28
hayley
Baker once wrote "Today's untagged RISC architectures put the burden on optimizing compilers to generate efficient run-time array-bounds and pointer checks", which is the only statement I had ever read suggesting compiling to a RISC was harder.
9:38:39
hayley
According to the lecturers, you still can only use C or FORTRAN to write sufficiently fast numeric code, and implementations of those languages do not do bounds checking. By that logic, there isn't anything hard about compiling to a RISC.
9:38:57
beach
So it is not the run-time checks that are optimized, but the code in general, due to the fact that more run-time checks are necessary.
9:45:33
hayley
To generate fast code. My classmates and I sometimes speak of "burning" through especially fast code.
9:46:35
hayley
* hayley is still thinking about assignment work, and hasn't yet switched contexts apparently.
9:47:43
beach
So was "well" meant to be an adverb, and "that" meant to be a relative pronoun not followed by a noun phrase?
9:48:21
hayley
If fast code can only be written in the aforementioned languages, and implementations of those languages don't perform bounds checking, then surely optimizing out bounds checks is not a large concern while writing a compiler.
9:49:32
hayley
"well" was used as an adverb, and I don't think "that" was used as a relative pronoun. It was used like "that" in "I can't ride a bicycle that well".
9:54:50
beach
We need to teach you how to write in a sufficiently unambiguous way that people like me (who require very precise phrases to get it) can understand your writing.
9:54:55
hayley
It was a strange (but predictable) thing to hear from the lecturers, given that I found my regular expression compiler (in pure Common Lisp, no SIMD trickery yet) to scan at a throughput of about 4 cycles per character.
9:56:16
beach
And in #commonlisp, you never answered my question about comparison with other programs for regular expressions. For performance I mean. I have no idea what would be considered "normal".
9:57:31
yitzi
beach: Seems like SICL has two printing packages, Incless and the mostly empty sicl-printer. Can you tell me a little bit about what is going on there? Incless defines pprint, so that seems like where the pretty printer should go, but I suppose it could be put in its own package and sicl-printer could fuse the two.
9:59:19
beach
It would depend a bit on how sophisticated the pretty-printer architecture needs to be.
10:00:58
beach
yitzi: My immediate reaction would be to make it a module, either in the same repository or in a separate one. Either way, I think they are fairly independent.
10:01:23
hayley
The main performance improvements are that I convert regular expressions to a DFA and compile the DFA to a single function, whereas cl-ppcre executes a chain of closures generated from an NFA, and I specialise compiled functions to array types.
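The shape of the compile-the-DFA-to-one-function approach can be sketched in C (an invented example for the regex ab*c, not hayley's actual compiler): each DFA state becomes a label and each transition a goto, so matching is a tight loop with no chain of closure calls.

```c
#include <stddef.h>

/* Hypothetical compiled form of the DFA for "ab*c": one function,
   one label per state, one goto per transition.  Returns 1 as soon
   as a prefix of s matches, 0 otherwise. */
static int match_abc(const char *s, size_t n) {
    size_t i = 0;
s0: if (i >= n) return 0;
    if (s[i++] == 'a') goto s1;
    return 0;
s1: if (i >= n) return 0;
    switch (s[i++]) {
    case 'b': goto s1;       /* stay in the b* loop */
    case 'c': return 1;      /* accepting transition */
    default:  return 0;
    }
}
```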
10:02:09
beach
hayley: Sort of like the difference between the abandoned HIR compiler and the HIR evaluator.
10:02:15
hayley
Silly question, would there be a case for rewriting XP, like phoe did with Pitman's portable condition system, or is XP that bad?
10:03:27
beach
And I have the suspicion that it may need some facilities for client configuration, like our other libraries do.
10:07:07
hayley
I am sure my compiler could end up providing huge PROG* forms to the Common Lisp compiler if it tried to compile complex regular expressions.
10:11:26
hayley
So I've been looking into lightweight ways of cutting down code size (which have to be fast, else they might be slower than using the CL compiler for the same optimizations). One way is a modification to gilberth's derivative technique which makes "duplication" of submatch registers explicit, and avoids duplication where possible.
10:13:33
scymtym
hayley: regarding not being able to do anything about branching: pkhuong had this trick for runs of characters: https://github.com/pkhuong/string-case/blob/master/string-case.lisp#L458 (link is to generated code, generator and rationale are in the same file)
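A loose C sketch of the idea behind that trick (not the generated code at the link): load a run of known characters as one machine word, XOR with the expected constant, and branch once on the combined result instead of once per character. The literal "hello" and 8-bit characters are assumptions of this sketch.

```c
#include <stdint.h>
#include <string.h>

/* Test a 5-character run with a single compare: any mismatching
   byte leaves a nonzero bit in the XOR, so one branch suffices. */
static int run_is_hello(const char *s) {
    uint64_t w = 0, expect = 0;
    memcpy(&w, s, 5);            /* load 5 bytes of input */
    memcpy(&expect, "hello", 5); /* the expected run */
    return (w ^ expect) == 0;
}
```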
10:14:24
hayley
The resulting equations for computing derivatives then sort of resemble Linear Lisp, or that sort of linear logic programming. So I call the IR "linear tagged regular expressions".
10:15:59
hayley
scymtym: Nice! I was considering something like it, as the SIMD search code would do something similar, but wasn't sure if it was a good idea. Thanks for the confirmation.
10:17:06
scymtym
hayley: sure. i don't know how applicable the technique is to unknown CHARACTER subtypes and portable code, though
10:17:43
scymtym
but i guess you could (maybe do) always compile a specific version for the CHARACTER subtype at hand
10:18:34
hayley
Right, I can see that (logxor (char-code a) (char-code b)) is not quite the same as (logxor a b). And we want the latter really.
10:19:47
hayley
I probably will generate code for each possible vector type which could store characters, but I suspect most will be SIMPLE-STRINGs.
10:32:00
hayley
It seems LLVM knows this trick, but neither I nor LLVM can think of a way to test if a character is equal to one of two other characters.
10:37:03
hayley
LLVM emits a CMP instruction, then loads the Equal flag into a register with SETE, then another CMP and SETE, then takes the logical OR of both flags. This procedure at least avoids branches, though.
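There is one well-known special case worth noting (it is not a general answer to the two-character test): when the two target characters differ in exactly one bit, as 'a' (0x61) and 'A' (0x41) do, a single masked compare tests membership in the pair.

```c
/* 'a' ^ 'A' == 0x20, a single bit, so OR-ing that bit in maps both
   characters -- and only those two -- onto 'a'. */
static int is_a_or_A(unsigned char c) {
    return (c | 0x20) == 'a';
}
```

The same trick underlies ASCII case folding; for two arbitrary characters whose codes differ in more than one bit it does not apply.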
10:37:34
scymtym
there are two axes: simple vs. non-simple and the CHARACTER subtype. the latter determines how many characters can (at most) be tested in one step
10:39:48
scymtym
no, i don't think it is worth the effort. but you would probably have to handle different character widths anyway
10:41:15
scymtym
anyway, i'm not sure the technique is useful in your case. i just thought it might be of interest to you
13:26:01
beach
Bike: Wouldn't it be the case that most functions at the AST level would appear as the CALLEE-AST of a CALL-AST? And those functions can't escape then.
13:28:52
beach
I thought they would be in the paper, but I did a quick scan and couldn't find anything.
13:42:46
beach
For the others, I guess it is harder at the AST level to track what becomes of them, since there is not the flow aspect of HIR.
13:46:01
beach
Well, this is not something I am planning to implement right now anyway, so I'll just keep it in mind.
13:46:34
beach
It would be interesting to do some quick performance comparisons between code executed by the AST evaluator and code executed by the HIR evaluator.
13:50:41
beach
What I should do now is to think about code generation. It appears that Cluster needs to handle something that is a bit like object code, i.e., in which certain references are not fully resolved. I am obviously not going to do that in the form of symbol tables stored in a file.
13:52:15
beach
I first thought I could get away with using Cluster "labels", but those are at the granularity of an instruction, so the patching code would then need to know where the operands of an instruction are located, which becomes messy.
13:53:35
beach
So I will try to work out an in-memory data structure for representing "object code" and some protocols for managing such code.
13:56:01
beach
The data structure could be as simple as a code vector of (unsigned-byte 8) and a list of pairs (identity . offset).
13:57:57
beach
There may have to be some size information too at some point, but I am pretty sure that we do only full-word patches, at least initially.
14:03:54
beach
I guess the code will always be only partially resolved because of call sites that can change at run time.
14:04:38
beach
So there will be no separate "object code" representation from the "fully resolved" representation.
14:10:26
beach
We probably also need to represent both "references" and "definitions" just like other object code does. So the simple idea with pairs might become a tad more complex.
14:16:17
beach
I think there are only two kinds of things to patch: 1. The target address of an unconditional jump, and 2. A full-word immediate value to load into a register.
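The byte vector plus (identity . offset) pairs that beach describes could be sketched in C roughly as follows; the names and the patch-kind enum are illustrative, not SICL's, and the two kinds correspond to the two patch targets listed above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative in-memory "object code": a byte vector plus a list
   of unresolved references, each naming what it refers to and the
   offset of the full-word hole to patch. */
enum patch_kind { PATCH_JUMP_TARGET, PATCH_IMMEDIATE };

struct patch {
    void           *identity;  /* what the hole refers to */
    size_t          offset;    /* byte offset of the 64-bit hole */
    enum patch_kind kind;
};

struct object_code {
    uint8_t      *bytes;
    size_t        size;
    struct patch *patches;
    size_t        npatches;
};

/* Resolve patch i by writing a full 64-bit word at its offset. */
static void apply_patch(struct object_code *oc, size_t i, uint64_t value) {
    memcpy(oc->bytes + oc->patches[i].offset, &value, 8);
}
```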
14:23:45
beach
So I guess I will allow IMMEDIATE-OPERAND to take a value that is something other than an integer, and such an operand will match a descriptor with a 64-bit immediate operand. And I will allow the unconditional jump instruction to contain a label that is not in the stream of Cluster instructions.
14:28:44
beach
Alternatively (for the immediate operand), I will have subclasses of IMMEDIATE-OPERAND that include the size (8, 16, 32, or 64 bits) and create a 64-bit operand with a value of 0. Then I will maintain a pointer to that operand and have it tracked as part of the translation procedure.
14:38:05
beach
Oh, wait. We concluded that there is no unconditional jump with an immediate 64-bit address, didn't we?
14:49:51
beach
Maybe we concluded that we need to free up a scratch register and use a 64-bit immediate load and a jump via a register.
14:51:58
beach
I can't figure out what "Jump near, absolute indirect, RIP = 64-bit offset from register or memory" means.
15:13:02
beach
So there is a hint where you can do "jmp far qword [rel next]" and then "next: resq ..."
15:16:13
beach
I take it to be a "far" jump so it uses a segment register, but perhaps there is a way to use the same as the current one.
15:26:03
beach
Either the RIP-relative indirect jump or the far jump seems to work, but I am not sure how.
15:28:04
Bike
well, as far as i understand, the "near" jump is relative to RIP, but the offset can be at most 32 bits. simple and should be adequate for jumping around within a function.
15:28:29
Bike
the indirect jumps load the address from wherever you specify. i don't see why that couldn't be the bytes immediately following the instruction, i guess?
15:51:13
beach
OK, so it looks like FF/4 JMP r/m64 can work. It's an indirect jump. The address in memory where the target address is located is specified with a modrm byte, so that could be specified to be the full word following the instruction. But it is not clear to me what the full word should contain, i.e., whether it is an absolute address, or an offset from some PC.
15:53:15
beach
I have read the relevant sections in the Intel manuals an uncountable number of times, like for implementing Cluster. But if a week goes by, I forget everything and have to read and understand it again.
15:53:39
Bike
in x86-64 there's no segmentation, so it should just be an absolute address, yeah? it says "absolute" in the table, too
15:54:53
beach
But that's a different instruction. That's the far jump. What I was just referring to is a near jump.
16:04:45
beach
OK, I think I got it (with the near indirect jump). Use a RIP-relative addressing which is Mod=00 R/M = 101 and a 32-bit displacement. The displacement is such that it points to the 64-bit word immediately following the instruction.
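Concretely, that encoding is the familiar PLT-style pattern: FF 25 followed by disp32 = 0, then the 64-bit absolute target in the word right after the instruction (RIP already points past the 6-byte instruction when the displacement is applied). A small emitter sketch, with invented names:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Emit: FF /4 (JMP r/m64), ModRM = 0x25 (Mod=00, reg=/4, R/M=101,
   i.e. RIP-relative), disp32 = 0, then the 8-byte absolute target.
   The jump loads its target from the word following the instruction. */
static size_t emit_indirect_jump(uint8_t *out, uint64_t target) {
    out[0] = 0xFF;                /* opcode */
    out[1] = 0x25;                /* ModRM: RIP-relative, /4 */
    memset(out + 2, 0, 4);        /* disp32 = 0 */
    memcpy(out + 6, &target, 8);  /* absolute target address */
    return 14;                    /* total bytes emitted */
}
```

The 8-byte slot is also a convenient full-word patch site for the unresolved-reference scheme discussed earlier.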