Can someone explain why parsers can't handle nested selectors which don't start with a symbol?
I've written plenty of parsers (recursive descent, packrat, Pratt, and using generators) and not a single one would have any trouble parsing something like:
```css
html {
  body:has(p) {
    width: 1000px;
  }
}
```
The only thing I can think of is that they are trying to avoid lookahead, i.e. a selector like `body:has(p) {` initially looks like it could be setting a `body` property to the value `has`, until you reach the `(`. But these lookaheads aren't hard to implement in practice. There are performance issues if the lookahead has to go too deep, but CSS developers could use the `&` in those cases as an optimization, and you could limit lookahead depth to something like 256 tokens (which would handle the vast majority of non-malicious use cases).
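Roughly what I mean, as a sketch in TypeScript (the token names and token-stream shape here are made up for illustration, not any engine's actual internals): peek ahead from the start of the statement until a top-level `{` (nested rule), a top-level `;` (declaration), or the depth cap.

```typescript
// Hypothetical token shape; a real CSS tokenizer differs in detail
// (e.g. "has(" would usually be a single function-token).
type Token = { type: string; value: string };

// Decide whether the statement starting at `pos` is a nested rule or a
// declaration by peeking ahead a bounded number of tokens.
function looksLikeNestedRule(
  tokens: Token[],
  pos: number,
  maxLookahead = 256,
): boolean {
  let depth = 0; // nesting of () / [] inside the peeked span
  for (let i = pos; i < tokens.length && i - pos < maxLookahead; i++) {
    const t = tokens[i];
    if (t.type === "open-paren" || t.type === "open-bracket") depth++;
    else if (t.type === "close-paren" || t.type === "close-bracket") depth--;
    else if (depth === 0) {
      if (t.type === "open-brace") return true; // "body:has(p) {" -> nested rule
      if (t.type === "semicolon") return false; // "color: red;"   -> declaration
    }
  }
  // Hit the cap (or end of input) without deciding; fall back to the
  // declaration interpretation, the case where authors could write `&`.
  return false;
}
```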
Perhaps even short lookaheads have a much more drastic effect on performance than I'm expecting, but I'd like to at least see some profiling of that, as it seems to me like this possibility was discarded without much consideration or explanation (it wasn't even included in the poll).
There's also another way to do this without lookaheads at all, by building up a structure for each potential path and then discarding the one that doesn't complete (I think this is basically a DFA but it's been a while since I read literature on this so I'm forgetting the terminology). This would probably avoid the performance issues but also probably be a larger departure from the current implementation of the parsers.
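A sketch of that "build both, discard one" idea (again with hypothetical token and node shapes, and ignoring complications like braces inside parentheses): both interpretations are constructed in a single pass over the same tokens, and whichever one reaches a well-formed end is kept.

```typescript
// Hypothetical token and node shapes, for illustration only.
type Token = { type: string; value: string };
type DeclarationNode = { kind: "declaration"; property: string; value: Token[] };
type RuleNode = { kind: "rule"; prelude: Token[] };

function parseAmbiguousStatement(
  tokens: Token[],
  pos: number,
): DeclarationNode | RuleNode | null {
  // Start building both structures from the same position.
  const decl: DeclarationNode = { kind: "declaration", property: "", value: [] };
  const rule: RuleNode = { kind: "rule", prelude: [] };
  const declAlive =
    tokens[pos]?.type === "ident" && tokens[pos + 1]?.type === "colon";
  if (declAlive) decl.property = tokens[pos].value;

  for (let i = pos; i < tokens.length; i++) {
    const t = tokens[i];
    // The rule path completes at a top-level "{"; the declaration path
    // completes at a ";" and is discarded if a "{" shows up first.
    if (t.type === "open-brace") return rule;
    if (t.type === "semicolon") return declAlive ? decl : null;
    rule.prelude.push(t);
    if (declAlive && i >= pos + 2) decl.value.push(t);
  }
  return declAlive ? decl : null; // a declaration may also end at EOF
}
```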
See https://github.com/w3c/csswg-drafts/issues/7961, which details some of the problems, including genuine and currently irreconcilable ambiguities, and goes into attempting to make it work by adding convoluted and complex rules and arbitrary lookahead in a way that I really strongly hope gets nixed because it’s awful. (And they’re really only trying to do this because they resolved to make & optional in #7834, which I think was where things started going wrong.)
Sorry, all the irreconcilable ambiguities are conflicts with other, not-yet-accepted proposals, with one exception: a conflict with JS tools that use JSON to spit out CSS. I'd argue that there are other syntax options for those other not-yet-accepted proposals, and this is likely to be a more-used feature than any I saw. And forcing some poorly-designed JS tools to come up with a minor workaround just isn't something I care about.
Is there any conflict with anything that exists in the standard right now?
The parsing time issue is "in the standard now". Even if no rule today matches `property: value { something in brackets }`, that's still at least three tokens (`property`, `:`, `value`) to read before bailing out on a bad "property", because values "shouldn't" have things in brackets. (You can build examples with complex CSS selectors where it becomes way more than three tokens, as well.)
CSS was designed to be "forgiving" of malformed input, so there's generally no "early" bailout the parser can make, even if it doesn't think it knows which property you are talking about and has a whitelist of specific properties it supports.
Indeed, the reason is simply to keep the current look-ahead(1) for the CSS syntax. If you allow nested rules with a selector list that doesn't start with a symbol, you need (in theory) an unbounded look-ahead.
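To illustrate the unbounded case (a rough sketch with a toy whitespace "tokenizer", not a real CSS tokenizer): the token that disambiguates a declaration from a nested rule (`;` vs. `{`) can be pushed arbitrarily far out by a longer selector list, so no fixed k in LL(k) is enough.

```typescript
// Count how many (whitespace-separated) tokens must be read before the
// first ";" or "{" decides declaration vs. nested rule.
function tokensUntilDisambiguation(source: string): number {
  const tokens = source.trim().split(/\s+/);
  let count = 0;
  for (const t of tokens) {
    count++;
    if (t.includes("{") || t.includes(";")) break;
  }
  return count;
}

// A short prelude disambiguates quickly...
console.log(tokensUntilDisambiguation("color : red ;")); // 4
// ...but a longer selector list pushes the deciding "{" arbitrarily far out.
console.log(
  tokensUntilDisambiguation("a:hover , b:focus , .c > .d , #e + f ~ g {"),
); // 14
```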
However, the current spec has been written in a way that doesn't prevent removing this restriction in the future. But the CSSWG would prefer to have some real-world profiling data to know the actual performance cost of removing it.