-
Notifications
You must be signed in to change notification settings - Fork 7
CWG3015 [lex.pptoken] Implementation divergence in _header-name_ tokenization for __has_include
#690
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
We want to keep the lexer as simple as possible, thus I like the proposed resolution. (off topic: If we ever get f-literals, a "simple lexer" will likely be a thing of the past.) The examples offered here that would be better handled by an alternative approach feel pathologically contrived. |
__has_include
__has_include
The new text (and the parallel text it originated from) should at most be a note now.
It is normatively redundant or conflicting:
|
Hm... Interesting.
gcc and clang believe that no header-name token is formed "early", consistent with the current wording (the special rule for So, yes, the "second form" rules should be struck for |
The wording in the Working Draft leads to the logic that either the
|
Agreed. I've removed those "second form" rules as redundant with the grammar. |
With the rewrite, we always get UB from:
because the replacement produces a series of preprocessing tokens that would not consist of a single header-name. The previous wording relied on an implicit interpretation of the resulting preprocessing tokens as raw source characters. |
For info: Implementations do not form header-name tokens when #define IDENT(X) X
#define stdio nosuch
#if IDENT(__has_include(<stdio.h>))
#else
#error No header-name formed!
#endif |
I thought that was addressed by this sentence:
which conveys the expectation that the character sequence is merged into a header-name somehow. Do you want to make this more explicit / more integrated in the "go back to the first form" delegation?
With the new wording, I think those implementations are non-conforming, because [lex.pptoken] considers only preprocessing tokens (in phase 3), not phase 4 grammar entities such as has-embed-expressions. Anywhere you have |
Yes. Also, the status quo technically establishes UB before any implementation-defined formation of header-name tokens (e.g., if there is a raw string containing a newline, UB exists).
That's an area that lacks clarity (but not too much of a problem). The current interpretation taken by implementations is that An observation with the new wording is that, under the ill-formed only past macro-expansion interpretation, it is now possible for function-like macro arguments to contain header-name tokens. This introduces a need to update https://wg21.link/cpp.stringize. |
I've (hopefully) addressed that: CWG3015
I'm not seeing that.
This seems to apply to header-name preprocessing tokens all the same. |
See:
The spelling of header-name preprocessing tokens can also contain |
Right. Fixed: CWG3015 |
I think it needs an additional:
because the implementation-defined method could "succeed" in forming an ill-formed header-name. An example of such header-names in the wild include
|
I don't understand. The suggested wording says "an attempt is made to form..." Obviously, something having newlines in it is not a header-name per the lexer grammar in [lex.header], thus the "attempt to form" fails, we retain the non-header-name preprocessing tokens, and the match against the first form fails. Saying "if a template-head is not a syntactically valid template-head" feels very much redundant. |
Let's try an alternative scenario. While I am unaware of any implementation that does so, under the new wording the "implementation-defined method" can take a string I guess my wording attempt does not cover this case though. |
I'd be happy to remove the "implementation-defined method"; I don't know what it's intended to do other than allow strange implementations such as the one you offered in your last comment. So, we're looking at
except that [lex.header] p2 says the backslash is conditionally-supported. However, that paragraph does not seem to permit converting |
I believe it is there to allow for differences in the treatment of whitespace between tokens. Alisdair in https://wg21.link/p2843 ascribed some behaviour where the source of the tokens were significant to this implementation-definedness, but I suspect implementation bugs. I believe the condition we want to apply is that the characters sequence of the spellings of the tokens must syntactically be a header-name. |
I also noticed we don't actually branch off to the first form; we just say "must of of that form", but we never said "and then we do as in the first form". I now use "characters of the spellings", which is wording already existing in [cpp.stringize]. We still "attempt to form" a header-name from those characters; if we don't get a match, the formation fails and we bail (UB or ill-formed). Otherwise, we re-process as in the first form. |
@jensmaurer, under the quoted interpretation, having Accepted by implementations (https://godbolt.org/z/TsxascPM3): #define stdio nosuch
#if __has_embed(__FILE__ limit(\
__has_include(<stdio.h>) ? 0 : 1\
)) == __STDC_EMBED_EMPTY__
#error Header name formed!
#endif |
To avoid implementation-defined shenanigans regarding a match, I suggest:
|
Yes. Turns out it's not a header-name in this slight alteration:
I find this mildly odd, but it's consistent with Wording in [lex.pptoken] fixed: CWG3015
If the implementation leeway is just for the (non-)consideration of whitespace, we should directly say so.
This is broken anyway, because a header-name is the entire thing including the delimiters (< or "), but this phrasing makes it sound like we're only dealing with the stuff inside the delimiters. |
Full name of submitter (unless configured in github; will be published with the issue): Hubert Tong
Reference (section label): lex.pptoken, lex.phases, cpp.cond
Link to reflector thread (if any): N/A
Issue description:
The specification for
__has_include
is unclear on how any potential tokenization as a header-name occurs relative to has-include-expression syntax matching and macro expansion.Consider: https://godbolt.org/z/ThK75Evb1
Given that, before macro expansion, its is potentially unknown whether
<stdio.h>
is within a has-include-expression (e.g., ifEMPTY
expands to<iostream>) IGNORE(
), it is unclear whether https://wg21.link/lex.pptoken#4.3.2 specifies that<stdio.h>
in the above forms a header-name. If we take phase 3 of [lex.phases]/1 as separate from phase 4, then a potential interpretation is that<stdio.h>
in the above is tokenized as a header-name regardless of whether it is part of a has-include-expression after macro expansion because it appears after__has_include(
and before a paired)
. It is also a valid interpretation that<stdio.h>
is not tokenized as a header-name because it is not the immediately after__has_include(
.There is implementation divergence: Clang and both MSVC preprocessor implementations form a header-name; GCC and EDG do not. It actually seems that Clang and MSVC do not maintain proper separation between phase 3 and phase 4.
Suggested resolution:
Replace https://wg21.link/lex.pptoken#4.3.2 to say:
See also https://wg21.link/CWG2190 (which may be NAD).
The text was updated successfully, but these errors were encountered: