CWG3015 [lex.pptoken] Implementation divergence in _header-name_ tokenization for __has_include #690

Open
hubert-reinterpretcast opened this issue Mar 23, 2025 · 22 comments

Comments

@hubert-reinterpretcast
Collaborator

Full name of submitter (unless configured in github; will be published with the issue): Hubert Tong

Reference (section label): lex.pptoken, lex.phases, cpp.cond

Link to reflector thread (if any): N/A

Issue description:
The specification for __has_include is unclear on how any potential tokenization as a header-name occurs relative to has-include-expression syntax matching and macro expansion.

Consider: https://godbolt.org/z/ThK75Evb1

#define EMPTY
#define IGNORE(X)
#define stdio nosuch
#if __has_include(EMPTY <stdio.h>)
#error Hello! Header name formed!
#endif

Before macro expansion, it is potentially unknown whether <stdio.h> is within a has-include-expression: for example, if EMPTY instead expanded to <iostream>) IGNORE(, then <stdio.h> would end up as the argument of IGNORE rather than inside the has-include-expression. It is therefore unclear whether https://wg21.link/lex.pptoken#4.3.2 specifies that <stdio.h> in the above forms a header-name. If we take phase 3 of [lex.phases]/1 as separate from phase 4, then one potential interpretation is that <stdio.h> in the above is tokenized as a header-name, regardless of whether it is part of a has-include-expression after macro expansion, because it appears after __has_include( and before a paired ). It is also a valid interpretation that <stdio.h> is not tokenized as a header-name because it is not immediately after __has_include(.

There is implementation divergence: Clang and both MSVC preprocessor implementations form a header-name; GCC and EDG do not. It actually seems that Clang and MSVC do not maintain proper separation between phase 3 and phase 4.
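The divergence can be recorded as a compile-time value instead of an #error. The following is a minimal sketch, not part of the original report: the name HEADER_NAME_FORMED is hypothetical, and the probe assumes that <stdio.h> exists on the host and that no header named nosuch.h does.

```cpp
// Probe (illustrative only): is <stdio.h> tokenized as a header-name when it
// does not immediately follow "__has_include("?
#define EMPTY
#define stdio nosuch   // only matters if <stdio.h> is NOT lexed as a header-name

#if defined(__has_include)
#  if __has_include(EMPTY <stdio.h>)
#    define HEADER_NAME_FORMED 1   // header-name formed; <stdio.h> was found
#  else
#    define HEADER_NAME_FORMED 0   // separate tokens; stdio was macro-replaced
#  endif
#else
#  define HEADER_NAME_FORMED (-1)  // __has_include not available
#endif

// Do not leak the probe macros into the rest of the translation unit.
#undef stdio
#undef EMPTY

constexpr int header_name_formed = HEADER_NAME_FORMED;
```

Going by the results reported above, this value would be expected to be 1 with Clang and MSVC and 0 with GCC and EDG.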

Suggested resolution:
Replace https://wg21.link/lex.pptoken#4.3.2 to say:

  • in phase 3 of translation, immediately after a preprocessing token sequence of __has_include followed immediately by (.

See also https://wg21.link/CWG2190 (which may be NAD).
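One way to read the suggested wording is as a purely local check on the two preceding preprocessing tokens, with no knowledge of directives or macro expansion. The sketch below is a toy model of that reading, not compiler code: the function name and the sample token arrays are hypothetical, and whitespace between tokens is assumed to be irrelevant.

```cpp
#include <cstddef>
#include <string_view>

// Toy model: `prev` holds the spellings of the preprocessing tokens already
// lexed (whitespace dropped). Returns true if, under the suggested wording,
// the lexer would next attempt to form a header-name.
template <std::size_t N>
constexpr bool header_name_expected(const std::string_view (&prev)[N]) {
    return N >= 2 && prev[N - 2] == "__has_include" && prev[N - 1] == "(";
}

// #if __has_include(            -> header-name lexing applies
constexpr std::string_view direct[] = {"#", "if", "__has_include", "("};
// #if __has_include(EMPTY       -> not immediately after "(", so it does not
constexpr std::string_view after_macro[] = {"#", "if", "__has_include", "(", "EMPTY"};
// #define X __has_include (     -> applies as well; no directive restriction
constexpr std::string_view in_define[] = {"#", "define", "X", "__has_include", "("};

static_assert(header_name_expected(direct));
static_assert(!header_name_expected(after_macro));
static_assert(header_name_expected(in_define));
```

Under this reading, <stdio.h> in the EMPTY example above would not become a header-name, which matches the reported GCC and EDG behaviour.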

@jensmaurer
Member

jensmaurer commented Mar 23, 2025

We want to keep the lexer as simple as possible, thus I like the proposed resolution. (off topic: If we ever get f-literals, a "simple lexer" will likely be a thing of the past.)

The examples offered here that would be better handled by an alternative approach feel pathologically contrived.

@jensmaurer
Member

CWG3015

@jensmaurer jensmaurer changed the title [lex.pptoken] Implementation divergence in _header-name_ tokenization for __has_include CWG3015 [lex.pptoken] Implementation divergence in _header-name_ tokenization for __has_include Mar 23, 2025
@hubert-reinterpretcast
Collaborator Author

@jensmaurer,

The new text (and the parallel text it originated from) should at most be a note now.

The second form of has-embed-expression is considered only if the first form does not match, in which case the parenthesized preprocessing tokens are processed as if they were the pp-tokens in the second form of a #embed directive (15.4 [cpp.embed]).

It is normatively redundant or conflicting:

  • There is no grammar ambiguity between the two forms since the choice of preprocessing token occurs in normative text elsewhere.
  • https://wg21.link/cpp.cond#11 specifies macro replacement without regard to the delimiters of has-include-expressions, etc.
  • The identification of the preprocessing tokens corresponding to a has-include-expression, etc. can be taken to happen as late as https://wg21.link/cpp.cond#12 (this should be clarified).

@jensmaurer
Member

jensmaurer commented Mar 23, 2025

The identification of the preprocessing tokens corresponding to a has-include-expression, etc. can be taken to happen as late as https://wg21.link/cpp.cond#12 (this should be clarified).

Hm... Interesting.

#define X __has_include ( <stdio.h> )   // does this form a header-name here?
#define stdio nonesuch                  // ... in which case this #define would have no effect...
#if !X                                  // ... here
#error no <stdio.h>
#endif

gcc and clang believe that no header-name token is formed "early", consistent with the current wording (the special rule for __has_include( only applies in a #if or #elif directive).

So, yes, the "second form" rules should be struck for __has_include and __has_embed.

@hubert-reinterpretcast
Collaborator Author

gcc and clang believe that no header-name token is formed "early", consistent with the current wording (the special rule for __has_include( only applies in a #if or #elif directive).

The wording in the Working Draft leads to the logic that either the __has_include token is part of a has-include-expression and the header-name token has to be formed or the __has_include token is not part of a has-include-expression and the program is ill-formed under https://wg21.link/cpp.cond#8:

The identifiers __has_include and __has_cpp_attribute shall not appear in any context not mentioned in this subclause.

@jensmaurer
Member

Agreed. I've removed those "second form" rules as redundant with the grammar.
I've also added a half-sentence in [cpp.cond] that prevents another round of macro replacement for __has_embed.

CWG3015

@hubert-reinterpretcast
Collaborator Author

@jensmaurer,

With the rewrite, we always get UB from:

If the directive resulting after all replacements does not match the previous form, the behavior is undefined.

because the replacement produces a series of preprocessing tokens that would not consist of a single header-name.

The previous wording relied on an implicit interpretation of the resulting preprocessing tokens as raw source characters.

@hubert-reinterpretcast
Collaborator Author

For info: Implementations do not form header-name tokens when __has_include( appears within the arguments of a function-like macro. Following the logic from #690 (comment), the implementations are non-conforming because they produce no diagnostic.

#define IDENT(X) X
#define stdio nosuch
#if IDENT(__has_include(<stdio.h>))
#else
#error No header-name formed!
#endif

https://godbolt.org/z/E8vzEx1Wq

@jensmaurer
Member

jensmaurer commented Mar 23, 2025

With the rewrite, we always get UB from: ...

I thought that was addressed by this sentence:

The method by which a sequence of preprocessing tokens between a < and a > preprocessing token pair or a
pair of " characters is combined into a single header name preprocessing token is implementation-defined.

which conveys the expectation that the character sequence is merged into a header-name somehow.

Do you want to make this more explicit / more integrated in the "go back to the first form" delegation?

Implementations do not form header-name tokens when __has_include( appears within the arguments of a function-like macro.

With the new wording, I think those implementations are non-conforming, because [lex.pptoken] considers only preprocessing tokens (in phase 3), not phase 4 grammar entities such as has-embed-expressions. Anywhere you have __has_include ( something, "something" is converted into a header-name if possible. As a second check, __has_include might be ill-formed when it appears in an untoward location.

@hubert-reinterpretcast
Collaborator Author

Do you want to make this more explicit / more integrated in the "go back to the first form" delegation?

Yes. Also, the status quo technically establishes UB before any implementation-defined formation of header-name tokens (e.g., if there is a raw string containing a newline, UB exists).

As a second check, __has_include might be ill-formed when it appears in an untoward location.

That's an area that lacks clarity (but not too much of a problem). The current interpretation taken by implementations is that __has_include is fine in macro definitions and during macro expansion and is only a problem if it survives macro expansion to end up "in the wrong place".

An observation with the new wording is that, under the interpretation that __has_include is ill-formed only if it survives macro expansion, it is now possible for function-like macro arguments to contain header-name tokens. This introduces a need to update https://wg21.link/cpp.stringize.

@jensmaurer
Member

Do you want to make this more explicit / more integrated in the "go back to the first form" delegation?

I've (hopefully) addressed that: CWG3015

This introduces a need to update https://wg21.link/cpp.stringize.

I'm not seeing that.

If, in the replacement list, a parameter is immediately
preceded by a # preprocessing token, both are replaced by a single character string literal preprocessing token
that contains the spelling of the preprocessing token sequence for the corresponding argument (excluding
placemarker tokens).

This seems to apply to header-name preprocessing tokens all the same.

@hubert-reinterpretcast
Collaborator Author

This seems to apply to header-name preprocessing tokens all the same.

See:

Otherwise, the original spelling of each preprocessing token in the stringizing argument is retained in the character string literal, except for special handling for producing the spelling of string-literals and character-literals: a \ character is inserted before each " and \ character of a character-literal or string-literal (including the delimiting " characters).

The spelling of header-name preprocessing tokens can also contain \ and " characters.
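For reference, the escaping rule quoted above can be observed on arguments that are not header-names. This is a minimal sketch with a hypothetical STR macro: it shows the existing behaviour for a string-literal argument and for <stdio.h> lexed as five ordinary tokens, and it does not show a header-name token inside a macro argument, which is the case under discussion.

```cpp
#include <string_view>

#define STR(x) #x

// String-literal argument: a \ is inserted before each " and \ of the token,
// so the stringized result spells the original token: " a \ n "
constexpr std::string_view quoted = STR("a\n");
static_assert(quoted == "\"a\\n\"");
static_assert(quoted.size() == 5);

// Here <stdio.h> is not a header-name but the tokens < stdio . h >, and no
// escaping applies to any of them.
constexpr std::string_view angled = STR(<stdio.h>);
static_assert(angled == "<stdio.h>");
```

A header-name token spelled with a \ or " inside its delimiters is neither a string-literal nor a character-literal, so the quoted rule as written would insert no escapes for it.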

@jensmaurer
Member

Right. Fixed: CWG3015

@hubert-reinterpretcast
Collaborator Author

I've (hopefully) addressed that: CWG3015

I think it needs an additional:

[ ... ] or if the header-name formed is not a valid header-name, the behavior is undefined.

because the implementation-defined method could "succeed" in forming an ill-formed header-name.

An example of such a header-name in the wild:

<"R(
)">

@jensmaurer
Member

I don't understand. The suggested wording says "an attempt is made to form..." Obviously, something having newlines in it is not a header-name per the lexer grammar in [lex.header], thus the "attempt to form" fails, we retain the non-header-name preprocessing tokens, and the match against the first form fails.

Saying "if a template-head is not a syntactically valid template-head" feels very much redundant.

@hubert-reinterpretcast
Collaborator Author

hubert-reinterpretcast commented Mar 24, 2025

thus the "attempt to form" fails, we retain the non-header-name preprocessing tokens, and the match against the first form fails

Let's try an alternative scenario. While I am unaware of any implementation that does so, under the new wording the "implementation-defined method" can take a string "\"" and give you <">. Under the old wording, there was unavoidable UB.

I guess my wording attempt does not cover this case though.

@jensmaurer
Member

jensmaurer commented Mar 24, 2025

under the new wording the "implementation-defined method" can take a string "\"" and give you <">. Under the old wording, there was unavoidable UB.

I'd be happy to remove the "implementation-defined method"; I don't know what it's intended to do other than allow strange implementations such as the one you offered in your last comment.
(I'd like to point out that the "implementation-defined method" wording exists in the status quo.)

So, we're looking at

#define S "\""   // string-literal
#include S        // undefined behavior, because the _header-name_ is "\"  (single backslash within double-quotes) followed by a lone "

except that [lex.header] p2 says the backslash is conditionally-supported. However, that paragraph does not seem to permit converting S (in the example above) to a header-name representing a source file whose name is " (because lexing the header-name must stop at the second ", regardless of any backslashes).

@hubert-reinterpretcast
Collaborator Author

I'd be happy to remove the "implementation-defined method"; I don't know what it's intended to do other than allow strange implementations such as the one you offered in your last comment.

I believe it is there to allow for differences in the treatment of whitespace between tokens.

Alisdair in https://wg21.link/p2843 ascribed some behaviour where the source of the tokens was significant to this implementation-definedness, but I suspect implementation bugs.

I believe the condition we want to apply is that the character sequence of the spellings of the tokens must syntactically be a header-name.

@jensmaurer
Member

I believe the condition we want to apply is that the character sequence of the spellings of the tokens must syntactically be a header-name.

I also noticed we don't actually branch off to the first form; we just say "must be of that form", but we never said "and then we do as in the first form".

I now use "characters of the spellings", which is wording already existing in [cpp.stringize]. We still "attempt to form" a header-name from those characters; if we don't get a match, the formation fails and we bail (UB or ill-formed). Otherwise, we re-process as in the first form.

CWG3015

@hubert-reinterpretcast
Collaborator Author

hubert-reinterpretcast commented Mar 24, 2025

The current interpretation taken by implementations is that __has_include is fine in macro definitions and during macro expansion and is only a problem if it survives macro expansion to end up "in the wrong place".

@jensmaurer, under the quoted interpretation, having __has_include, etc. within the limit parameter of #embed is okay. All that needs to be done (wording-wise) is to make the lexing choose the header-name interpretation also within #embed directives. If the __has_include is in a prefix parameter, then it becomes ill-formed as part of the result of #embed. If it is part of an unused if_empty parameter, then it gets safely ignored.

Accepted by implementations (https://godbolt.org/z/TsxascPM3):

#define stdio nosuch
#if __has_embed(__FILE__ limit(\
  __has_include(<stdio.h>) ? 0 : 1\
)) == __STDC_EMBED_EMPTY__
#error Header name formed!
#endif

@hubert-reinterpretcast
Collaborator Author

if we don't get a match, the formation fails and we bail (UB or ill-formed)

To avoid implementation-defined shenanigans regarding a match, I suggest:

The method by which a sequence of preprocessing tokens [ ... ] is combined [ ... ] is implementation-defined; however, the attempt shall succeed only if the characters of the spellings of the sequence form, respectively, a h-char-sequence or a q-char-sequence.
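The syntactic condition in that suggestion can be sketched as a check on the concatenated spellings. This is only an illustration of the [lex.header] grammar (an h-char is any character except new-line and >, a q-char is any character except new-line and "); the function name is hypothetical, and the sketch takes no position on how whitespace between tokens is treated.

```cpp
#include <string_view>

// Returns true if `s` is syntactically a header-name:
//   < h-char-sequence >   or   " q-char-sequence "
constexpr bool is_header_name_spelling(std::string_view s) {
    if (s.size() < 3) return false;  // two delimiters plus a non-empty sequence
    const char open = s.front();
    if (open != '<' && open != '"') return false;
    const char close = (open == '<') ? '>' : '"';
    if (s.back() != close) return false;
    // No new-line and no closing delimiter may appear between the delimiters.
    for (char c : s.substr(1, s.size() - 2)) {
        if (c == '\n' || c == close) return false;
    }
    return true;
}

static_assert(is_header_name_spelling("<stdio.h>"));
static_assert(is_header_name_spelling("\"a.h\""));
static_assert(is_header_name_spelling("<\">"));          // <"> is syntactically fine
static_assert(!is_header_name_spelling("\"\\\"\""));     // "\"" : lexing stops at the second "
static_assert(!is_header_name_spelling("<\"R(\n)\">"));  // contains a new-line
```

With such a check, the spelling of the string-literal "\"" could not be accepted as a header-name, whatever else the implementation-defined method does.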

@jensmaurer
Member

jensmaurer commented Mar 25, 2025

All that needs to be done (wording-wise) is to make the lexing choose the header-name interpretation also within #embed directives.

Yes. Turns out it's not a header-name in this slight alteration:

#define X __has_include(<stdio.h>)
#define stdio nosuch
#if __has_embed(__FILE__ limit(\
  X ? 0 : 1\
)) == __STDC_EMBED_EMPTY__
#error Header name formed!
#endif

I find this mildly odd, but it's consistent with #if and #embed environments being "different" in a number of ways.

Wording in [lex.pptoken] fixed: CWG3015

To avoid implementation-defined shenanigans regarding a match, I suggest:

If the implementation leeway is just for the (non-)consideration of whitespace, we should directly say so.
Added "; the treatment of whitespace is implementation-defined." in CWG3015 for #include and #embed and removed the more general implementation-defined blurb.

The method by which a sequence of preprocessing tokens between a < and a > preprocessing token pair or a pair of " characters is combined ...

This is broken anyway, because a header-name is the entire thing including the delimiters (< or "), but this phrasing makes it sound like we're only dealing with the stuff inside the delimiters.


2 participants