CWG3015 [lex.pptoken] Implementation divergence in _header-name_ tokenization for `__has_include` #690

hubert-reinterpretcast · 2025-03-23T03:53:17Z

Full name of submitter (unless configured in github; will be published with the issue): Hubert Tong

Reference (section label): lex.pptoken, lex.phases, cpp.cond

Link to reflector thread (if any): N/A

Issue description:
The specification for __has_include is unclear on how any potential tokenization as a header-name occurs relative to has-include-expression syntax matching and macro expansion.

Consider: https://godbolt.org/z/ThK75Evb1

#define EMPTY
#define IGNORE(X)
#define stdio nosuch
#if __has_include(EMPTY <stdio.h>)
#error Hello! Header name formed!
#endif

Given that, before macro expansion, it is potentially unknown whether <stdio.h> is within a has-include-expression (e.g., if EMPTY expands to `<iostream>) IGNORE(`), it is unclear whether https://wg21.link/lex.pptoken#4.3.2 specifies that <stdio.h> in the above forms a header-name. If we take phase 3 of [lex.phases]/1 as separate from phase 4, then a potential interpretation is that <stdio.h> in the above is tokenized as a header-name regardless of whether it is part of a has-include-expression after macro expansion, because it appears after __has_include( and before a paired ). It is also a valid interpretation that <stdio.h> is not tokenized as a header-name because it is not immediately after __has_include(.

There is implementation divergence: Clang and both MSVC preprocessor implementations form a header-name; GCC and EDG do not. It actually seems that Clang and MSVC do not maintain proper separation between phase 3 and phase 4.
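The divergence can be observed without stopping compilation. The following is a minimal sketch, assuming that <stdio.h> can be found and that no header named nosuch.h exists; the names HEADER_NAME_FORMED and kHeaderNameFormed are invented for this illustration.

```cpp
// Probe: does this compiler lex <stdio.h> as one header-name in the
// example from the issue? All probe names here are hypothetical.
#define EMPTY
#define stdio nosuch   // only has an effect if stdio is lexed as an identifier

#if __has_include(EMPTY <stdio.h>)
#  define HEADER_NAME_FORMED 1   // <stdio.h> stayed intact (reported for Clang, MSVC)
#else
#  define HEADER_NAME_FORMED 0   // tokens became < nosuch . h > (reported for GCC, EDG)
#endif

#undef stdio
#undef EMPTY

// Make the preprocessor's answer available to ordinary code.
constexpr bool kHeaderNameFormed = HEADER_NAME_FORMED != 0;
```

Either value is consistent with one of the two readings described above, so the sketch records the outcome rather than asserting one.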

Suggested resolution:
Replace https://wg21.link/lex.pptoken#4.3.2 to say:

in phase 3 of translation, immediately after a preprocessing token sequence of __has_include followed immediately by (.

See also https://wg21.link/CWG2190 (which may be NAD).
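As a rough illustration of the suggested rule, the decision depends only on the two preprocessing tokens lexed immediately before the current position. The sketch below is not taken from any implementation; the function names and the representation of tokens as C strings are assumptions, and the other header-name contexts (such as #include) are deliberately left out.

```cpp
// Helper invented for this sketch: constexpr comparison of two C strings.
constexpr bool same(const char* a, const char* b) {
    while (*a != '\0' && *a == *b) { ++a; ++b; }
    return *a == *b;
}

// prev2 and prev1 are the spellings of the two preprocessing tokens lexed
// immediately before the current position (prev1 is the most recent one).
// Returns true if, under the suggested wording, a header-name may be formed
// at the current position because of a preceding __has_include followed by (.
constexpr bool header_name_allowed_after(const char* prev2, const char* prev1) {
    return same(prev2, "__has_include") && same(prev1, "(");
}
```

Under this reading, the `<` in `__has_include(EMPTY <stdio.h>)` follows the tokens `(` and `EMPTY`, so no header-name would be formed there.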

jensmaurer · 2025-03-23T07:34:05Z

We want to keep the lexer as simple as possible, thus I like the proposed resolution. (off topic: If we ever get f-literals, a "simple lexer" will likely be a thing of the past.)

The examples offered here that would be better handled by an alternative approach feel pathologically contrived.

jensmaurer · 2025-03-23T13:25:45Z

CWG3015

hubert-reinterpretcast · 2025-03-23T13:44:11Z

@jensmaurer,

The new text (and the parallel text it originated from) should at most be a note now.

The second form of has-embed-expression is considered only if the first form does not match, in which case the parenthesized preprocessing tokens are processed as if they were the pp-tokens in the second form of a #embed directive (15.4 [cpp.embed]).

It is normatively redundant or conflicting:

There is no grammar ambiguity between the two forms since the choice of preprocessing token occurs in normative text elsewhere.
https://wg21.link/cpp.cond#11 specifies macro replacement without regard to the delimiters of has-include-expressions, etc.
The identification of the preprocessing tokens corresponding to a has-include-expression, etc. can be taken to happen as late as https://wg21.link/cpp.cond#12 (this should be clarified).

jensmaurer · 2025-03-23T14:10:54Z

The identification of the preprocessing tokens corresponding to a has-include-expression, etc. can be taken to happen as late as https://wg21.link/cpp.cond#12 (this should be clarified).

Hm... Interesting.

#define X __has_include ( <stdio.h> )   // does this form a header-name here?
#define stdio nonesuch              // ... in which case this #define would have no effect...
#if !X                                       // ... here
#error no <stdio.h> 
#endif

gcc and clang believe that no header-name token is formed "early", consistent with the current wording (the special rule for __has_include( only applies in a #if or #elif directive).

So, yes, the "second form" rules should be struck for __has_include and __has_embed.

hubert-reinterpretcast · 2025-03-23T15:47:41Z

gcc and clang believe that no header-name token is formed "early", consistent with the current wording (the special rule for __has_include( only applies in a #if or #elif directive).

The wording in the Working Draft leads to the logic that either the __has_include token is part of a has-include-expression and the header-name token has to be formed or the __has_include token is not part of a has-include-expression and the program is ill-formed under https://wg21.link/cpp.cond#8:

The identifiers __has_include and __has_cpp_attribute shall not appear in any context not mentioned in this subclause.

jensmaurer · 2025-03-23T17:33:24Z

Agreed. I've removed those "second form" rules as redundant with the grammar.
I've also added a half-sentence in [cpp.cond] that prevents another round of macro replacement for __has_embed.

CWG3015

hubert-reinterpretcast · 2025-03-23T21:47:38Z

@jensmaurer,

With the rewrite, we always get UB from:

If the directive resulting after all replacements does not match the previous form, the behavior is undefined.

because the replacement produces a series of preprocessing tokens that would not consist of a single header-name.

The previous wording relied on an implicit interpretation of the resulting preprocessing tokens as raw source characters.

hubert-reinterpretcast · 2025-03-23T22:14:05Z

For info: Implementations do not form header-name tokens when __has_include( appears within the arguments of a function-like macro. Following the logic from #690 (comment), the implementations are non-conforming because they produce no diagnostic.

#define IDENT(X) X
#define stdio nosuch
#if IDENT(__has_include(<stdio.h>))
#else
#error No header-name formed!
#endif

https://godbolt.org/z/E8vzEx1Wq
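The same case can be written as a probe that records the outcome instead of raising an error. This is a sketch under the assumptions that <stdio.h> can be found and that no header named nosuch.h exists; HEADER_NAME_IN_ARGUMENT and kHeaderNameInArgument are names invented here.

```cpp
// Probe: is <stdio.h> lexed as one header-name when __has_include( appears
// inside the argument of a function-like macro? Names are hypothetical.
#define IDENT(X) X
#define stdio nosuch   // only has an effect if stdio is lexed as an identifier

#if IDENT(__has_include(<stdio.h>))
#  define HEADER_NAME_IN_ARGUMENT 1   // header-name was formed
#else
#  define HEADER_NAME_IN_ARGUMENT 0   // stdio was macro-replaced (reported behaviour)
#endif

#undef stdio
#undef IDENT

constexpr bool kHeaderNameInArgument = HEADER_NAME_IN_ARGUMENT != 0;
```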

jensmaurer · 2025-03-23T22:24:54Z

With the rewrite, we always get UB from: ...

I thought that was addressed by this sentence:

The method by which a sequence of preprocessing tokens between a < and a > preprocessing token pair or a
pair of " characters is combined into a single header name preprocessing token is implementation-defined.

which conveys the expectation that the character sequence is merged into a header-name somehow.

Do you want to make this more explicit / more integrated in the "go back to the first form" delegation?

Implementations do not form header-name tokens when __has_include( appears within the arguments of a function-like macro.

With the new wording, I think those implementations are non-conforming, because [lex.pptoken] considers only preprocessing tokens (in phase 3), not phase 4 grammar entities such as has-embed-expressions. Anywhere you have __has_include ( something, "something" is converted into a header-name if possible. As a second check, __has_include might be ill-formed when it appears in an untoward location.

hubert-reinterpretcast · 2025-03-23T23:05:44Z

Do you want to make this more explicit / more integrated in the "go back to the first form" delegation?

Yes. Also, the status quo technically establishes UB before any implementation-defined formation of header-name tokens (e.g., if there is a raw string containing a newline, UB exists).

As a second check, __has_include might be ill-formed when it appears in an untoward location.

That's an area that lacks clarity (but not too much of a problem). The current interpretation taken by implementations is that __has_include is fine in macro definitions and during macro expansion and is only a problem if it survives macro expansion to end up "in the wrong place".

An observation with the new wording is that, under the ill-formed only past macro-expansion interpretation, it is now possible for function-like macro arguments to contain header-name tokens. This introduces a need to update https://wg21.link/cpp.stringize.

jensmaurer · 2025-03-24T07:57:01Z

Do you want to make this more explicit / more integrated in the "go back to the first form" delegation?

I've (hopefully) addressed that: CWG3015

This introduces a need to update https://wg21.link/cpp.stringize.

I'm not seeing that.

If, in the replacement list, a parameter is immediately
preceded by a # preprocessing token, both are replaced by a single character string literal preprocessing token
that contains the spelling of the preprocessing token sequence for the corresponding argument (excluding
placemarker tokens).

This seems to apply to header-name preprocessing tokens all the same.

hubert-reinterpretcast · 2025-03-24T15:32:14Z

This seems to apply to header-name preprocessing tokens all the same.

See:

Otherwise, the original spelling of each preprocessing token in the stringizing argument is retained in the character string literal, except for special handling for producing the spelling of string-literals and character-literals: a \ character is inserted before each " and \ character of a character-literal or string-literal (including the delimiting " characters).

The spelling of header-name preprocessing tokens can also contain \ and " characters.
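For reference, the existing escaping rule can be seen with ordinary tokens. The sketch below only demonstrates the current behaviour for a string-literal and for a sequence of punctuators and identifiers; it does not produce a header-name token in a macro argument. The names STR, same, kFromStringLiteral and kFromPunctuators are invented for this illustration.

```cpp
#define STR(x) #x

// Helper invented for this sketch: constexpr comparison of two C strings.
constexpr bool same(const char* a, const char* b) {
    while (*a != '\0' && *a == *b) { ++a; ++b; }
    return *a == *b;
}

// The argument is a single string-literal token spelled "a\"b" in the source.
// Stringizing inserts a \ before each " and each \ of that token, so the
// value of the result is the token's original spelling.
constexpr const char* kFromStringLiteral = STR("a\"b");

// Here the argument is the token sequence < stdio . h > rather than a
// header-name, and no escaping is applied to it.
constexpr const char* kFromPunctuators = STR(<stdio.h>);
```

If a header-name token containing \ or " could reach the # operator, the escaping rule quoted above would not cover it, since it names only character-literals and string-literals.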

jensmaurer · 2025-03-24T15:38:39Z

Right. Fixed: CWG3015

hubert-reinterpretcast · 2025-03-24T15:43:50Z

I've (hopefully) addressed that: CWG3015

I think it needs an additional:

[ ... ] or if the header-name formed is not a valid header-name, the behavior is undefined.

because the implementation-defined method could "succeed" in forming an ill-formed header-name.

One example of such header-names in the wild is:

<"R(
)">

jensmaurer · 2025-03-24T15:57:25Z

I don't understand. The suggested wording says "an attempt is made to form..." Obviously, something having newlines in it is not a header-name per the lexer grammar in [lex.header], thus the "attempt to form" fails, we retain the non-header-name preprocessing tokens, and the match against the first form fails.

Saying "if a template-head is not a syntactically valid template-head" feels very much redundant.

hubert-reinterpretcast · 2025-03-24T16:22:29Z

thus the "attempt to form" fails, we retain the non-header-name preprocessing tokens, and the match against the first form fails

Let's try an alternative scenario. While I am unaware of any implementation that does so, under the new wording the "implementation-defined method" can take a string "\"" and give you <">. Under the old wording, there was unavoidable UB.

I guess my wording attempt does not cover this case though.

jensmaurer · 2025-03-24T16:38:31Z

under the new wording the "implementation-defined method" can take a string "\"" and give you <">. Under the old wording, there was unavoidable UB.

I'd be happy to remove the "implementation-defined method"; I don't know what it's intended to do other than allow strange implementations such as the one you offered in your last comment.
(I'd like to point out that the "implementation-defined method" wording exists in the status quo.)

So, we're looking at

#define S "\""   // string-literal
#include S        // undefined behavior, because the _header-name_ is "\"  (single backslash within double-quotes) followed by a lone "

except that [lex.header] p2 says the backslash is conditionally-supported. However, that paragraph does not seem to permit converting S (in the example above) to a header-name representing a source file whose name is " (because lexing the header-name must stop at the second ", regardless of any backslashes).

hubert-reinterpretcast · 2025-03-24T16:48:05Z

I'd be happy to remove the "implementation-defined method"; I don't know what it's intended to do other than allow strange implementations such as the one you offered in your last comment.

I believe it is there to allow for differences in the treatment of whitespace between tokens.

Alisdair in https://wg21.link/p2843 ascribed some behaviour where the source of the tokens was significant to this implementation-definedness, but I suspect implementation bugs.

I believe the condition we want to apply is that the character sequence of the spellings of the tokens must syntactically be a header-name.

jensmaurer · 2025-03-24T20:40:37Z

I believe the condition we want to apply is that the character sequence of the spellings of the tokens must syntactically be a header-name.

I also noticed we don't actually branch off to the first form; we just say "must be of that form", but we never said "and then we do as in the first form".

I now use "characters of the spellings", which is wording already existing in [cpp.stringize]. We still "attempt to form" a header-name from those characters; if we don't get a match, the formation fails and we bail (UB or ill-formed). Otherwise, we re-process as in the first form.

CWG3015

hubert-reinterpretcast · 2025-03-24T21:45:51Z

The current interpretation taken by implementations is that __has_include is fine in macro definitions and during macro expansion and is only a problem if it survives macro expansion to end up "in the wrong place".

@jensmaurer, under the quoted interpretation, having __has_include, etc. within the limit parameter of #embed is okay. All that needs to be done (wording-wise) is to make the lexing choose the header-name interpretation also within #embed directives. If the __has_include is in a prefix parameter, then it becomes ill-formed as part of the result of #embed. If it is part of an unused if_empty parameter, then it gets safely ignored.

Accepted by implementations (https://godbolt.org/z/TsxascPM3):

#define stdio nosuch
#if __has_embed(__FILE__ limit(\
  __has_include(<stdio.h>) ? 0 : 1\
)) == __STDC_EMBED_EMPTY__
#error Header name formed!
#endif

hubert-reinterpretcast · 2025-03-24T23:15:19Z

if we don't get a match, the formation fails and we bail (UB or ill-formed)

To avoid implementation-defined shenanigans regarding a match, I suggest:

The method by which a sequence of preprocessing tokens [ ... ] is combined [ ... ] is implementation-defined; however, the attempt shall succeed only if the characters of the spellings of the sequence form, respectively, a h-char-sequence or a q-char-sequence.
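The condition in this suggestion is purely syntactic and can be sketched as a check on the concatenated characters. The following assumes the grammar in [lex.header], where a header-name is `<` h-char-sequence `>` or `"` q-char-sequence `"`, an h-char is any character except new-line and `>`, and a q-char is any character except new-line and `"`. The function name is invented here, and the sketch says nothing about how whitespace between tokens is treated.

```cpp
// Returns true if the characters of s spell a header-name:
//   < h-char-sequence >   or   " q-char-sequence "
// with a non-empty sequence and nothing after the closing delimiter.
constexpr bool spells_header_name(const char* s) {
    char close = '\0';
    if (s[0] == '<') {
        close = '>';
    } else if (s[0] == '"') {
        close = '"';
    } else {
        return false;              // no opening delimiter
    }
    int count = 0;                 // h-chars or q-chars seen so far
    for (const char* p = s + 1; *p != '\0'; ++p) {
        if (*p == '\n') return false;          // new-line is never allowed
        if (*p == close) {
            // The first closing delimiter ends the header-name; it must be
            // the last character and the sequence must not be empty.
            return count > 0 && p[1] == '\0';
        }
        ++count;
    }
    return false;                  // closing delimiter missing
}
```

With this check, the characters `"\""` from the #include S example are rejected, because lexing stops at the second `"` and one character is left over; the raw-string case containing a new-line is rejected as well.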

jensmaurer · 2025-03-25T10:29:51Z

All that needs to be done (wording-wise) is to make the lexing choose the header-name interpretation also within #embed directives.

Yes. Turns out it's not a header-name in this slight alteration:

#define X __has_include(<stdio.h>)
#define stdio nosuch
#if __has_embed(__FILE__ limit(\
  X ? 0 : 1\
)) == __STDC_EMBED_EMPTY__
#error Header name formed!
#endif

I find this mildly odd, but it's consistent with #if and #embed environments being "different" in a number of ways.

Wording in [lex.pptoken] fixed: CWG3015

To avoid implementation-defined shenanigans regarding a match, I suggest:

If the implementation leeway is just for the (non-)consideration of whitespace, we should directly say so.
Added "; the treatment of whitespace is implementation-defined." in CWG3015 for #include and #embed and removed the more general implementation-defined blurb.

The method by which a sequence of preprocessing tokens between a < and a > preprocessing token pair or a pair of " characters is combined ...

This is broken anyway, because a header-name is the entire thing including the delimiters (< or "), but this phrasing makes it sound like we're only dealing with the stuff inside the delimiters.

jensmaurer changed the title ~~[lex.pptoken] Implementation divergence in _header-name_ tokenization for __has_include~~ CWG3015 [lex.pptoken] Implementation divergence in _header-name_ tokenization for __has_include Mar 23, 2025
