Don't rewrite _map files on the save_principal_key redo #217

Merged (2 commits) on Apr 24, 2025

Conversation

dAdAbird (Member)

We create a new WAL key during the extension init, which happens before the redo. This means that after a crash, pg_tde_save_principal_key_redo rewrote the WAL _map file during recovery and destroyed the newly created key.

Since we emit the XLog record after the key has been successfully written to the file (that is, the file has been created), we can safely assume that the file should not be changed if it already exists.
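
The "do not change the file if it exists" rule can be illustrated with a minimal standalone sketch. This is not the pg_tde implementation: the function `redo_write_if_absent` and the `keyfile_result` codes are hypothetical names invented for illustration, and the sketch assumes a plain POSIX file rather than the real _map file format.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch; not part of pg_tde. */
typedef enum
{
    KEYFILE_CREATED,    /* file was absent and has been written */
    KEYFILE_KEPT,       /* file already existed and was left untouched */
    KEYFILE_ERROR       /* any other failure */
} keyfile_result;

/*
 * Write `len` bytes to `path` only if the file does not exist yet.
 * O_EXCL makes the existence check and the creation a single atomic
 * step, so an existing file is never truncated or overwritten.
 */
static keyfile_result
redo_write_if_absent(const char *path, const void *data, size_t len)
{
    assert(path != NULL && data != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
        return (errno == EEXIST) ? KEYFILE_KEPT : KEYFILE_ERROR;

    ssize_t written = write(fd, data, len);
    int ok = (written == (ssize_t) len) && (fsync(fd) == 0);

    if (close(fd) != 0)
        ok = 0;

    if (!ok)
    {
        /* Do not leave a partially written file behind. */
        unlink(path);
        return KEYFILE_ERROR;
    }
    return KEYFILE_CREATED;
}
```

In this sketch, a first call stands in for the init step that creates the new key file, and a later call stands in for a redo record being replayed: the second call returns `KEYFILE_KEPT` and the contents written by the first call survive.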

@dAdAbird dAdAbird requested a review from jeltz April 15, 2025 07:56
@dAdAbird dAdAbird requested a review from dutow as a code owner April 15, 2025 07:56
jeltz (Collaborator) commented Apr 15, 2025

I am a bit uncertain about this, since I feel it goes against the spirit of what a redo is supposed to do. For example, wouldn't this cause issues when the record is replayed on a replica that has WAL encryption enabled? I need to think about this a bit.

codecov-commenter commented Apr 23, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.14%. Comparing base (e450170) to head (014e977).
Report is 2 commits behind head on TDE_REL_17_STABLE.

❌ Your project status has failed because the head coverage (78.14%) is below the target coverage (90.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@                  Coverage Diff                  @@
##           TDE_REL_17_STABLE     #217      +/-   ##
=====================================================
+ Coverage              77.14%   78.14%   +0.99%     
=====================================================
  Files                     22       22              
  Lines                   2490     2489       -1     
  Branches                 395      394       -1     
=====================================================
+ Hits                    1921     1945      +24     
+ Misses                   490      466      -24     
+ Partials                  79       78       -1     
Components    Coverage Δ
access        78.67% <100.00%> (+0.53%) ⬆️
catalog       86.06% <ø> (+2.41%) ⬆️
common        92.50% <ø> (ø)
encryption    71.90% <ø> (ø)
keyring       72.07% <ø> (ø)
src           53.79% <ø> (ø)
smgr          98.03% <ø> (ø)
transam       ∅ <ø> (∅)

jeltz (Collaborator) left a comment

I still do not trust that this fix is correct, but if it is a blocker for your testing of the WAL stuff, merge it after addressing the feedback on the tests.

$node->kill9;
PGTDE::append_to_result_file("-- server start");
PGTDE::append_to_result_file(
"-- check the recovery redo and the fix in PR #217 ('Failed to verify principal key header...' error)"
Collaborator

I would prefer not to have a reference to a PR here, but rather an explanation of what we are actually testing.


my $node = PostgreSQL::Test::Cluster->new('main');
$node->init;
$node->append_conf('postgresql.conf', "shared_preload_libraries = 'pg_tde'");
Collaborator

For a recovery test, shouldn't you set the checkpoint timeout to something like 60 minutes to make the intent clear? Or is that overkill?

Member, Author

Yes, that makes sense, just to be sure.
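
As a configuration fragment, the suggestion might look like the following. The 60-minute value comes from the review comment; whether the merged test uses exactly this setting is an assumption.

```
# Assumed setting, not quoted from the merged test.
# With timed checkpoints this far apart, a crash shortly after startup
# should leave WAL records for the redo routines to replay.
checkpoint_timeout = 60min
```

In a TAP test this line would presumably be added with `$node->append_conf('postgresql.conf', ...)`, the same call the test already uses for `shared_preload_libraries`.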

Add tests that trigger the redo routines after a server crash. They mostly check
invariants in cases where different redo functions might rewrite WAL keys created at the
init stage.

For PG-1539, PG-1541, PG-1468, PG-1413
@dAdAbird dAdAbird requested a review from jeltz April 24, 2025 10:10
jeltz (Collaborator) left a comment

Looks good now.

@dAdAbird dAdAbird merged commit c8419b8 into percona:TDE_REL_17_STABLE Apr 24, 2025
22 checks passed