Add proposal for parquet storage #6712

yeya24 · 2025-04-21T18:19:17Z

What this PR does:

This PR adds a design proposal for Parquet Storage in Cortex

Which issue(s) this PR fixes:
Fixes #

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Ben Ye <[email protected]>

alanprot · 2025-04-21T18:44:37Z

Thanks @yeya24, this looks amazing!

There are some small differences between the data cols description here and what we did in prometheus-community/parquet-common#2, but overall it looks very good.

SungJin1212 · 2025-04-22T01:04:16Z

It seems promising, thanks!

CharlieTLe · 2025-04-22T02:53:19Z

docs/proposals/parquet-storage.md

+
+### Data Format
+
+Following the current desgin of Cortex, each Parquet file contains at most 1 day of data.


Suggested change

-Following the current desgin of Cortex, each Parquet file contains at most 1 day of data.
+Following the current design of Cortex, each Parquet file contains at most 1 day of data.

rapphil · 2025-04-22T05:36:44Z

docs/proposals/parquet-storage.md

+| `s_lbl_{labelName}` | Values for a given label name. Rows are sorted by metric name | ByteArray (string) | RLE_DICTIONARY/Zstd/No | Yes |
+| `s_data_{n}` | Chunks columns (0 to data_cols_count). Each column contains data from `[n*duration, (n+1)*duration]` where duration is `24h/data_cols_count` | ByteArray (encoded chunks) | DeltaByteArray/Zstd/Yes | Yes |
+
+data_cols_count_md will be a parquet file metadata and its value is usually 3 but it can be configurable to adjust for different usecases.


Is `data_cols_count_md` the same as `data_cols_count`?

Yes, I will update it to `data_cols_count`.
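
To make the column layout in the quoted table concrete, here is a minimal sketch of how the per-column time ranges follow from `data_cols_count`. The function name `column_time_ranges` is hypothetical, and treating each range as half-open (start inclusive, end exclusive) with columns numbered `0` to `data_cols_count - 1` is an assumption; the proposal text only gives `[n*duration, (n+1)*duration]` with `duration = 24h/data_cols_count`.

```python
from datetime import timedelta


def column_time_ranges(data_cols_count: int,
                       block_duration: timedelta = timedelta(hours=24)):
    """Return the (start, end) offset covered by each s_data_{n} column.

    Assumption: ranges are half-open and n runs from 0 to data_cols_count - 1.
    """
    if data_cols_count <= 0:
        raise ValueError("data_cols_count must be positive")
    # duration = block duration / data_cols_count, as in the proposal table.
    duration = block_duration / data_cols_count
    return [(n * duration, (n + 1) * duration) for n in range(data_cols_count)]


# With the usual value of 3, the columns cover 0-8h, 8h-16h and 16h-24h.
for n, (start, end) in enumerate(column_time_ranges(3)):
    print(f"s_data_{n}: {start} - {end}")
```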

rapphil · 2025-04-22T05:40:38Z

docs/proposals/parquet-storage.md

+   - Maintains row and row group order matching the Labels file
+   - Contains multiple chunk columns for time-series data. Each column covering a time range of chunks: 0-8h, 8h-16h, 16-24h.
+
+#### Column Specifications


Does this column specification come from joining the two Parquet files above?

Maybe I'm missing something obvious, but it would be nice to include the rationale for splitting into two files. Rows are ordered the same way in both files, so I'm not sure why they need to be split.

Yes, you are definitely right. We experimented with a single file containing both labels and chunks. The reason for splitting into 2 files is that labels and chunks have rather different sizes and read patterns. With separate files we can configure the Parquet reader's read buffer differently for each, so that we can read more efficiently.

There is also a POC from Cloudflare which uses 2 files so that they can choose to store those files differently. For more efficient index queries, they can cache the labels file in memory and, because of its size, leave the chunks file on object storage.

Overall, 2 files seem to be a more flexible approach. Maybe @alanprot can share more info.

2 files are useful since the labels Parquet file is tiny and can be stored on disk or memoized if wanted. This reduces requests to object storage for any label-related lookups.

rapphil · 2025-04-22T05:41:49Z

docs/proposals/parquet-storage.md

+
+## Background
+
+Since the introduction of Block Storage in Cortex, TSDB format and Store Gateway is the de-facto way to query long term data on object storage. However, it presents several significant chanllenges:


Suggested change

-Since the introduction of Block Storage in Cortex, TSDB format and Store Gateway is the de-facto way to query long term data on object storage. However, it presents several significant chanllenges:
+Since the introduction of Block Storage in Cortex, TSDB format and Store Gateway is the de-facto way to query long term data on object storage. However, it presents several significant challenges:

justinjung04

This is great, thank you!

justinjung04 · 2025-04-22T05:56:20Z

docs/proposals/parquet-storage.md

+
+It is similar to compactor, however, it only converts single block. The converted Parquet files will be stored in the same TSDB block folder so that the lifecycle of Parquet file will be managed together with the block.
+
+Only certain blocks can be configured to convert to Parquet file and it will be block duration based, for example we only convert if block duration is >= 12h.


Is it safe to assume that the blocks converted to Parquet files will not be further compacted? If not, how do we manage to compact blocks with Parquet files?

In this proposal, Parquet files won't be further compacted, as that requires a compactor which takes in multiple Parquet files and outputs 1 Parquet file. This is added as a non-goal.

In this proposal we only add a Parquet converter which takes 1 TSDB block and outputs 1 Parquet file.
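
The duration-based selection described in the quoted proposal text can be sketched as follows. This is only an illustration under stated assumptions: `BlockMeta`, `blocks_to_convert`, and the `already_converted` set are hypothetical names, not part of Cortex, and the 12h threshold is the example value from the proposal.

```python
from dataclasses import dataclass

HOUR_MS = 3_600_000


@dataclass(frozen=True)
class BlockMeta:
    """Hypothetical, minimal view of a TSDB block's metadata."""
    ulid: str
    min_time_ms: int
    max_time_ms: int

    @property
    def duration_ms(self) -> int:
        return self.max_time_ms - self.min_time_ms


def blocks_to_convert(blocks, min_duration_ms=12 * HOUR_MS,
                      already_converted=frozenset()):
    """Pick blocks that are long enough and not converted yet.

    Each selected block is converted on its own: one TSDB block in,
    one Parquet output, with no merging of several blocks.
    """
    return [
        b for b in blocks
        if b.duration_ms >= min_duration_ms and b.ulid not in already_converted
    ]


blocks = [
    BlockMeta("block-2h", 0, 2 * HOUR_MS),
    BlockMeta("block-12h", 0, 12 * HOUR_MS),
    BlockMeta("block-24h", 0, 24 * HOUR_MS),
]
print([b.ulid for b in blocks_to_convert(blocks)])
```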

justinjung04 · 2025-04-22T06:02:36Z

docs/proposals/parquet-storage.md

+2. **Chunks Parquet File**
+   - Maintains row and row group order matching the Labels file
+   - Contains multiple chunk columns for time-series data. Each column covering a time range of chunks: 0-8h, 8h-16h, 16-24h.


Could you clarify the benefit of chunks parquet file over the existing chunks file? The labels parquet file clearly has advantage of fetching all series for label matchers, but once we have the final list of series to fetch, I'm not sure how having chunks in parquet file will help with performance or memory utilization.

It enables fetching chunks in groups of 8h (by default, configurable). This reduces requests to object storage if you don't need to fetch chunks for all 24h of the block.

I think the main benefit here is reduced overfetch. We have seen quite bad overfetch of chunks on Store Gateway, where the ratio of fetched chunk bytes to touched chunk bytes is 20:1.

There could be overfetch in Parquet as well, but the smallest read unit in Parquet is a page, and the page size is configurable. In our initial tests the overfetch looks much better.

Parquet's compression and encoding are another nice addition: we were able to see a 30% reduction in chunk size.
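
As an illustration of why per-range chunk columns help with overfetch, the sketch below works out which `s_data_{n}` columns a query has to read. The function name and parameters are hypothetical, timestamps are assumed to be in milliseconds, and the query end is treated as inclusive; none of this is taken from the actual implementation.

```python
DAY_MS = 24 * 3_600_000


def columns_for_query(query_start_ms, query_end_ms, block_start_ms,
                      data_cols_count=3, block_duration_ms=DAY_MS):
    """Return the names of the chunk columns overlapping the query range."""
    col_ms = block_duration_ms // data_cols_count
    # Clamp the query to the time range covered by the block.
    lo = max(query_start_ms, block_start_ms)
    hi = min(query_end_ms, block_start_ms + block_duration_ms - 1)
    if lo > hi:
        return []  # The query does not touch this block at all.
    first = (lo - block_start_ms) // col_ms
    # The last column absorbs any remainder from integer division.
    last = min((hi - block_start_ms) // col_ms, data_cols_count - 1)
    return [f"s_data_{n}" for n in range(first, last + 1)]


HOUR_MS = 3_600_000
# A one-hour query inside the second 8h window only needs one column.
print(columns_for_query(9 * HOUR_MS, 10 * HOUR_MS, block_start_ms=0))
```

A query confined to a single window reads one of the three chunk columns instead of chunks for the whole day, which is the request reduction described above.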

matej-g

Nice work 👍, I added a couple of questions.

matej-g · 2025-04-22T08:48:26Z

docs/proposals/parquet-storage.md

+
+### Data Format
+
+Following the current design of Cortex, each Parquet file contains at most 1 day of data.


I don't know nearly enough about the format, but I'm missing some more justification / explanation why 1 day, beyond current design.

In Cortex, the largest block we have is 1 day. It is configurable, though.
I can remove this sentence. In short, a Parquet file should have the same duration as the TSDB block, since it is converted from a TSDB block.

matej-g · 2025-04-22T08:49:22Z

docs/proposals/parquet-storage.md

+
+2. **Chunks Parquet File**
+   - Maintains row and row group order matching the Labels file
+   - Contains multiple chunk columns for time-series data. Each column covering a time range of chunks: 0-8h, 8h-16h, 16-24h.


If I understood the conversation below correctly, this is an example of how the columns can be split. Maybe it would be good to adjust this point to clarify that it's an example of how this can be done (or the default?).

Good point. Will add

matej-g · 2025-04-22T08:53:08Z

docs/proposals/parquet-storage.md

+| `s_hash` | Hash of all labels | INT64 | None/Zstd/Yes | No |
+| `s_col_indexes` | Bitmap indicating which columns store the label set for this row (series) | ByteArray (bitmap) | DeltaByteArray/Zstd/Yes | Yes |
+| `s_lbl_{labelName}` | Values for a given label name. Rows are sorted by metric name | ByteArray (string) | RLE_DICTIONARY/Zstd/No | Yes |
+| `s_data_{n}` | Chunks columns (0 to data_cols_count). Each column contains data from `[n*duration, (n+1)*duration]` where duration is `24h/data_cols_count` | ByteArray (encoded chunks) | DeltaByteArray/Zstd/Yes | Yes |


Maybe a dumb question, but should the column count always result in columns being split by full hours (e.g. every 6 / 8 / 12 hours)? Are there any consequences if that's not so?

I will mention that we re-encode the chunks a bit in the writer so that they fall into the configured column time ranges.
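
One way to picture that re-encoding step is to bucket a series' samples by column boundary before chunks are written. The sketch below is hypothetical: it works on plain `(timestamp_ms, value)` pairs, skips the actual chunk encoding, and assumes half-open column ranges with the last column absorbing any remainder.

```python
DAY_MS = 24 * 3_600_000


def split_samples_by_column(samples, block_start_ms,
                            data_cols_count=3, block_duration_ms=DAY_MS):
    """Group (timestamp_ms, value) samples by the column they belong to.

    Returns one list per column; each list would then be encoded into
    chunks that never cross a column boundary.
    """
    col_ms = block_duration_ms // data_cols_count
    columns = [[] for _ in range(data_cols_count)]
    for ts, value in samples:
        offset = ts - block_start_ms
        if not 0 <= offset < block_duration_ms:
            raise ValueError(f"sample at {ts} is outside the block")
        n = min(offset // col_ms, data_cols_count - 1)
        columns[n].append((ts, value))
    return columns


MINUTE_MS = 60_000
HOUR_MS = 60 * MINUTE_MS
samples = [
    (7 * HOUR_MS + 59 * MINUTE_MS, 1.0),
    (8 * HOUR_MS, 2.0),
    (23 * HOUR_MS, 3.0),
]
print(split_samples_by_column(samples, block_start_ms=0))
```

Under these assumptions the column boundaries do not have to fall on full hours: with `data_cols_count=5`, each column covers 4h48m and the same bucketing applies.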

rapphil · 2025-04-22T16:42:00Z

docs/proposals/parquet-storage.md

+
+## Open Questions
+
+1. Should we use Parquet Gateway to replace Store Gateway


Having a fully compatible API between Parquet Gateway and Store Gateway would make the migration easier as well, no?

I believe it is easier to just replace the querier with a Parquet querier, as it is stateless.
But yeah, migration is not part of the proposal. We probably need to create a migration guide later.

danielblando · 2025-04-22T17:08:32Z

docs/proposals/parquet-storage.md

+
+Similar to the existing `distributorQueryable` and `blockStorageQueryable`, Parquet queryable is a queryable implementation which allows Cortex to query parquet files and can be used in both Cortex Querier and Ruler.
+
+If Parquet queryable is enabled, block storage queryable will be disabled and Cortex querier will not query Store Gateway anymore. `distributorQueryable` remains unchanged so it still queries Ingesters.


If we have the Parquet converter configured for only blocks >= 12h, would `blockStorageQueryable` still be enabled when querying blocks < 12h?

Existing flags like query ingesters within and query store after will still be used, so we can fall back to Ingesters.
We can also do some kind of fallback to Store Gateway during the migration phase, as you said. But those are implementation details.

The long-term solution is for the compactor to create and compact Parquet files, the same as what we have today.

danielblando · 2025-04-22T17:13:00Z

docs/proposals/parquet-storage.md

+
+If Parquet queryable is enabled, block storage queryable will be disabled and Cortex querier will not query Store Gateway anymore. `distributorQueryable` remains unchanged so it still queries Ingesters.
+
+Parquet queryable uses bucket index to discovers  parquet files in object storage. The bucket index is the same as the existing TSDB bucket index file, but using a different name `bucket-index-parquet.json.gz`. It is updated periodically by Cortex Compactor/Parquet Converter if parquet storage is enabled.


How do we still query the 12h TSDB blocks while we don't have them in the Parquet index?
For example, if we have 8h blocks and we are compacting to 12h blocks, I assume that after the new 12h TSDB blocks are created we convert them to Parquet. But until the conversion finishes, we only have them in the default index, and the 8h blocks would have been removed from the default index by the compactor, no?

We cannot query them until the Parquet file is created and added to the bucket index. There is a tradeoff here, and users can configure whichever option works for them:

- You can configure converting only 12h+ blocks. This has a longer delay in Parquet file creation but requires fewer resources for conversion. Users need to expect to fall back to some other storage to query the data.
- You can configure converting 2h blocks. Maybe we can configure it to only convert blocks after deduplication, so Parquet files are available earlier, but more compactors are required to do the conversion.

Cool, makes more sense now. I think Parquet would then need to do some kind of merge between both indexes for the 12h scenario.
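
The merge mentioned here is not specified in the proposal. As a purely hypothetical sketch, it could amount to comparing the block IDs in the default bucket index with those in `bucket-index-parquet.json.gz` and routing each block accordingly. The function name and the plain lists of block IDs below are assumptions for illustration only.

```python
def plan_block_sources(tsdb_index_blocks, parquet_index_blocks):
    """Split known blocks into Parquet-backed ones and fallback ones.

    tsdb_index_blocks: block IDs listed in the default bucket index.
    parquet_index_blocks: block IDs listed in the Parquet bucket index.
    """
    converted = set(parquet_index_blocks)
    from_parquet = [b for b in tsdb_index_blocks if b in converted]
    # Blocks not yet converted must be served by some other storage path.
    fallback = [b for b in tsdb_index_blocks if b not in converted]
    return from_parquet, fallback


from_parquet, fallback = plan_block_sources(
    tsdb_index_blocks=["block-a", "block-b", "block-c"],
    parquet_index_blocks=["block-a", "block-b"],
)
print(from_parquet, fallback)
```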

add proposal for parquet storage

ef43b28

Signed-off-by: Ben Ye <[email protected]>

pull-request-size bot added the size/L label Apr 21, 2025

dosubot bot added the component/documentation label Apr 21, 2025

typo fix

b4cf933

Signed-off-by: Ben Ye <[email protected]>

alanprot approved these changes Apr 21, 2025

dosubot bot added the lgtm This PR has been approved by a maintainer label Apr 21, 2025

SungJin1212 approved these changes Apr 22, 2025

CharlieTLe reviewed Apr 22, 2025

CharlieTLe approved these changes Apr 22, 2025

rapphil reviewed Apr 22, 2025

justinjung04 reviewed Apr 22, 2025

address comments and update bucket index

475c95f

Signed-off-by: Ben Ye <[email protected]>

matej-g reviewed Apr 22, 2025

mention how chunks work

8297c53

Signed-off-by: Ben Ye <[email protected]>

rapphil reviewed Apr 22, 2025

danielblando reviewed Apr 22, 2025

alanprot mentioned this pull request Apr 22, 2025

Implementing Parquet Converter #6716

Draft

3 tasks
