Distributed compute refactoring #1047

Draft: wants to merge 27 commits into main (base branch)

dreadatour
Contributor

Draft, description to be added soon.

dreadatour requested a review from Copilot April 19, 2025 18:46
dreadatour self-assigned this Apr 19, 2025
dreadatour marked this pull request as draft April 19, 2025 18:46
cloudflare-workers-and-pages bot commented Apr 19, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: c86c9a6
Status: ✅  Deploy successful!
Preview URL: https://73b23262.datachain-documentation.pages.dev
Branch Preview URL: https://distributed-refactoring.datachain-documentation.pages.dev

Copilot AI left a comment

Pull Request Overview

This PR introduces a refactoring of the distributed compute components along with several improvements and cleanups across tests, UDF dispatching, SQL query handling, and CLI commands. Key changes include:

  • Consolidation and cleanup of test functions and internal UDF dispatch APIs.
  • Refinements in error handling, queue management, and process parallelism to improve robustness.
  • Updates to SQL query construction, particularly in the metastore dataset version update, which require careful validation.

Reviewed Changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| tests/unit/test_dispatch.py | Consolidation of imports and removal of a redundant test case. |
| tests/test_query_e2e.py | Updated expected process return codes in the interruption test. |
| src/datachain/utils.py | Minor improvements in parallel processing configuration. |
| src/datachain/query/utils.py | Stricter type checking for query column retrieval. |
| src/datachain/query/udf.py | Removal of the unused abstract static method run_worker. |
| src/datachain/query/dispatch.py | Refactoring of UDF dispatch logic with new catalog property and queue handling. |
| src/datachain/query/dataset.py | Improved exception handling and use of walrus operator in dataset population. |
| src/datachain/query/batch.py | Deferred import for SELECT_BATCH_SIZE moved into the function. |
| src/datachain/lib/listing.py | Updated list function naming to better reflect its purpose. |
| src/datachain/data_storage/warehouse.py | New method added to select dataset rows from a list of IDs. |
| src/datachain/data_storage/metastore.py | Revised dataset version update with a new SQL WHERE clause. |
| src/datachain/cli/parser/utils.py | Updated internal commands for argument parsing. |
| src/datachain/cli/__init__.py | Adjustments to handle UDF dispatch commands. |
Comments suppressed due to low confidence (5)

tests/unit/test_dispatch.py:38

  • The removal of the test_send_stop_signal_to_workers function reduces the test coverage for UDFDispatcher's behavior when sending stop signals. Consider reintroducing a test or ensuring coverage through alternative tests.
-    def test_send_stop_signal_to_workers():

tests/test_query_e2e.py:200

  • The expected process return code has been changed from 1 to 11. Please verify that the update to (interrupt_exit_code, 11) is intentional and consistent with the underlying interruption handling logic.
-        if process.returncode not in (interrupt_exit_code, 1):

src/datachain/query/udf.py:44

  • The abstract static method run_worker has been removed. Ensure that its removal aligns with the overall UDF design and that no consumers rely on this method for worker execution.
-    @staticmethod

src/datachain/query/dispatch.py:166

  • The call to self.run_udf_parallel within run_udf may unintentionally recurse into the same method rather than invoking the separate parallel processing logic defined later. Consider reviewing the method naming or call structure to avoid potential confusion or recursion issues.
+            self.run_udf_parallel(n_workers, input_rows, download_cb, processed_cb, generated_cb)

src/datachain/query/utils.py:25

  • The updated check now raises an error if the column is not an instance of Column, which might exclude valid ColumnElement types like TextClause. Confirm that this stricter type check is intended.
+    if col is None or not isinstance(col, Column):
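
To make the concern above concrete, here is a small sketch of which SQLAlchemy objects would pass a check of this shape. The helper `passes_strict_check` and the sample columns are hypothetical and only mirror the condition in the diff line; this is not the datachain code itself.

```python
from sqlalchemy import Column, Integer, column, func
from sqlalchemy.sql.expression import ColumnElement

# Three column-like objects a query might carry (all names are illustrative).
table_col = Column("size", Integer)     # schema-level Column
adhoc_col = column("size")              # lightweight ColumnClause, not a Column
expr_col = func.lower(column("name"))   # function expression


def passes_strict_check(col):
    """Hypothetical mirror of the diff condition, inverted: True means accepted."""
    return col is not None and isinstance(col, Column)


results = {
    "Column": passes_strict_check(table_col),
    "column()": passes_strict_check(adhoc_col),
    "func.lower()": passes_strict_check(expr_col),
}
print(results)
```

Both `column()` and `func.lower()` produce `ColumnElement` instances, yet only the schema-level `Column` is accepted. Whether rejecting the other two is desirable depends on what callers of the real function pass in.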

        dv = self._datasets_versions
        self.db.execute(
            self._datasets_versions_update()
            .where(dv.c.dataset_id == dataset.id and dv.c.version == version)
Copilot AI Apr 19, 2025

Using the Python 'and' operator in the SQL WHERE clause may not build the intended SQL expression. Consider replacing it with sqlalchemy.and_(dv.c.dataset_id == dataset.id, dv.c.version == version) to ensure proper SQL clause formation.

Suggested change
-            .where(dv.c.dataset_id == dataset.id and dv.c.version == version)
+            .where(and_(dv.c.dataset_id == dataset.id, dv.c.version == version))
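
The difference between the two forms can be checked directly, as in the sketch below. It assumes a stand-in table whose column names are taken from the snippet under review; the table definition, the `num_objects` column, and the literal values are hypothetical. Python's `and` cannot be overloaded, so it evaluates `bool()` on the first comparison and returns one of the two operands instead of building a SQL `AND`.

```python
from sqlalchemy import Column, Integer, MetaData, Table, and_, update

metadata = MetaData()
# Hypothetical stand-in for the datasets_versions table.
dv = Table(
    "datasets_versions",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("dataset_id", Integer),
    Column("version", Integer),
    Column("num_objects", Integer),
)

dataset_id, version = 1, 2

# Python-level `and`: yields a single comparison object, not a conjunction.
python_and = dv.c.dataset_id == dataset_id and dv.c.version == version
# SQL-level conjunction built explicitly.
sql_and = and_(dv.c.dataset_id == dataset_id, dv.c.version == version)

stmt_python_and = update(dv).where(python_and).values(num_objects=0)
stmt_sql_and = update(dv).where(sql_and).values(num_objects=0)

# Keep only the WHERE part of each rendered statement for comparison.
where_python_and = str(stmt_python_and).split("WHERE", 1)[1]
where_sql_and = str(stmt_sql_and).split("WHERE", 1)[1]

print("python and:", where_python_and.strip())
print("and_():    ", where_sql_and.strip())
```

With the Python operator, only one of the two comparisons reaches the rendered WHERE clause, so the update could match more rows than intended. In SQLAlchemy 1.4 and later, passing both criteria as separate arguments to `.where()` should be equivalent to the `and_()` form.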

codecov bot commented Apr 19, 2025

Codecov Report

Attention: Patch coverage is 73.24561% with 61 lines in your changes missing coverage. Please review.

Project coverage is 88.09%. Comparing base (e425394) to head (c86c9a6).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/datachain/query/dispatch.py | 64.70% | 19 Missing and 5 partials ⚠️ |
| src/datachain/data_storage/metastore.py | 56.09% | 8 Missing and 10 partials ⚠️ |
| src/datachain/query/dataset.py | 55.55% | 6 Missing and 2 partials ⚠️ |
| src/datachain/data_storage/warehouse.py | 73.91% | 4 Missing and 2 partials ⚠️ |
| src/datachain/lib/udf.py | 70.00% | 2 Missing and 1 partial ⚠️ |
| src/datachain/utils.py | 0.00% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1047      +/-   ##
==========================================
- Coverage   88.19%   88.09%   -0.10%     
==========================================
  Files         146      146              
  Lines       12507    12570      +63     
  Branches     1742     1755      +13     
==========================================
+ Hits        11030    11073      +43     
- Misses       1053     1065      +12     
- Partials      424      432       +8     
| Flag | Coverage Δ |
|---|---|
| datachain | 88.01% <73.24%> (-0.10%) ⬇️ |
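
The figures in this report are internally consistent, which the following sketch checks. The hit, miss, partial, and line counts are copied from the report; the total number of patch lines is not stated there and is inferred here under the assumption that partial lines count as uncovered in the patch percentage.

```python
# Project-level counts copied from the coverage diff.
base = {"lines": 12507, "hits": 11030, "misses": 1053, "partials": 424}
head = {"lines": 12570, "hits": 11073, "misses": 1065, "partials": 432}


def coverage_pct(report):
    """Project coverage as hits / lines, in percent."""
    return 100 * report["hits"] / report["lines"]


# (missing, partial) per file, from the "Files with missing lines" table.
uncovered = {
    "src/datachain/query/dispatch.py": (19, 5),
    "src/datachain/data_storage/metastore.py": (8, 10),
    "src/datachain/query/dataset.py": (6, 2),
    "src/datachain/data_storage/warehouse.py": (4, 2),
    "src/datachain/lib/udf.py": (2, 1),
    "src/datachain/utils.py": (1, 1),
}
patch_uncovered = sum(missing + partial for missing, partial in uncovered.values())

# Assumption: patch % = covered patch lines / total patch lines,
# so the total can be recovered from the reported percentage.
patch_pct = 73.24561
patch_lines = round(patch_uncovered / (1 - patch_pct / 100))
patch_hits = patch_lines - patch_uncovered

# Sanity check: hits + misses + partials should equal lines on both sides.
for report in (base, head):
    assert report["hits"] + report["misses"] + report["partials"] == report["lines"]

print(f"base {coverage_pct(base):.2f}%, head {coverage_pct(head):.2f}%")
print(f"patch: {patch_hits}/{patch_lines} lines covered, {patch_uncovered} uncovered")
```

Under that assumption the patch works out to 167 covered lines out of 228, and the 61 uncovered lines match the per-file sum of missing and partial lines.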

Flags with carried forward coverage won't be shown.
