Skip to content

Issue with driver not honoring the server keep-alive timeout settings #515

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 28, 2020

Conversation

amitsharma10
Copy link

@amitsharma10 amitsharma10 commented Oct 30, 2020

We faced an issue where the client was getting broken pipe error while sending a request to insert data into clickhouse. We figured that the server had <keep_alive_timeout>3</keep_alive_timeout> and client (driver) was trying to keep the connection alive for 30 seconds (default).
Ideally, client shouldn't try to use it's own settings for keep-alive when server responds with a different value. If server doesn't respond with a value of timeout, TCP defaults should be used.

The chances of getting this issue are more when data is written as an InputStream as HttpClient can't retry that request due to stream processing.

After we fixed and deployed this change locally, we haven't received any broke pipe issues.

@amitsharma10
Copy link
Author

@alexey-milovidov Is this something you can review ?

@zhicwu
Copy link
Contributor

zhicwu commented Dec 7, 2020

@amitsharma10, great work! Would you mind to change target to develop branch? I can do more test there and make it part of the upcoming 0.2.5 release.

@amitsharma10 amitsharma10 changed the base branch from master to develop December 7, 2020 05:10
@amitsharma10
Copy link
Author

@zhicwu I have updated the target to develop branch, please go ahead with review.

@amitsharma10
Copy link
Author

@zhicwu Please let me know if you need any help with testing/review.

@zhicwu
Copy link
Contributor

zhicwu commented Dec 28, 2020

Thank you @amitsharma10 and apologize for the late response. I'm good for the change.

Unit test should be added in general, but I think we're good as it's been tested in your environment for a while. Moreover, I'll start to create a new branch to refactor the code by removing unnecessary dependencies, replacing httpclient with light4j, and maybe multi-protocol support(http, grpc and native).

@den-crane
Copy link
Collaborator

It seems it does not solve the issues

I am still getting:
ClickHouse exception, code: 1002, host: localhost, port: 8123; localhost:8123 failed to respond [DB Errorcode=1002]

@zhicwu
Copy link
Contributor

zhicwu commented Jan 7, 2021

It seems it does not solve the issues

I am still getting:
ClickHouse exception, code: 1002, host: localhost, port: 8123; localhost:8123 failed to respond [DB Errorcode=1002]

Definitely not the message you want to see when you wake up in the morning ;) This is probably the most critical issue needs to be fixed. I'm working on performance/stress test and I'll look into this today.

@amitsharma10
Copy link
Author

Would you mind sharing your test case? This error doesn’t seem to be caused by the code in this PR

@zhicwu
Copy link
Contributor

zhicwu commented Jan 7, 2021

Would you mind sharing your test case? This error doesn’t seem to be caused by the code in this PR

I was kind of hoping this PR can fix or mitigate the failed to respond issue. It happens randomly at my end and stress test didn't help(because the error happens when server closed the connection). Anyway, I think we may take retry for this specific case as a workaround, before replacing httpclient with other lib like light4j.

@den-crane, you may give this a shot and see if it works at your end. I'll see if I can find a way to reproduce the issue and add unit test accordingly.

@gj-zhang
Copy link

gj-zhang commented Jan 8, 2021

It seems it does not solve the issues

I am still getting:
ClickHouse exception, code: 1002, host: localhost, port: 8123; localhost:8123 failed to respond [DB Errorcode=1002]

me too.
And i print start time and end time around executeQuery method. It takes about 10 milliseconds.The error frequency is about 2 times a day, one of which is fixed at the time of maximum write traffic.

Another phenomenon is that my server's RECV-Q is often very large
ClickHouse/ClickHouse#18667 (comment)

and i change the httpclientbuilder with #515 changed code

@zhicwu
Copy link
Contributor

zhicwu commented Jan 8, 2021

@gj-zhang , did you try 0.2.5-SNAPSHOT which contains a workaround(see PR #540)? It seems working at my end. Anyway, I'll try to implement tests to further validate the fix.

@gj-zhang
Copy link

@gj-zhang , did you try 0.2.5-SNAPSHOT which contains a workaround(see PR #540)? It seems working at my end. Anyway, I'll try to implement tests to further validate the fix.

with reuse strategy and without retry

@gj-zhang
Copy link

@gj-zhang , did you try 0.2.5-SNAPSHOT which contains a workaround(see PR #540)? It seems working at my end. Anyway, I'll try to implement tests to further validate the fix.

hi, do you resolve this problem?

@zhicwu
Copy link
Contributor

zhicwu commented Jan 20, 2021

hi, do you resolve this problem?

The pull request is WIP as we should retry request only when the sql is idempotent. I'm working on a loose parser using JavaCC(with tailored grammar from the ANTLR4 version), which can be used to check idempotency of given query. I'll try to make it in a day or two. On other hand, I noticed a similar error happens randomly in batch inserting during regression test, which won't be fixed by the PR.

Anyway, meantime(before 0.2.5 release), if your work is just about query(instead of mutation/DDL), I'd suggest you to try the snapshot build, which works well in the past weeks at my end.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants