Issue with driver not honoring the server keep-alive timeout settings #515

amitsharma10 · 2020-10-30T04:07:59Z

We faced an issue where the client was getting a broken pipe error while sending a request to insert data into ClickHouse. We found that the server had <keep_alive_timeout>3</keep_alive_timeout> while the client (driver) was trying to keep the connection alive for 30 seconds (its default), so the client could reuse a connection that the server had already closed.
Ideally, the client shouldn't use its own keep-alive setting when the server responds with a different value. If the server doesn't respond with a timeout value, TCP defaults should be used.

This issue is more likely when data is written as an InputStream, because HttpClient can't retry a request whose body is a stream it cannot replay.

Since we fixed this and deployed the change locally, we haven't seen any broken pipe errors.
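
For illustration only, here is a minimal sketch of the idea described above: prefer the timeout the server advertises and fall back to a caller-supplied default otherwise. It assumes the server reports its limit in a `Keep-Alive: timeout=N` response header (N in seconds). The class and method names are hypothetical and this is not the code from this PR; with Apache HttpClient, logic like this would typically sit behind a `ConnectionKeepAliveStrategy`.

```java
import java.util.Locale;

// Hypothetical helper, not part of the driver: resolves how long a pooled
// connection may be kept alive, based on the server's Keep-Alive header value.
final class KeepAliveTimeout {
    private KeepAliveTimeout() {}

    /**
     * Returns the server-advertised keep-alive timeout in milliseconds, or
     * fallbackMillis when the header is missing or has no usable timeout.
     *
     * @param keepAliveHeader value of the Keep-Alive header, e.g. "timeout=3, max=100"
     * @param fallbackMillis  value to use when the server gives no timeout
     */
    static long resolveMillis(String keepAliveHeader, long fallbackMillis) {
        if (keepAliveHeader == null) {
            return fallbackMillis;
        }
        for (String part : keepAliveHeader.split(",")) {
            String[] kv = part.trim().split("=", 2);
            if (kv.length == 2 && kv[0].trim().toLowerCase(Locale.ROOT).equals("timeout")) {
                try {
                    long seconds = Long.parseLong(kv[1].trim());
                    if (seconds >= 0) {
                        return seconds * 1000L;
                    }
                } catch (NumberFormatException ignored) {
                    // Malformed value: keep scanning, then use the fallback.
                }
            }
        }
        return fallbackMillis;
    }

    public static void main(String[] args) {
        System.out.println(resolveMillis("timeout=3, max=100", 30_000L)); // prints 3000
        System.out.println(resolveMillis(null, 30_000L));                 // prints 30000
    }
}
```

With a server configured as above, the client would drop idle connections after about 3 seconds instead of holding them for its own 30-second default.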

amitsharma10 · 2020-11-11T16:35:49Z

@alexey-milovidov Is this something you can review?

zhicwu · 2020-12-07T02:19:40Z

@amitsharma10, great work! Would you mind changing the target to the develop branch? I can do more testing there and make it part of the upcoming 0.2.5 release.

amitsharma10 · 2020-12-07T05:11:46Z

@zhicwu I have updated the target to the develop branch. Please go ahead with the review.

amitsharma10 · 2020-12-24T19:26:27Z

@zhicwu Please let me know if you need any help with testing/review.

zhicwu · 2020-12-28T07:31:46Z

Thank you @amitsharma10, and apologies for the late response. I'm good with the change.

Unit tests should be added in general, but I think we're good as it's been tested in your environment for a while. Moreover, I'll start a new branch to refactor the code by removing unnecessary dependencies, replacing httpclient with light4j, and maybe adding multi-protocol support (http, grpc and native).

den-crane · 2021-01-06T20:21:13Z

It seems it does not solve the issues

I am still getting:
ClickHouse exception, code: 1002, host: localhost, port: 8123; localhost:8123 failed to respond [DB Errorcode=1002]

zhicwu · 2021-01-07T00:01:03Z

> It seems it does not solve the issues
>
> I am still getting:
> ClickHouse exception, code: 1002, host: localhost, port: 8123; localhost:8123 failed to respond [DB Errorcode=1002]

Definitely not the message you want to see when you wake up in the morning ;) This is probably the most critical issue that needs to be fixed. I'm working on performance/stress tests and I'll look into this today.

amitsharma10 · 2021-01-07T03:16:33Z

Would you mind sharing your test case? This error doesn’t seem to be caused by the code in this PR

zhicwu · 2021-01-07T11:21:10Z

> Would you mind sharing your test case? This error doesn’t seem to be caused by the code in this PR

I was kind of hoping this PR could fix or mitigate the failed to respond issue. It happens randomly at my end, and stress testing didn't help (because the error happens when the server has closed the connection). Anyway, I think we could retry in this specific case as a workaround, before replacing httpclient with another library like light4j.

@den-crane, you may give this a shot and see if it works at your end. I'll see if I can find a way to reproduce the issue and add a unit test accordingly.

gj-zhang · 2021-01-08T05:06:01Z

> It seems it does not solve the issues
>
> I am still getting:
> ClickHouse exception, code: 1002, host: localhost, port: 8123; localhost:8123 failed to respond [DB Errorcode=1002]

Me too.
I also printed the start and end time around the executeQuery method; it takes about 10 milliseconds. The error occurs about 2 times a day, and one of those occurrences is always at the time of maximum write traffic.

Another phenomenon is that my server's RECV-Q is often very large:
ClickHouse/ClickHouse#18667 (comment)

I also changed my HttpClientBuilder setup to use the code changed in #515.

zhicwu · 2021-01-08T06:32:55Z

@gj-zhang, did you try 0.2.5-SNAPSHOT, which contains a workaround (see PR #540)? It seems to be working at my end. Anyway, I'll try to implement tests to further validate the fix.

gj-zhang · 2021-01-11T00:35:55Z

> @gj-zhang, did you try 0.2.5-SNAPSHOT, which contains a workaround (see PR #540)? It seems to be working at my end. Anyway, I'll try to implement tests to further validate the fix.

With the reuse strategy and without retry.

gj-zhang · 2021-01-20T01:46:40Z

> @gj-zhang, did you try 0.2.5-SNAPSHOT, which contains a workaround (see PR #540)? It seems to be working at my end. Anyway, I'll try to implement tests to further validate the fix.

Hi, have you resolved this problem?

zhicwu · 2021-01-20T01:58:59Z

> Hi, have you resolved this problem?

The pull request is WIP, as we should retry a request only when the SQL is idempotent. I'm working on a loose parser using JavaCC (with a grammar tailored from the ANTLR4 version), which can be used to check the idempotency of a given query. I'll try to finish it in a day or two. On the other hand, I noticed that a similar error happens randomly in batch inserts during regression testing, which won't be fixed by the PR.

Anyway, in the meantime (before the 0.2.5 release), if your work is just queries (rather than mutations/DDL), I'd suggest trying the snapshot build, which has worked well over the past weeks at my end.
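
As a rough sketch of the "retry only when the SQL is idempotent" idea mentioned above: the example below retries a request on an `IOException` only when a naive prefix check suggests the statement is read-only. This is an assumption-laden stand-in, not the JavaCC parser described in this thread and not the code in PR #540; `IdempotentRetry`, `looksReadOnly`, and `execute` are hypothetical names, and the prefix check will misclassify some statements (for example, queries starting with `WITH`).

```java
import java.io.IOException;
import java.util.Locale;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of the driver: retries only read-only statements.
final class IdempotentRetry {
    private IdempotentRetry() {}

    /** Naive heuristic: treats a few read-only statement prefixes as safe to retry. */
    static boolean looksReadOnly(String sql) {
        String head = sql.trim().toLowerCase(Locale.ROOT);
        return head.startsWith("select")
                || head.startsWith("show")
                || head.startsWith("describe")
                || head.startsWith("exists");
    }

    /**
     * Runs the request, retrying on IOException up to maxAttempts times when the
     * SQL looks read-only. Anything else (INSERT, DDL, mutations) gets one attempt.
     */
    static <T> T execute(String sql, Callable<T> request, int maxAttempts) throws Exception {
        int attempts = looksReadOnly(sql) ? Math.max(1, maxAttempts) : 1;
        IOException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return request.call();
            } catch (IOException e) {
                last = e; // e.g. "failed to respond" on a stale connection
            }
        }
        throw last; // non-null here: the loop ran at least once without returning
    }
}
```

A caller would wrap the HTTP call, for example `IdempotentRetry.execute(sql, () -> sendRequest(sql), 3)`, where `sendRequest` stands for whatever method performs the request. Inserts are deliberately excluded, since replaying one could write the same rows twice.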

Issue with driver not honoring the server keep-alive timeout settings

4a430a0

pan3793 mentioned this pull request Nov 16, 2020

SocketBuffedWriter flushToTarget Broken Pipe housepower/ClickHouse-Native-JDBC#100

Closed

haiwenzhu mentioned this pull request Nov 25, 2020

ClickHouseRowBinaryStream.writeString throw ArrayIndexOutOfBoundsException #516

Closed

den-crane mentioned this pull request Nov 26, 2020

Broken pipe (Write failed) ClickHouse/ClickHouse#17446

Closed

amitsharma10 changed the base branch from master to develop December 7, 2020 05:10

den-crane mentioned this pull request Dec 27, 2020

ru.yandex.clickhouse.except.ClickHouseUnknownException: ClickHouse exception, code: 1002, host: xxx.xxx.xxx.xxx, port: 8123; xxx.xxx.xxx.xxx:8123 failed to respond #531

Closed

zhicwu merged commit 5e3f399 into ClickHouse:develop Dec 28, 2020

This was linked to issues Dec 29, 2020

Random ClickHouse exception, code: 1002 #478

Closed

NoHttpResponseException while intensive inserts #462

Closed

host failed to respond #452

Closed

zhicwu mentioned this pull request Feb 1, 2021

HTTP keep alive management #290

Closed

zhicwu mentioned this pull request Feb 10, 2021

Fix http connection reuse #505

Closed

mxzlxy mentioned this pull request Apr 30, 2021

Hive-Waterdrop-CK data import issue: retrospective summary cloudnativecube/octopus#55

Open

ryan-tu mentioned this pull request Nov 24, 2021

Validate stale connection to fix the bug: failed to respond #760

Merged
