-
Notifications
You must be signed in to change notification settings - Fork 240
Adds std.parseYaml #339
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds std.parseYaml #339
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
@googlebot I signed it! |
CLAs look good, thanks! ℹ️ Googlers: Go here for more info. |
Thanks!
It certainly seems so. Probably using
It's certainly harder than on the Go side. There are w few libraries for C++ available and yaml-cpp seems to be a pretty reasonable choice. But I have some reservations about vendoring the whole thing (as in, putting the whole thing in our source tree). And I also have reservations about requiring the existing users to provide a shared library dependency. I'm leaning towards a submodule and an option for CMake "USE_SYSTEM_YAML_CPP". It will require some discussion for sure. |
@sbarzowski I've switched to How would you like the code structured? Into a |
Do we know if either of the two proposed implementations are fully compliant with the YAML standard? |
@sparkprime I don't think there exists such thing as "fully compliant YAML implementation". The underlying library in any case is https://github.com./go-yaml/yaml which is the primary implementation in Go which in turn should be reasonably compatible with libyaml. The additional layers here are not about parsing YAML per se, but mapping between JSON and YAML (see: https://github.com./kubernetes-sigs/yaml/blob/master/yaml.go#L166). I am not aware of any standard for that other than common sense. We may want to standardize it on our side, though. |
Hmmm... sounds good. The builtinParseYAML function should stay in builtins and the helpers could go into yaml.go, that sounds good. We may consider moving other YAML-handling stuff (i.e. serialization there too). |
Re failing CI:
|
A good library for decoding yaml using the json unmarshaller (this is what kubernetes does) is https://github.com./ghodss/yaml |
If I understand the code correctly, Kubernetes actually uses https://github.com./kubernetes-sigs/yaml (same as we have in this change) which is a fork of https://github.com./ghodss/yaml |
This is correct I believe. Essentially, behind all the indirection and "Readers", the only useful piece of functionality in use here is this: Perhaps future refactoring could even be directly implemented with |
Fun fact - ghodss was one of the first major Jsonnet adopters about 5 years ago. re: YAML that is not valid JSON -- my first instinct is to reject it with an error, until we decide to broaden the Jsonnet data model to include such values (in which case we would error only if they were attempted to manifest to JSON). So some custom logic that is very conservative and converts from "parsed YAML" to "parsed JSON or error" sounds good. This also helps control divergence between platforms -- keep it simple, etc. |
For a hypothetical CPP version of parseYAML, would integrating with Rapid YAML be acceptable? It seems relatively easy to slot into the existing CMAKE as a git submodule? https://github.com./biojppm/rapidyaml#using-ryml-as-cmake-subproject |
Yes, I think Rapid YAML would be quite a good choice. It is fast and seems to have a reasonable coverage of YAML. We will need to declare compatibility for parseYAML as "best effort" anyway. It's not possible to cover all the boundary cases. The only alternative I see is to define a supported subset of YAML and validate that, which would be a pretty big project by itself. In C++ we will need to support not only CMake, but also pure Makefile and Bazel. |
@sparkprime wondering if there is anything I can do to help unblock this? We're using lots of YAML in the monitoring mixins work, and right now tanka and mixtool have to add a native function. (Also - hey! hows it going? been a while :- ) |
.gitmodules
Outdated
@@ -1,3 +1,4 @@ | |||
[submodule "cpp-jsonnet"] | |||
path = cpp-jsonnet | |||
url = https://github.com./google/jsonnet.git | |||
url = https://github.com./groodt/jsonnet.git |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For testing purposes. Can be reverted when CPP implementation is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave this comment open until then. We should definitely revert to the official one before merging.
There is now broadly parity between the Go and CPP variants of The CPP PR is ready for review here: |
PTAL @sbarzowski |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I'm happy with merging this once (a) C++ version is ready (b) we resolve the questions about streams.
.gitmodules
Outdated
@@ -1,3 +1,4 @@ | |||
[submodule "cpp-jsonnet"] | |||
path = cpp-jsonnet | |||
url = https://github.com./google/jsonnet.git | |||
url = https://github.com./groodt/jsonnet.git |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave this comment open until then. We should definitely revert to the official one before merging.
} | ||
s := sval.getGoString() | ||
|
||
isYamlStream := strings.Contains(s, "---") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can get bitten by some boundary case here. Can it be a part of a multi-line comment? How other implementations handle that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The shotgun parsing is a bit gross, I agree. Anyone adding those characters to a YAML document will be in for a difficult time if they aren't escaping it thoroughly. The .Contains
check can probably be made more explicit to check for start of line or end of line characters if that makes you feel more comfortable?
Alternatively, we can adjust the user facing API. The only reason for "sniffing" whether a document is a "yaml stream" or not is to be perhaps more user-friendly by converting a single YAML document to an object {}
and a stream of YAML documents into an array []
. An alternative user API could be to require a user to choose. e.g. parseYaml
that returns a single object {}
and parseYamlStream
which returns an array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I think you could achieve the same effect in a cleaner way just by checking len(elems)
. The elems will only have multiple elements when there are multiple documents. If it's an array, it will still be just one document. Is that correct?
An alternative user API could be to require a user to choose. e.g. parseYaml that returns a single object {} and parseYamlStream which returns an array.
That could work. It's more complicated, but cleaner in a way. I don't have a strong opinion. We can also add parseYamlStream later (which forces returning of an array, even if there is only one document).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I think you could achieve the same effect in a cleaner way just by checking
len(elems)
.
How do we differentiate between a single YAML document with a scalar list vs a YAML stream of documents?
YAML, like JSON, can have a top level sequence or array. Some examples.
// Single doc, scalar array at root
std.parseYaml(
|||
- {a: 1, b: 2}
- {a: 3, b: 4}
- {a: 5, b: 6}
|||
)
[
{
"a": 1,
"b": 2
},
{
"a": 3,
"b": 4
},
{
"a": 5,
"b": 6
}
]
// Mutli doc YAML stream with document start separators.
std.parseYaml(
|||
---
{a: 1, b: 2}
---
{a: 3, b: 4}
---
{a: 5, b: 6}
|||
)
[
{
"a": 1,
"b": 2
},
{
"a": 3,
"b": 4
},
{
"a": 5,
"b": 6
}
]
// Single doc, scalar array at root. Indentation instead of {
std.parseYaml(
|||
- a: 1
b: 2
- a: 3
b: 4
- a: 5
b: 6
|||
)
[
{
"a": 1,
"b": 2
},
{
"a": 3,
"b": 4
},
{
"a": 5,
"b": 6
}
]
Since there can be scalar arrays at the root, checking the len(elems)
isn't enough because it isn't possible to tell the difference between a yaml stream with 1 document len(elems) == 1
and a single document with a top-level scalar array of length 1.
The approach used in the current implementation assumes that if you specify a YAML stream using ---
you know you will receive an array of 1 or more documents. If you do not specify a YAML stream, you will get whatever YAML object you specified, either a Map or a Seq of length, 0, 1 or more elements.
YAML is complicated! :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there can be scalar arrays at the root, checking the len(elems) isn't enough because it isn't possible to tell the difference between a yaml stream with 1 document len(elems) == 1 and a single document with a top-level scalar array of length 1.
Please correct me if I misunderstand something. I assumed that d.Decode(&elem)
reads one document. So len(elems) will be the number of documents in a stream. If it's only one document (even if it's an array) it would read the whole thing. Is that correct?
Is your point that:
---
foo: bar
and
foo: bar
both produce {foo: "bar"}
, but the user might have expected the array in the first case? Hnnnn... I think it's a valid concern, because when processing streams, it makes it hard to handle special case of 1-element stream.
Sniffing for ---
still seems pretty bad (even if we made sure that it's a line on its own). First, I'm afraid of special cases. Second, IIUC formally it's not a magic symbol for streams, but an explicit start of the document.
Perhaps an explicit parseYamlStream
which always returns the array is the way out of this.
@sparkprime @sh0rez Any thoughts about handling YAML streams?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
both produce
{foo: "bar"}
, but the user might have expected the array in the first case? Hnnnn... I think it's a valid concern, because when processing streams, it makes it hard to handle special case of 1-element stream.
Yes, that's my concern. You are correct in that it attempts to read a single document, but should we treat a 1-element stream differently or not is the question? Should the output of the following 2 examples be different or the same?
std.parseYaml(
|||
---
foo: bar
|||
)
std.parseYaml(
|||
foo: bar
|||
)
What about this one?
std.parseYaml(
|||
---
foo: bar
---
wibble: wobble
|||
)
std.parseYaml(
|||
- foo: bar
- wibble: wobble
|||
)
I think I can be convinced either way to be honest.
Adds
std.parseYaml
to address the YAML aspect of: google/jsonnet#460CPP jsonnet implemented here: google/jsonnet#888