Skip to content

Adds std.parseYaml #339

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
May 20, 2021
Merged

Adds std.parseYaml #339

merged 26 commits into from
May 20, 2021

Conversation

groodt
Copy link
Contributor

@groodt groodt commented Nov 24, 2019

Adds std.parseYaml to address the YAML aspect of: google/jsonnet#460

CPP jsonnet implemented here: google/jsonnet#888

@googlebot
Copy link

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 77.818% when pulling f4d034e on groodt:groodt-parseYaml into 0959f85 on google:master.

@coveralls
Copy link

coveralls commented Nov 24, 2019

Coverage Status

Coverage decreased (-0.06%) to 81.715% when pulling 76fd830 on groodt:groodt-parseYaml into 7d81091 on google:master.

@groodt
Copy link
Contributor Author

groodt commented Nov 24, 2019

@googlebot I signed it!

@googlebot
Copy link

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@sbarzowski
Copy link
Collaborator

Thanks!

I imagine this pulls in too many dependencies

It certainly seems so. Probably using go-yaml directly instead of k8s.io/apimachinery/pkg/util/yaml would help - which seems to be a wrapper around a wrapper (https://github.com./kubernetes-sigs/yaml) around go-yaml. We may also look into using https://github.com./kubernetes-sigs/yaml if that makes sense. But generally every additional level of indirection increases the chance of running into some unexpected magic.

I think the cpp-jsonnet implementation might be tricky. Would you vendor https://github.com./jbeder/yaml-cpp?

It's certainly harder than on the Go side. There are w few libraries for C++ available and yaml-cpp seems to be a pretty reasonable choice. But I have some reservations about vendoring the whole thing (as in, putting the whole thing in our source tree). And I also have reservations about requiring the existing users to provide a shared library dependency. I'm leaning towards a submodule and an option for CMake "USE_SYSTEM_YAML_CPP". It will require some discussion for sure.

@groodt
Copy link
Contributor Author

groodt commented Nov 30, 2019

@sbarzowski I've switched to https://github.com./kubernetes-sigs/yaml. It brings fewer dependencies along as you suggested.

How would you like the code structured? Into a yaml.go file?

@sparkprime
Copy link
Collaborator

Do we know if either of the two proposed implementations are fully compliant with the YAML standard?

@sbarzowski
Copy link
Collaborator

@sparkprime I don't think there exists such thing as "fully compliant YAML implementation". The underlying library in any case is https://github.com./go-yaml/yaml which is the primary implementation in Go which in turn should be reasonably compatible with libyaml.

The additional layers here are not about parsing YAML per se, but mapping between JSON and YAML (see: https://github.com./kubernetes-sigs/yaml/blob/master/yaml.go#L166). I am not aware of any standard for that other than common sense. We may want to standardize it on our side, though.

@sbarzowski
Copy link
Collaborator

How would you like the code structured? Into a yaml.go file?

Hmmm... sounds good. The builtinParseYAML function should stay in builtins and the helpers could go into yaml.go, that sounds good. We may consider moving other YAML-handling stuff (i.e. serialization there too).

@sbarzowski
Copy link
Collaborator

sbarzowski commented Dec 3, 2019

Re failing CI:

@sh0rez
Copy link
Contributor

sh0rez commented Dec 3, 2019

A good library for decoding yaml using the json unmarshaller (this is what kubernetes does) is https://github.com./ghodss/yaml

@sbarzowski
Copy link
Collaborator

sbarzowski commented Dec 3, 2019

If I understand the code correctly, Kubernetes actually uses https://github.com./kubernetes-sigs/yaml (same as we have in this change) which is a fork of https://github.com./ghodss/yaml

@groodt
Copy link
Contributor Author

groodt commented Dec 3, 2019

If I understand the code correctly, Kubernetes actually uses https://github.com./kubernetes-sigs/yaml (same as we have in this change) which is a fork of https://github.com./ghodss/yaml

This is correct I believe. Essentially, behind all the indirection and "Readers", the only useful piece of functionality in use here is this:
https://github.com./kubernetes-sigs/yaml/blob/master/yaml.go#L142

Perhaps future refactoring could even be directly implemented with go-yaml. kubernetes-sigs/yaml doesn't bring much baggage with it, so may not be worth it. Essentially any solution needs to read a YAML file as bytes or string, optionally split at the YAML stream separator (---), unmarshal these individual YAML documents to JSON and accumulate them into an array.

@groodt groodt changed the title Adds prototype std.parseYaml Adds std.parseYaml Dec 3, 2019
@groodt groodt changed the title Adds std.parseYaml Adds std.parseYaml Dec 3, 2019
@sparkprime
Copy link
Collaborator

Fun fact - ghodss was one of the first major Jsonnet adopters about 5 years ago.

re: YAML that is not valid JSON -- my first instinct is to reject it with an error, until we decide to broaden the Jsonnet data model to include such values (in which case we would error only if they were attempted to manifest to JSON).

So some custom logic that is very conservative and converts from "parsed YAML" to "parsed JSON or error" sounds good.

This also helps control divergence between platforms -- keep it simple, etc.

@groodt
Copy link
Contributor Author

groodt commented Aug 13, 2020

For a hypothetical CPP version of parseYAML, would integrating with Rapid YAML be acceptable?

It seems relatively easy to slot into the existing CMAKE as a git submodule? https://github.com./biojppm/rapidyaml#using-ryml-as-cmake-subproject

@sbarzowski
Copy link
Collaborator

sbarzowski commented Aug 13, 2020

Yes, I think Rapid YAML would be quite a good choice. It is fast and seems to have a reasonable coverage of YAML.

We will need to declare compatibility for parseYAML as "best effort" anyway. It's not possible to cover all the boundary cases. The only alternative I see is to define a supported subset of YAML and validate that, which would be a pretty big project by itself.

In C++ we will need to support not only CMake, but also pure Makefile and Bazel.

@tomwilkie
Copy link
Contributor

@sparkprime wondering if there is anything I can do to help unblock this? We're using lots of YAML in the monitoring mixins work, and right now tanka and mixtool have to add a native function.

(Also - hey! hows it going? been a while :- )

.gitmodules Outdated
@@ -1,3 +1,4 @@
[submodule "cpp-jsonnet"]
path = cpp-jsonnet
url = https://github.com./google/jsonnet.git
url = https://github.com./groodt/jsonnet.git
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For testing purposes. Can be reverted when CPP implementation is merged.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave this comment open until then. We should definitely revert to the official one before merging.

@groodt
Copy link
Contributor Author

groodt commented Feb 22, 2021

There is now broadly parity between the Go and CPP variants of std.parseYaml.

The CPP PR is ready for review here:
google/jsonnet#888

@groodt
Copy link
Contributor Author

groodt commented Feb 22, 2021

PTAL @sbarzowski

Copy link
Collaborator

@sbarzowski sbarzowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I'm happy with merging this once (a) C++ version is ready (b) we resolve the questions about streams.

.gitmodules Outdated
@@ -1,3 +1,4 @@
[submodule "cpp-jsonnet"]
path = cpp-jsonnet
url = https://github.com./google/jsonnet.git
url = https://github.com./groodt/jsonnet.git
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave this comment open until then. We should definitely revert to the official one before merging.

}
s := sval.getGoString()

isYamlStream := strings.Contains(s, "---")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can get bitten by some boundary case here. Can it be a part of a multi-line comment? How other implementations handle that?

Copy link
Contributor Author

@groodt groodt Mar 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shotgun parsing is a bit gross, I agree. Anyone adding those characters to a YAML document will be in for a difficult time if they aren't escaping it thoroughly. The .Contains check can probably be made more explicit to check for start of line or end of line characters if that makes you feel more comfortable?

Alternatively, we can adjust the user facing API. The only reason for "sniffing" whether a document is a "yaml stream" or not is to be perhaps more user-friendly by converting a single YAML document to an object {} and a stream of YAML documents into an array []. An alternative user API could be to require a user to choose. e.g. parseYaml that returns a single object {} and parseYamlStream which returns an array.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I think you could achieve the same effect in a cleaner way just by checking len(elems). The elems will only have multiple elements when there are multiple documents. If it's an array, it will still be just one document. Is that correct?

An alternative user API could be to require a user to choose. e.g. parseYaml that returns a single object {} and parseYamlStream which returns an array.

That could work. It's more complicated, but cleaner in a way. I don't have a strong opinion. We can also add parseYamlStream later (which forces returning of an array, even if there is only one document).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I think you could achieve the same effect in a cleaner way just by checking len(elems).

How do we differentiate between a single YAML document with a scalar list vs a YAML stream of documents?

YAML, like JSON, can have a top level sequence or array. Some examples.

// Single doc, scalar array at root
std.parseYaml(
|||
  - {a: 1, b: 2}
  - {a: 3, b: 4}
  - {a: 5, b: 6}
|||
)

[
   {
      "a": 1,
      "b": 2
   },
   {
      "a": 3,
      "b": 4
   },
   {
      "a": 5,
      "b": 6
   }
]
// Mutli doc YAML stream with document start separators.
std.parseYaml(
  |||
   ---
   {a: 1, b: 2}
   ---
   {a: 3, b: 4}
   ---
   {a: 5, b: 6}
  |||
)

[
   {
      "a": 1,
      "b": 2
   },
   {
      "a": 3,
      "b": 4
   },
   {
      "a": 5,
      "b": 6
   }
]
// Single doc, scalar array at root. Indentation instead of {
std.parseYaml(
|||
  - a: 1
    b: 2
  - a: 3
    b: 4
  - a: 5
    b: 6
|||
)

[
   {
      "a": 1,
      "b": 2
   },
   {
      "a": 3,
      "b": 4
   },
   {
      "a": 5,
      "b": 6
   }
]

Since there can be scalar arrays at the root, checking the len(elems) isn't enough because it isn't possible to tell the difference between a yaml stream with 1 document len(elems) == 1 and a single document with a top-level scalar array of length 1.

The approach used in the current implementation assumes that if you specify a YAML stream using --- you know you will receive an array of 1 or more documents. If you do not specify a YAML stream, you will get whatever YAML object you specified, either a Map or a Seq of length, 0, 1 or more elements.

YAML is complicated! :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there can be scalar arrays at the root, checking the len(elems) isn't enough because it isn't possible to tell the difference between a yaml stream with 1 document len(elems) == 1 and a single document with a top-level scalar array of length 1.

Please correct me if I misunderstand something. I assumed that d.Decode(&elem) reads one document. So len(elems) will be the number of documents in a stream. If it's only one document (even if it's an array) it would read the whole thing. Is that correct?

Is your point that:

---
foo: bar

and

foo: bar

both produce {foo: "bar"}, but the user might have expected the array in the first case? Hnnnn... I think it's a valid concern, because when processing streams, it makes it hard to handle special case of 1-element stream.

Sniffing for --- still seems pretty bad (even if we made sure that it's a line on its own). First, I'm afraid of special cases. Second, IIUC formally it's not a magic symbol for streams, but an explicit start of the document.

Perhaps an explicit parseYamlStream which always returns the array is the way out of this.

@sparkprime @sh0rez Any thoughts about handling YAML streams?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both produce {foo: "bar"}, but the user might have expected the array in the first case? Hnnnn... I think it's a valid concern, because when processing streams, it makes it hard to handle special case of 1-element stream.

Yes, that's my concern. You are correct in that it attempts to read a single document, but should we treat a 1-element stream differently or not is the question? Should the output of the following 2 examples be different or the same?

std.parseYaml(
  |||
   ---
   foo: bar
  |||
)
std.parseYaml(
|||
   foo: bar
|||
)

What about this one?

std.parseYaml(
  |||
   ---
   foo: bar
   ---
   wibble: wobble
  |||
)
std.parseYaml(
|||
  - foo: bar
  - wibble: wobble
|||
)

I think I can be convinced either way to be honest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants