Skip to content

Collapse space when implying value from textContents #51

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
njkleiner opened this issue May 28, 2020 · 6 comments
Closed

Collapse space when implying value from textContents #51

njkleiner opened this issue May 28, 2020 · 6 comments
Assignees
Labels
experimental Relates to experimental microformat parasing

Comments

@njkleiner
Copy link
Contributor

njkleiner commented May 28, 2020

Describe the bug

When implying the value property for a nested microformat (e.g., h-adr inside h-entry) from the HTML textContents, multiple successive whitespace characters should be collapsed to a single space character.

To Reproduce

HTML input:

<div class="h-entry">
    <span class="p-location h-adr">
        <span class="p-locality">Berlin</span>,
        <span class="p-region">Berlin</span>,
        <span class="p-country-name">DE</span>
        <data class="p-latitude" value="52.518606"></data>
        <data class="p-longitude" value="13.376127"></data>
    </span>
</div>

Expected behavior

Correct JSON output:

{
    "items": [
        {
            "type": [
                "h-entry"
            ],
            "properties": {
                "location": [
                    {
                        "type": [
                            "h-adr"
                        ],
                        "properties": {
                            "locality": [
                                "Berlin"
                            ],
                            "region": [
                                "Berlin"
                            ],
                            "country-name": [
                                "DE"
                            ],
                            "latitude": [
                                "52.518606"
                            ],
                            "longitude": [
                                "13.376127"
                            ]
                        },
                        "value": "Berlin, Berlin, DE"
                    }
                ]
            }
        }
    ],
    "rels": {},
    "rel-urls": {},
}

Actual JSON output:

{
  "rels": {},
  "rel-urls": {},
  "items": [
    {
      "type": [
        "h-entry"
      ],
      "properties": {
        "location": [
          {
            "type": [
              "h-adr"
            ],
            "properties": {
              "locality": [
                "Berlin"
              ],
              "region": [
                "Berlin"
              ],
              "country-name": [
                "DE"
              ],
              "latitude": [
                "52.518606"
              ],
              "longitude": [
                "13.376127"
              ]
            },
            "value": "Berlin,\n        Berlin,\n        DE"
          }
        ]
      }
    }
  ]
}

Note the difference Berlin, Berlin, DE vs. Berlin,\n Berlin,\n DE.

Additional context

From what I can tell, this is not actually part of the specification, it seems to be commonly accepted though, as both the PHP parser and the Python parser do this.

@njkleiner njkleiner added the bug Something isn't working label May 28, 2020
@aimee-gm aimee-gm added question Further information is requested and removed bug Something isn't working question Further information is requested labels May 29, 2020
@aimee-gm
Copy link
Member

Implementing this change causes several failures with the microformats test suite (https://github.com./microformats/tests/tree/master/tests/microformats-v2).

These tests also fail using the PHP parser.

I would like to use the microformats test suite as a "source of truth" for how a parser should work. I don't think this is a bug, but a behaviour change. It could be implemented behind an experimental toggle?

To change this parsing behaviour without an experimental toggle, this will either need to change the parsing specification or the test suite.

@aimee-gm aimee-gm added the experimental Relates to experimental microformat parasing label May 29, 2020
@aimee-gm
Copy link
Member

The PHP parser has an open PR to use the test suite, I would be interested in what changes are being made with the way they parse these test scenarios, if they do?

@njkleiner
Copy link
Contributor Author

Implementing this change causes several failures with the microformats test suite (microformats/tests:tests/microformats-v2@master).

I didn't realize that was the case. That's interesting considering that this behavior seems to be somewhat common.

I would like to use the microformats test suite as a "source of truth" for how a parser should work.

Definitely makes sense.

To change this parsing behaviour without an experimental toggle, this will either need to change the parsing specification or the test suite.

I've gone ahead and created a quick overview of how it's implemented across some projects. I think the pattern there is quite interesting.

Name Uses test suite Implements collapse whitespace
php-mf2 ✔️ (default)
mf2py ✔️ (default)
microformat-shiv ✔️ (feature)
micromicro ✔️
microformats-parser ✔️ ?

Given that it's the default behavior in some parsers and it's arguably useful, I think we should implement it (behind a feature flag, for now).

Also, I think there's definitely a discussion to be had about how this fits in with the official specification.

@Zegnat
Copy link
Member

Zegnat commented May 30, 2020

Background: we found that user expectations did not really match the parsing spec in all cases (e.g. when it comes to consequtive whitespace). This is being discussed as a spec issue.

PHP and Python both implement a version of an algorithm I wrote out. I am saying “a version of” as I am not actually completely sure on the details anymore and would not want to claim they match completely. (For even more complexity, there has been a try to find out what is needed to match browser specs more closely. Again in PHP and Python.)

The PHP parser has an open PR to use the test suite, I would be interested in what changes are being made with the way they parse these test scenarios, if they do?

I am cheating.

When running the tests from the test suite I default to the text logic that we had before the new whitespace patch landed. See commit microformats/php-mf2@4d46586. This basically reverts a commit made in March 2018, but only for the purpose of running the tests.

I hope that clears up some questions!

@aimee-gm
Copy link
Member

aimee-gm commented Jun 1, 2020

@njkleiner thank you for the comprehensive comparison of parsers for this issue 🙂 it's very helpful!

I have opened a draft pull (#52) request to add an experimental option to enable this.

At present, it only collapses whitespace in properties and values (it does not apply to rels, but I haven't though about if it should handle these yet), and does not do any of the whitespace algorithm described by @Zegnat - although I think this would be the way to go with this experimental option.

@aimee-gm aimee-gm self-assigned this Jun 7, 2020
@aimee-gm
Copy link
Member

aimee-gm commented Jun 8, 2020

@njkleiner with v1.4.0 there's now support for the textContent experimental flag that implements the improved text content handling.

I am considering how we can enable some of these experimental options, perhaps by default at some point.

@aimee-gm aimee-gm closed this as completed Jun 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
experimental Relates to experimental microformat parasing
Projects
None yet
Development

No branches or pull requests

3 participants