feat(experimental): begin support for text content whitespace handling #52

Merged
merged 6 commits into from
Jun 7, 2020
8 changes: 7 additions & 1 deletion README.md
@@ -16,6 +16,7 @@ Follows the [microformats2 parsing specification](http://microformats.org/wiki/m
  - [Microformats v2](#microformats-v2)
  - [Experimental options](#experimental-options)
  - [`lang`](#lang)
+ - [`textContent`](#textcontent)
  - [Contributing](#contributing)

## Quick start
@@ -78,6 +79,7 @@ Use: `mf2(html: string, options: { baseUrl: string, experimental: object })`
  - `baseUrl` (string, required) - a base URL to resolve relative URLs
  - `experimental` (object, optional) - experimental (non-standard) options
  - `lang` (boolean, optional) - enable support for parsing `lang` attributes
+ - `textContent` (boolean, optional) - enable support for better collapsing whitespace in text content.

Returns the parsed microformats from the HTML string

@@ -89,7 +91,7 @@ This package will parse microformats v1, however support will be limited to the

### Microformats v2

- We provide support for all mircroformats v2 parsing, as detailed in the [microformats2 parsing specification](http://microformats.org/wiki/microformats2-parsing). If there is an issue with v2 parsing, please create an issue.
+ We provide support for all microformats v2 parsing, as detailed in the [microformats2 parsing specification](http://microformats.org/wiki/microformats2-parsing). If there is an issue with v2 parsing, please create an issue.

### Experimental options

@@ -103,6 +105,10 @@ Parse microformats for `lang` attributes. This will include `lang` on microforma

These are sourced from the element themselves, a parent microformat, the HTML document or a meta tag.

+ #### `textContent`
+
+ When parsing microformats for text content, all the consecutive whitespace is collapsed into a single space. `<br/>` and `<p>` tags are treated as line breaks.
+
## Contributing

See our [contributing guidelines](./CONTRIBUTING.md) for more information.
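
A minimal sketch of the collapsing rule described in the new `textContent` README section, assuming the tree walk has already turned `<br/>` and `<p>` into `\n`. `collapseWhitespace` is a hypothetical name used only for illustration; the replace chain mirrors the one in this PR's `experimentalTextContent`.

```typescript
// Hypothetical helper: not part of the package's public API.
// Input is assumed to be the string produced by the tree walk,
// where <br/> and <p> have already been emitted as "\n".
const collapseWhitespace = (walked: string): string =>
  walked
    .replace(/ +/g, " ") // runs of spaces become a single space
    .replace(/ ?\n ?/g, "\n") // drop one space on either side of a line break
    .trim(); // strip leading and trailing whitespace

const example = collapseWhitespace("\n  Hello    world  \n  second   line ");
console.log(JSON.stringify(example)); // "Hello world\nsecond line"
```

Only spaces are collapsed here, so consecutive line breaks survive as written.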
3 changes: 2 additions & 1 deletion demo/demo.js
@@ -30,6 +30,7 @@ window.parseHtml = () => {
const html = document.getElementById("html").value;
const baseUrl = document.getElementById("base-url").value;
const lang = document.getElementById("lang").checked;
+ const textContent = document.getElementById("textContent").checked;

- return parse(html, { baseUrl, experimental: { lang } });
+ return parse(html, { baseUrl, experimental: { lang, textContent } });
};
10 changes: 10 additions & 0 deletions demo/index.html
@@ -62,6 +62,16 @@ <h3>Experimental options</h3>
<input type="checkbox" name="lang" id="lang" value="true" checked />
<span>Language parsing</span>
</label>
+ <label>
+ <input
+ type="checkbox"
+ name="textContent"
+ id="textContent"
+ value="true"
+ checked
+ />
+ <span>Better text content</span>
+ </label>
</p>

<div class="submit">
9 changes: 5 additions & 4 deletions src/helpers/documentSetup.ts
@@ -38,7 +38,8 @@ export const findBase = (node: DefaultTreeElement): string | undefined => {

const handleNode = (
node: DefaultTreeElement,
- result: DocumentSetupResult
+ result: DocumentSetupResult,
+ options: ParserOptions
): void => {
for (const i in node.childNodes) {
const child = node.childNodes[i];
@@ -97,13 +98,13 @@ const handleNode = (
}

if (isRel(child)) {
- parseRel(child, result);
+ parseRel(child, result, options);
}

/**
* Repeat this process for this node's children
*/
- handleNode(child, result);
+ handleNode(child, result, options);
}
};

@@ -119,7 +120,7 @@ export const documentSetup = (
lang: undefined,
};

- handleNode(node, result);
+ handleNode(node, result, options);

return result;
};
12 changes: 12 additions & 0 deletions src/helpers/experimental.ts
@@ -0,0 +1,12 @@
+ import { ExperimentalName, ParserOptions } from "../types";
+
+ export const isEnabled = (
+ options: ParserOptions,
+ flag: ExperimentalName
+ ): boolean => {
+ if (!options || !options.experimental) {
+ return false;
+ }
+
+ return options.experimental[flag] || false;
+ };
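
For reviewers, a self-contained sketch of how `isEnabled` behaves for the common option shapes. The `ExperimentalName` and `ParserOptions` declarations below are local stand-ins: `../types` is not part of this diff, so their exact shape is an assumption based on the `Options` interface in `src/index.ts`.

```typescript
// Assumed shapes; the real definitions live in "../types" and may differ.
type ExperimentalName = "lang" | "textContent";

interface ParserOptions {
  baseUrl: string;
  experimental?: { lang?: boolean; textContent?: boolean };
}

// Same logic as the helper added in this PR.
const isEnabled = (options: ParserOptions, flag: ExperimentalName): boolean => {
  if (!options || !options.experimental) {
    return false;
  }

  return options.experimental[flag] || false;
};

const baseUrl = "https://example.com";

const results = [
  isEnabled({ baseUrl }, "textContent"), // no experimental object
  isEnabled({ baseUrl, experimental: {} }, "textContent"), // flag missing
  isEnabled({ baseUrl, experimental: { lang: true } }, "textContent"), // other flag set
  isEnabled({ baseUrl, experimental: { textContent: true } }, "textContent"), // flag set
];

console.log(results); // [ false, false, false, true ]
```

A flag therefore only counts as enabled when it is explicitly set to `true`.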
87 changes: 79 additions & 8 deletions src/helpers/textContent.ts
@@ -2,6 +2,12 @@ import { DefaultTreeNode, DefaultTreeElement } from "parse5";

import { getAttributeValue } from "./attributes";
import { isElement, isTextNode } from "./nodeMatchers";
+ import { ParserOptions } from "../types";
+ import { isEnabled } from "./experimental";
+
+ const imageValue = (node: DefaultTreeElement): string | undefined =>
+ getAttributeValue(node, "alt")?.trim() ??
+ getAttributeValue(node, "src")?.trim();

const walk = (current: string, node: DefaultTreeNode): string => {
/* istanbul ignore else */
@@ -11,8 +17,7 @@ const walk = (current: string, node: DefaultTreeNode): string => {
}

if (node.tagName === "img") {
- const value =
- getAttributeValue(node, "alt") ?? getAttributeValue(node, "src");
+ const value = imageValue(node);

if (value) {
return `${current} ${value} `;
@@ -49,11 +54,77 @@ const impliedWalk = (current: string, node: DefaultTreeNode): string => {
return current;
};

- export const textContent = (node: DefaultTreeElement): string =>
- node.childNodes.reduce<string>(walk, "").trim();
+ const experimentalWalk = (current: string, node: DefaultTreeNode): string => {
+ /* istanbul ignore else */
+ if (isElement(node)) {
+ if (["style", "script"].includes(node.tagName)) {
+ return current;
+ }

- export const impliedTextContent = (node: DefaultTreeElement): string =>
- node.childNodes.reduce<string>(impliedWalk, "").trim();
+ if (node.tagName === "img") {
+ const value = imageValue(node);

- export const relTextContent = (node: DefaultTreeElement): string =>
- node.childNodes.reduce<string>(impliedWalk, "");
+ if (value) {
+ return `${current} ${value} `;
+ }
+ }
+
+ if (node.tagName === "br") {
+ return `${current}\n`;
+ }
+
+ if (node.tagName === "p") {
+ return node.childNodes.reduce<string>(experimentalWalk, `${current}\n`);
+ }
+
+ return node.childNodes.reduce<string>(experimentalWalk, current);
+ } else if (isTextNode(node)) {
+ const value = node.value.replace(/[\t\n\r]/g, " ");
+ if (value) {
+ return `${current}${value}`;
+ }
+ }
+
+ /* istanbul ignore next */
+ return current;
+ };
+
+ const experimentalTextContent = (node: DefaultTreeElement): string =>
+ node.childNodes
+ .reduce<string>(experimentalWalk, "")
+ .replace(/ +/g, " ")
+ .replace(/ ?\n ?/g, "\n")
+ .trim();
+
+ export const textContent = (
+ node: DefaultTreeElement,
+ options: ParserOptions
+ ): string => {
+ if (isEnabled(options, "textContent")) {
+ return experimentalTextContent(node);
+ }
+
+ return node.childNodes.reduce<string>(walk, "").trim();
+ };
+
+ export const impliedTextContent = (
+ node: DefaultTreeElement,
+ options: ParserOptions
+ ): string => {
+ if (isEnabled(options, "textContent")) {
+ return experimentalTextContent(node);
+ }
+
+ return node.childNodes.reduce<string>(impliedWalk, "").trim();
+ };
+
+ export const relTextContent = (
+ node: DefaultTreeElement,
+ options: ParserOptions
+ ): string => {
+ if (isEnabled(options, "textContent")) {
+ return experimentalTextContent(node);
+ }
+
+ return node.childNodes.reduce<string>(impliedWalk, "");
+ };
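
To make the walk easier to follow, here is a rough sketch of what `experimentalWalk` emits before the space-collapsing step runs. It uses hypothetical, simplified node shapes (`SketchText`, `SketchElement`) in place of parse5's `DefaultTreeNode`, and it models only the `alt` attribute for images, whereas the PR also falls back to `src`.

```typescript
// Hypothetical node shapes for illustration; the PR works on parse5 nodes.
type SketchText = { kind: "text"; value: string };
type SketchElement = {
  kind: "element";
  tagName: string;
  alt?: string;
  childNodes: SketchNode[];
};
type SketchNode = SketchText | SketchElement;

const sketchWalk = (current: string, node: SketchNode): string => {
  if (node.kind === "text") {
    // Tabs and newlines inside text nodes become plain spaces.
    return current + node.value.replace(/[\t\n\r]/g, " ");
  }
  if (node.tagName === "style" || node.tagName === "script") {
    return current; // contents are skipped entirely
  }
  if (node.tagName === "img") {
    // Simplification: only alt is modelled here.
    return node.alt ? `${current} ${node.alt.trim()} ` : current;
  }
  if (node.tagName === "br") {
    return `${current}\n`;
  }
  // <p> starts on a new line; any other element just contributes its children.
  const start = node.tagName === "p" ? `${current}\n` : current;
  return node.childNodes.reduce<string>(sketchWalk, start);
};

const tree: SketchNode[] = [
  { kind: "text", value: "Hello\n\tworld" },
  { kind: "element", tagName: "br", childNodes: [] },
  {
    kind: "element",
    tagName: "p",
    childNodes: [{ kind: "text", value: "second  line" }],
  },
  {
    kind: "element",
    tagName: "script",
    childNodes: [{ kind: "text", value: "ignored()" }],
  },
];

const walked = tree.reduce<string>(sketchWalk, "");
console.log(JSON.stringify(walked)); // "Hello  world\n\nsecond  line"
```

The doubled spaces in this intermediate string are what the `.replace(/ +/g, " ")` call in `experimentalTextContent` then reduces to single spaces.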
10 changes: 6 additions & 4 deletions src/helpers/valueClassPattern.ts
@@ -4,6 +4,7 @@ import { getAttributeValue, hasClassName } from "./attributes";
import { textContent } from "./textContent";
import { findChildren } from "./findChildren";
import { isValueClass } from "./nodeMatchers";
+ import { ParsingOptions } from "../types";

interface Options {
datetime: boolean;
@@ -54,23 +55,24 @@ const handleDate = (dateStrings: string[]): string | undefined =>

export const valueClassPattern = (
node: DefaultTreeElement,
- { datetime }: Partial<Options> = {}
+ options: ParsingOptions & Partial<Options>
): string | undefined => {
const values = findChildren(node, isValueClass);

if (!values.length) {
return;
}

- if (datetime) {
+ if (options.datetime) {
const date = values.map(
- (node) => datetimeProp(node) ?? valueTitle(node) ?? textContent(node)
+ (node) =>
+ datetimeProp(node) ?? valueTitle(node) ?? textContent(node, options)
);
return handleDate(date);
}

return values
- .map((node) => valueTitle(node) ?? textContent(node))
+ .map((node) => valueTitle(node) ?? textContent(node, options))
.join("")
.trim();
};
6 changes: 4 additions & 2 deletions src/implied/name.ts
@@ -3,6 +3,7 @@ import { DefaultTreeElement } from "parse5";
import { impliedTextContent } from "../helpers/textContent";
import { isElement } from "../helpers/nodeMatchers";
import { getClassNames, getAttributeIfTag } from "../helpers/attributes";
+ import { ParsingOptions } from "../types";

const parseNode = (node: DefaultTreeElement): string | undefined =>
getAttributeIfTag(node, ["img", "area"], "alt") ??
@@ -20,7 +21,8 @@ const parseGrandchild = (node: DefaultTreeElement): string | undefined => {

export const impliedName = (
node: DefaultTreeElement,
- children: DefaultTreeElement[]
+ children: DefaultTreeElement[],
+ options: ParsingOptions
): string | undefined => {
if (children.some((child) => getClassNames(child, /^(p|e|h)-/).length)) {
return;
@@ -30,6 +32,6 @@ export const impliedName = (
parseNode(node) ??
parseChild(node) ??
parseGrandchild(node) ??
- impliedTextContent(node)
+ impliedTextContent(node, options)
);
};
1 change: 1 addition & 0 deletions src/index.ts
@@ -6,6 +6,7 @@ export interface Options {
baseUrl: string;
experimental?: {
lang?: boolean;
+ textContent?: boolean;
};
}

8 changes: 5 additions & 3 deletions src/microformats/parse.ts
@@ -16,6 +16,7 @@ import {
} from "../backcompat";
import { applyIncludesToRoot } from "../helpers/includes";
import { parseE } from "./property";
+ import { isEnabled } from "../helpers/experimental";

interface ParseMicroformatOptions extends ParsingOptions {
valueType?: PropertyType;
@@ -58,7 +59,7 @@ export const parseMicroformat = (
item.id = id;
}

- if (options.experimental?.lang && lang) {
+ if (isEnabled(options, "lang") && lang) {
item.lang = lang;
}

@@ -72,12 +73,13 @@ export const parseMicroformat = (
item.value =
(item.properties.name && item.properties.name[0]) ??
getAttributeValue(node, "title") ??
- textContent(node);
+ textContent(node, options);
}

if (options.valueType === "u") {
item.value =
- (item.properties.url && item.properties.url[0]) ?? textContent(node);
+ (item.properties.url && item.properties.url[0]) ??
+ textContent(node, options);
}

/**
2 changes: 1 addition & 1 deletion src/microformats/properties.ts
@@ -58,7 +58,7 @@ export const microformatProperties = (
if (typeof properties.name === "undefined") {
addProperty(properties, {
key: "name",
- value: impliedName(node, propertyNodes),
+ value: impliedName(node, propertyNodes, options),
});
}
