Expand previously-opened-element check to include all currently-open elements#4
Expand previously-opened-element check to include all currently-open elements#4willforde merged 4 commits intowillforde:masterfrom openculinary:issue-3/previously-closed-tag-check
Conversation
tests/test_module.py
Outdated
| assert "".join(body.itertext()) == "sample text content" | ||
|
|
||
| def test_text_iterator_unclosed_tag(): | ||
| html = "<html><body><span>hello <div>to <div>the <div>world!</span></body><footer>unrelated</footer></html>" |
There was a problem hiding this comment.
Note that this test coverage is intentionally designed not to make any assumptions about the tree/node structure built in-memory by the parser.. because I'm not sure about the implications yet.
(for example: div<div<div>>> (nested nodes) is one possible in-memory representation, and div<>div<>div<> (sibling nodes) is another) -- either of those produce the same itertext results in this test case, but it's possible to imagine representations where descendent text content would be different depending on the parse tree)
There was a problem hiding this comment.
...and it turns out, the tree structure produced by lxml when 'implied end tags' are generated depends on the elements encountered! (by design)
A series of unclosed div elements will turn into nested nodes, whereas a series of unclosed li items will result in sibling nodes.
It's all spec'd in the whatwg HTML5 spec, but it's kinda complicated involved (clarify: not complicated.. a lot of rules, though). Both the current node and the previous node are relevant when deciding how to emit implied closing tags, and there are custom rules for various different elements.
Uncertain what to do here - I like that this library is simple and fast. The reason I'm contributing is to see whether ScrapingHub's extruct library could use htmlement instead of lxml, allowing me to build pure-Python containers for openculinary.
Recent context here: scrapinghub/extruct#163 (comment) (and longer story earlier in that same issue)
In short: some of the test cases in extruct relate to tree-order traversal of parsed HTML nodes for text content retrieval, which is what led me to all this.
There was a problem hiding this comment.
Yeah that is a bit of an issue alright. While Htmlement's approach is fast, it's also very naive, deliberately so.
I can add more logic to catch some of the more common closing tag issues, but it would never be fully compliant.
|
Hello. This is a good change, thanks. There shouldn't be much of a performance hit with this change, as it shouldn't be hit too often. I didn't plan for more than one malformed end tag, that was a bit short sighted on my part. Thanks again. |
|
You're welcome, glad to contribute - this seemed to make one small step towards migration away from As commented above, I'm not sure whether that migration is practical or straightforward after all -- not a fault of |
Could resolve #3 - also has the potential to reduce performance, though.