An interesting article I read recently, somewhat tangentially related to L&L.
Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks. Yet, their capabilities for HTML understanding – i.e., parsing the raw HTML of a webpage, with applications to automation of web-based tasks, crawling, and browser-assisted retrieval – have not been fully explored.
We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks: (i) Semantic Classification of HTML elements, (ii) Description Generation for HTML inputs, and (iii) Autonomous Web Navigation of HTML pages.
…Out of the LLMs we evaluate, we show evidence that T5-based models are ideal due to their bidirectional encoder-decoder architecture. To promote further research on LLMs for HTML understanding, we create and open-source a large-scale HTML dataset distilled and auto-labeled from CommonCrawl.
This is more about HTML reading comprehension. For our interests, I suppose we would want the opposite: generating HTML templates from human-language input.
In general, I imagine it will become common to “write code” and develop websites by having a chat with an AI assistant and describing the result we want. But I wonder if such a conversational interface, by text or voice, is any faster or more intuitive than typing code on a keyboard, or “writing code” by GUI, through visual interface and interaction.
It could be that the best way is to integrate them into a multi-modal interface, so people who are more verbal thinkers can build a website by talking to it; visual thinkers can use mouse, touch, pen, to draw websites into existence; and programmers can write them as good ol’ code.
Speaking of GUI, I learned about structure/projectional editors, which sounded like a suitable design for a visual builder for L&L templates.
A structure editor is any document editor that is cognizant of the document’s underlying structure.
Structure editors can be used to edit hierarchical or marked up text, computer programs, diagrams, chemical formulas, and any other type of content with clear and well-defined structure. In contrast, a text editor is any document editor used for editing plain text files.
The Gutenberg block editor fits this description, but it's too high-level an abstraction for L&L templates - we need more granularity and detail, to be able to edit every tag and attribute.
A projectional editor allows the user to edit the abstract syntax tree (AST) representation of code in an efficient way.
It can mimic the behavior of a textual editor for textual notations, a diagram editor for graphical languages, a tabular editor for editing tables and so on. The user interacts with the code through intuitive on-screen visuals which they can even switch between for multiple displays of the same code.
There’s a direct mapping between the textual and visual representations of code, and these editor modes provide different views into the same data structure.
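To make that mapping concrete, here's a minimal sketch of the idea: one element tree (the "AST"), projected two different ways. The `Node` class and both renderers are my own illustration, not taken from any particular editor - editing the tree once updates every projection consistently.

```python
# Minimal sketch of projectional editing: one tree, two projections.
# The Node class and both renderers are illustrative, not from any real editor.

class Node:
    def __init__(self, tag, attrs=None, children=None, text=""):
        self.tag = tag
        self.attrs = attrs or {}
        self.children = children or []
        self.text = text

def to_html(node):
    """Project the tree as textual HTML markup."""
    attrs = "".join(f' {k}="{v}"' for k, v in node.attrs.items())
    inner = node.text + "".join(to_html(c) for c in node.children)
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"

def to_outline(node, depth=0):
    """Project the same tree as an indented outline, as a visual editor might."""
    lines = ["  " * depth + node.tag + (f" {node.attrs}" if node.attrs else "")]
    for child in node.children:
        lines.extend(to_outline(child, depth + 1))
    return lines

tree = Node("section", {"class": "hero"}, [
    Node("h1", text="Hello"),
    Node("p", text="World"),
])

# Editing the tree - not the text - updates every projection at once.
tree.attrs["id"] = "top"
print(to_html(tree))
print("\n".join(to_outline(tree)))
```

In a real projectional editor the user never touches the serialized text directly; both views are just renderings of the stored structure, which is why they can't drift apart.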
we create and open-source a large-scale HTML dataset distilled and auto-labeled from CommonCrawl
Curious about this dataset mentioned in the article. Maybe we can use it to stress-test and optimize L&L's HTML parser. (CommonCrawl is an open repository of web crawl data.) I couldn't find any links in the paper, but I did find discussion about it on GitHub - will keep an eye on it.
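The harness for such a stress test could be quite small. A sketch below, using Python's stdlib `html.parser` purely as a stand-in for L&L's own parser (which I haven't benchmarked), with a toy corpus where crawl-derived pages would go:

```python
# Sketch of a parser stress-test harness. html.parser is a stand-in;
# the real target would be L&L's own HTML parser.
import time
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Counts opening tags as a cheap proxy for 'the document parsed'."""
    def __init__(self):
        super().__init__()
        self.tags = 0

    def handle_starttag(self, tag, attrs):
        self.tags += 1

def stress_test(documents):
    """Parse each document and time it; returns (tag_count, seconds) pairs."""
    results = []
    for html in documents:
        counter = TagCounter()
        start = time.perf_counter()
        counter.feed(html)
        elapsed = time.perf_counter() - start
        results.append((counter.tags, elapsed))
    return results

# Toy corpus; a CommonCrawl-derived dataset would supply real, messy pages.
corpus = ["<div><p>ok</p></div>", "<ul><li>a<li>b</ul>"]
for tags, secs in stress_test(corpus):
    print(f"{tags} tags parsed in {secs:.6f}s")
```

The second toy document has unclosed `<li>` tags on purpose - real crawl data is full of malformed markup like that, which is exactly what makes it useful for stress testing.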