LLMS Central - The Robots.txt for AI

Show HN: HTML to Markdown with CSS selector and XPath annotations

Github.com · 2 min read

Original Article Summary

HTML-to-Markdown converters produce clean, readable content for both humans and LLMs — but the DOM structure is lost along the way. You can always feed Markdown to an LLM to extract structured information, but that costs tokens on every page, every time. What …

Read full article at Github.com

Our Analysis

Lightfeed's Scrapedown, an HTML-to-Markdown converter that annotates its output with CSS selectors and XPath expressions, preserves DOM structure through the conversion process. This matters because it yields clean, readable content for both humans and large language models (LLMs) without discarding the structure of the underlying HTML document.

For website owners, this means HTML content can be converted to Markdown while retaining an annotated structure — useful for tracking AI bot traffic and managing llms.txt files. With the DOM structure preserved, owners can better understand how LLMs interact with their content and make more informed decisions about their AI content policies.

To take advantage of this development, website owners can:

1. Integrate Scrapedown into their content workflow to convert HTML to Markdown with the DOM structure preserved.
2. Use the annotated structure to refine their llms.txt files and improve AI bot tracking.
3. Monitor how the converted content affects their site's interaction with LLMs, to optimize their content strategy and reduce token costs.
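To make the core idea concrete, here is a minimal sketch of selector-annotated conversion — not Scrapedown's actual API, just an illustration using Python's standard-library `html.parser`: each Markdown block is tagged with a CSS-selector-like path back to its source element, so structure survives the conversion.

```python
# Illustrative sketch (NOT Scrapedown's real API): convert simple HTML
# to Markdown while annotating each block with a selector-style path.
from html.parser import HTMLParser


class AnnotatingConverter(HTMLParser):
    """Emit Markdown lines annotated with each element's selector path."""

    BLOCK_MD = {"h1": "# ", "h2": "## ", "p": ""}  # tag -> Markdown prefix

    def __init__(self):
        super().__init__()
        self.path = []   # stack of currently open tags, e.g. ["body", "div", "p"]
        self.lines = []  # annotated Markdown output lines

    def handle_starttag(self, tag, attrs):
        self.path.append(tag)

    def handle_endtag(self, tag):
        if tag in self.path:
            # pop back to (and including) the matching open tag
            while self.path and self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.path:
            return
        tag = self.path[-1]
        if tag in self.BLOCK_MD:
            selector = " > ".join(self.path)
            self.lines.append(f"{self.BLOCK_MD[tag]}{text}  <!-- {selector} -->")


converter = AnnotatingConverter()
converter.feed("<body><h1>Title</h1><div><p>Hello world</p></div></body>")
print("\n".join(converter.lines))
# Prints:
# # Title  <!-- body > h1 -->
# Hello world  <!-- body > div > p -->
```

Because the selector annotation rides along as an HTML comment, the Markdown stays readable while a downstream extractor can map any line back to its DOM location without re-prompting an LLM.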

Related Topics

Bots
