We just shipped a slate of improvements to our article parsing engine, affecting over 100,000 pieces of saved content and resulting in a significantly more reliable reading experience.

Parsing is the dark art of turning messy webpages into clean articles. It's the technical foundation that powers Matter's distraction-free reader mode.

Over the past month, we've improved our parsing system in ways big and small. Below is a sample, in no particular order. For more detail, check out this Twitter thread.

  • We now handle JavaScript and cookie requirements that previously broke parsing
  • We now present subtitles for publishers like NYT, The Atlantic, WSJ, and The New Yorker
  • We now handle Reddit posts
  • We now handle LinkedIn posts
  • We now parse lazy-loaded images that require scrolling
  • Misattributed publisher names have been systematically fixed
  • Improved detection of publish date
  • Improved Daring Fireball cosmetics
  • We now handle multipage articles from MacStories
  • Fixed an issue where italicized words in newsletters were sometimes concatenated
  • GitHub repositories are now parsed and rendered based on their README files
  • Improved handling of embedded Tweets in web articles
  • Improved Tweet rendering for forwarded Substack newsletters
  • Improved parsing for The Wall Street Journal (strip extraneous elements, don't drop images)
  • Improved parsing for The Atlantic
  • Substack footnotes no longer cause breaks
  • Paginated Ars Technica articles are now properly parsed
  • Posts from a16z are now properly parsed
  • Fixed an issue where multiple nested figures caused images to be dropped
  • Fixed an issue where the title was sometimes duplicated in the article body of newsletters
  • Most exciting, we've developed a new protocol for ongoing improvement.
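
To give a flavor of what one of these fixes can involve, here is a minimal sketch of one common way to recover lazy-loaded images without scrolling the page. It is not a description of how Matter's parser works. The function name `resolve_image_url` and the attribute list are assumptions chosen for illustration; real sites use many different lazy-loading conventions.

```python
from typing import Mapping, Optional

# Attribute names often used by lazy-loading scripts to hold the real URL.
# This list is an assumption for illustration, not an exhaustive standard.
LAZY_ATTRS = ("data-src", "data-original", "data-lazy-src")


def resolve_image_url(attrs: Mapping[str, str]) -> Optional[str]:
    """Pick the most plausible real image URL from an <img> tag's attributes."""
    # 1. Prefer an explicit lazy-loading attribute if one is present.
    for name in LAZY_ATTRS:
        value = attrs.get(name, "").strip()
        if value:
            return value

    # 2. Fall back to src, unless it is an inline data: URI placeholder.
    src = attrs.get("src", "").strip()
    if src and not src.startswith("data:"):
        return src

    # 3. Otherwise take the first candidate from a (lazy or regular) srcset.
    #    Splitting on commas is a simplification: URLs may contain commas.
    srcset = attrs.get("data-srcset") or attrs.get("srcset") or ""
    for candidate in srcset.split(","):
        parts = candidate.split()
        if parts:
            return parts[0]

    # Nothing usable was found.
    return None
```

Given the attributes of a tag like `<img src="data:image/gif;base64,..." data-src="https://example.com/a.jpg">`, the function would pick the `data-src` URL rather than the placeholder. Pages that only inject image URLs after a scroll event would need a different approach, such as rendering the page in a browser.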

Parsing is an ever-moving target, given the constantly evolving nature of the web. We're excited to continue pushing the state of the art forward!