We just shipped a slate of improvements to our article parsing engine, impacting over 100,000 pieces of saved content and resulting in a significantly more reliable reading experience.
Parsing is the dark art of turning messy webpages into clean articles. It's the technical foundation that powers Matter's distraction-free reader mode.
Over the past month, we've improved our parsing system in ways big and small. Below is a sample, in no particular order. For more detail, check out this Twitter thread.
- We now handle JavaScript and cookie requirements that previously broke parsing
- We now present subtitles for publishers like NYT, The Atlantic, WSJ, and The New Yorker
- We now handle Reddit posts
- We now handle LinkedIn posts
- We now parse "lazy loaded" images that require scrolling
- Misattributed publisher names have been systematically fixed
- Improved detection of publish date
- Improved Daring Fireball cosmetics
- We now handle multipage articles from MacStories
- Fixed an issue where italicized words in newsletters were sometimes concatenated
- GitHub repositories are now parsed and rendered based on their README files
- Improved handling of embedded Tweets in web articles
- Improved Tweet rendering for forwarded Substack newsletters
- Improved parsing for The Wall Street Journal (strip extraneous elements, don't drop images)
- Improved parsing for The Atlantic
- Substack footnotes no longer cause breaks
- Paginated Ars Technica articles are now properly parsed
- Posts from a16z are now properly parsed
- Fixed an issue where multiple nested figures caused images to be dropped
- Fixed an issue where the title was sometimes duplicated in the article body of newsletters
- Most exciting, we've developed a new protocol for ongoing improvement.
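
To give a flavor of what one of these fixes can look like, here is a minimal sketch of the duplicated-title case from the list above. This is not our production code: the function names `normalize` and `strip_duplicate_title`, and the idea of representing an article body as a list of text blocks, are assumptions made purely for illustration.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Fold case and Unicode forms, drop punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    return re.sub(r"\s+", " ", text).strip()


def strip_duplicate_title(title: str, blocks: list[str]) -> list[str]:
    """Return the body blocks without a leading block that just repeats the title.

    Hypothetical helper: assumes the body has already been split into text blocks.
    """
    if blocks and normalize(blocks[0]) == normalize(title):
        return blocks[1:]
    return list(blocks)


body = ["Parsing, Improved!", "Over the past month we shipped many fixes."]
cleaned = strip_duplicate_title("Parsing, improved", body)
print(cleaned)
```

Comparing normalized strings rather than raw ones means small differences in case or punctuation between the headline and the first body block don't hide the duplicate. A real parser would likely need fuzzier matching than this.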
Parsing is an ever-moving target, given the constantly evolving nature of the web. We're excited to continue pushing the state of the art forward!