Measuring and understanding the topology of the web

Headlessly downloading webpages is a common and useful step in many measurement projects. Such a basic task would seem to require little consideration. Indeed, most prior work of which we are aware chooses a relatively basic tool (like Selenium or Puppeteer) and assumes that downloading a page once yields all of its content—which may work well for static content, but not for dynamic webpages with third-party content.
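To make the single-load assumption concrete, the sketch below shows what such a basic crawl typically looks like with Puppeteer: load a page once in a headless browser and record every resource it requests. This is a minimal illustration, not the measurement code used in this work; it assumes the puppeteer npm package, and the loadOnce helper name is hypothetical.

    import puppeteer from "puppeteer";

    // Minimal single-load crawl: open a page once in headless Chrome
    // and record the URL of every resource the browser requests.
    // (Illustrative sketch; loadOnce is a hypothetical helper name.)
    async function loadOnce(url: string): Promise<Set<string>> {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();
      const resources = new Set<string>();
      page.on("request", (req) => resources.add(req.url()));
      // "networkidle0" waits until no network connections remain open,
      // a common (but imperfect) heuristic for "the page is done loading".
      await page.goto(url, { waitUntil: "networkidle0", timeout: 30_000 });
      await browser.close();
      return resources;
    }

The implicit assumption is that one such load yields everything the page can serve; the results summarized below show that it does not.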

This work empirically establishes sound methods for downloading webpages. We scan the Alexa top-10,000 most popular websites (and other, less popular sites) with different combinations of tools and reloading strategies. We find that even sophisticated tools (like Crawlium and ZBrowse) capture neither all of a page's resources nor all of its links on their own, and that even downloading a page dozens of times can miss a significant portion of its content. Investigating these differences, we find that, surprisingly, they are not strictly due to ephemeral content like ads. We conclude with recommendations for how future measurement efforts should download webpages, and what they should report.
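As a rough illustration of the reloading strategies being compared, the sketch below (again hypothetical, reusing the loadOnce helper from above) reloads a page n times and reports how many previously unseen resources each load contributes; the union across loads approximates a page's full resource set.

    // Reload a page n times and report how many previously unseen
    // resources each load contributes; the union over all loads
    // approximates the page's full resource set.
    // (Illustrative sketch; reuses the loadOnce helper sketched above.)
    async function unionAcrossLoads(url: string, n: number): Promise<Set<string>> {
      const seen = new Set<string>();
      for (let i = 1; i <= n; i++) {
        const resources = await loadOnce(url);
        const before = seen.size;
        for (const r of resources) seen.add(r);
        console.log(
          `load ${i}: ${resources.size} requested, ` +
          `${seen.size - before} new, ${seen.size} total`,
        );
      }
      return seen;
    }

If each reload keeps surfacing new resources, a single download (or even many) demonstrably undercounts the page's content.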

Code and data

Code and data are available in our GitHub repository.

Papers

Sound Methodology for Downloading Webpages (to appear)
Soumya Indela, Dave Levin
TMA 2021 (Network Traffic Measurement and Analysis Conference)

People

  • Soumya Indela
  • Dave Levin