Headlessly downloading webpages is a common and useful mechanism in many measurement projects. Such a basic task would seem to require little consideration. Indeed, most prior work of which we are aware chooses a relatively basic tool (like Selenium or Puppeteer) and assumes that downloading a page once yields all of its content—which may work well for static content, but not for dynamic webpages with third-party content.
This work empirically establishes sound methods for downloading webpages. We scan the Alexa top-10,000 most popular websites (and other, less popular sites) with different combinations of tools and reloading strategies. Surprisingly, we find that even sophisticated tools (like Crawlium and ZBrowse) do not capture all of a page's resources or links on their own, and that downloading a page even dozens of times can still miss a significant portion of its content. We investigate these differences and find that they are not strictly due to ephemeral content like ads. We conclude with recommendations for how future measurement efforts should download webpages and what they should report.
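To illustrate the kind of reloading strategy evaluated here, the following minimal sketch repeatedly loads a page with headless Chrome via Selenium and unions the resource URLs reported by the browser's Performance API across loads. This is illustrative rather than the exact crawling setup used in the paper; the target URL, reload count, and wait time are placeholder assumptions.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

def fetch_resource_urls(driver, url, settle_seconds=5):
    """Load `url` once and return the set of resource URLs the browser fetched."""
    driver.get(url)
    time.sleep(settle_seconds)  # crude wait so dynamic/third-party content can load
    entries = driver.execute_script(
        "return performance.getEntriesByType('resource').map(e => e.name);"
    )
    return set(entries) | {url}

options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

seen = set()
for i in range(10):  # reload count is an arbitrary choice for illustration
    urls = fetch_resource_urls(driver, "https://example.com")
    print(f"load {i + 1}: {len(urls)} resources, {len(urls - seen)} not seen before")
    seen |= urls
driver.quit()

print(f"distinct resources across all loads: {len(seen)}")
```

Comparing the per-load counts against the cumulative union gives a rough sense of how much content any single download misses.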
Soumya Indela and Dave Levin. Sound Methodology for Downloading Webpages. To appear in TMA 2021 (Network Traffic Measurement and Analysis Conference).