The New York Times tried to block a web crawler affiliated with the famous Internet Archive, a project whose easy-to-use comparisons of article versions have sometimes led to embarrassment for the newspaper.
In 2021, the New York Times added “ia_archiver” — a bot that, in the past, captured huge numbers of websites for the Internet Archive — to a list that instructs certain crawlers to stay out of its website.
Crawlers are programs that work as automated bots to trawl websites, collecting data and sending it back to a repository, a process known as scraping. Such bots power search engines and the Internet Archive’s Wayback Machine, a service that facilitates the archiving and viewing of historic versions of websites going back to 1996.
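To make that concrete, the following is a minimal sketch of a crawler's core task, written in Python with only the standard library; the URL is a placeholder, and a real crawler would add politeness measures such as rate limiting and the robots.txt checks discussed below:

    # Minimal crawler sketch: fetch one page, collect the links to visit next.
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href value of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    page = urlopen("https://example.com/").read().decode("utf-8", errors="replace")
    extractor = LinkExtractor()
    extractor.feed(page)
    print(extractor.links)  # the pages a crawler would fetch, and scrape, next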
The Internet Archive’s Wayback Machine has long been used to compare webpages as they are updated over time, clearly delineating the differences between two iterations of any given page. Several years ago, the archive added a feature called “Changes” that lets users compare two archived versions of a website from different dates or times on a single display. The tool can be used to uncover changes in news stories that have been made without any accompanying editorial notes, so-called stealth edits.
The Times has, in the past, faced public criticisms over some of its stealth edits. In a notorious 2016 incident, the paper revised an article about then-Democratic presidential candidate Sen. Bernie Sanders, I-Vt., so drastically after publication — changing the tone from one of praise to skepticism — that it came in for a round of opprobrium from other outlets as well as the Times’s own public editor. The blogger who first noticed the revisions and set off the firestorm demonstrated the changes by using the Wayback Machine.
More recently, the Times stealth-edited an article that originally listed “death” as one of six ways “you can still cancel your federal student loan debt.” Following the edit, the “death” section was retitled with the more opaque heading “debt won’t carry on.”
A service called NewsDiffs, which offers similar comparisons but focuses on news outlets such as the New York Times, CNN, and the Washington Post, has also chronicled a long list of significant stealth edits, though it appears not to have been updated in several years.
The New York Times declined to comment on why it is barring the ia_archiver bot from crawling its website.
Robots.txt Files
The mechanism that websites use to block certain crawlers is a robots.txt file. If website owners want to request that a particular search engine or other automated bot not scan their site, they can add the crawler’s name to the file, which they then upload to the site, where it is publicly accessible.
Based on a web standard known as the Robots Exclusion Protocol, a robots.txt file allows site owners to specify whether a bot may crawl part or all of their site. Though bots can simply ignore the file, many crawler services respect the requests.
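As a hypothetical illustration, an entry in the file pairs a crawler's name with the paths it is asked to avoid (the bot name and path below are invented for the example):

    User-agent: ExampleBot
    Disallow: /archive/

    User-agent: *
    Disallow:

Here ExampleBot is asked to stay out of everything under /archive/, while the wildcard entry, with its empty Disallow line, leaves the rest of the site open to every other crawler.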
The current robots.txt file on the New York Times’s website includes an instruction to disallow all site access to the ia_archiver bot.
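A compliant crawler consults those rules before each request. As a sketch of how that check works, Python's standard-library robots.txt parser can evaluate a full-site disallow of the kind just described; the two rule lines and the article URL are illustrative stand-ins, not the Times's actual file:

    # Sketch: evaluating robots.txt rules with Python's standard library.
    from urllib import robotparser

    # Illustrative stand-in for a full-site disallow aimed at one bot.
    rules = [
        "User-agent: ia_archiver",
        "Disallow: /",
    ]

    parser = robotparser.RobotFileParser()
    parser.parse(rules)

    # The named bot is barred from the whole site; an unlisted bot is not.
    print(parser.can_fetch("ia_archiver", "https://www.nytimes.com/some-article"))   # False
    print(parser.can_fetch("SomeOtherBot", "https://www.nytimes.com/some-article"))  # True

Nothing in that check physically stops a bot that skips it; the file records the site owner's wishes rather than enforcing them.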
The relationship between ia_archiver and the Internet Archive is not completely straightforward. While the Internet Archive crawls the web itself, it also receives data from other entities. For more than a decade, ia_archiver was a prolific supplier of website data to the archive.
The bot belonged to Alexa Internet, a web traffic analysis company co-founded by Brewster Kahle, who created the Internet Archive shortly after founding Alexa. Alexa Internet was acquired by Amazon in 1999 (its trademark name was later used for Amazon’s signature voice-activated assistant) and was eventually shut down in 2022.
Throughout its existence, Alexa Internet was intricately intertwined with the Internet Archive. From 1996 to the end of 2020, the Internet Archive received over 3 petabytes (more than 3,000 terabytes) of crawled website data from Alexa. Its role in helping to fill the archive with material led users to urge website owners not to block ia_archiver under the mistaken notion that the bot was unrelated to the Internet Archive.
As late as 2015, the Internet Archive offered instructions for preventing a site from being ingested into the Wayback Machine: using the site’s robots.txt file. News websites such as the Washington Post took full advantage of this and disallowed the ia_archiver bot.
By 2017, however, the Internet Archive announced its intention to stop abiding by the dictates of a site’s robots.txt file. It had already been disregarding the file for military and government sites, and the new policy extended that practice to all sites. Instead, website owners could make manual exclusion requests by email.
Reputation management firms, for their part, are keenly aware of the change. The New York Times, too, appears to have made use of the more selective manual exclusion process, as certain Times stories are not available via the Wayback Machine.
Some news sites, such as the Washington Post, have since removed ia_archiver from their lists of blocked crawlers. The New York Times went the other way: in 2021, even as other websites were dropping their ia_archiver blocks, it added one.