
NO (Mostly)! What Terms of Use For Major Websites Say About Whether Generative AI Training Is Allowed On Their Content

TL;DR: Generative AI builders are getting sued for (among other things) breaching website terms of use. We examined 43 relevant websites to see whether their terms of use might have been breached by generative AI training. A heavy majority explicitly or implicitly prohibit use for generative AI training.

Shrek movie - Welcome to Duloc - Here we have some rules

Generative AIs face significant legal risks. They were trained on vast amounts of data (text, images, code), often without permission from rights holders. Maybe courts will find this okay, maybe they won’t. (For >3k more words on this, check out our earlier (well-reviewed!) piece about legal risks faced by generative AI.) If courts find generative AIs offside, it will likely be because of:

  1. copyright infringement or
  2. breach of website terms of use

Claims for breach of website terms of use center on generative AIs having been trained on website content even though the websites' terms of use explicitly restrict how that content can be used. As such, these cases will depend heavily on what the specific terms of use actually say. For a breach of website terms of use claim to succeed, the following will have to be true:

  1. The website will need to have terms of use. Not all websites have terms of use.
  2. The terms of use will need to restrict using the website in ways consistent with how generative AIs were trained.¹

In this piece, we look at the terms of use for 43 websites that we suspect were among the main (inadvertent) contributors to generative AI training sets, to get a rough sense of how these claims might work out. Nearly all have terms of use, and a heavy majority of these restrict using their content to train a generative AI.

Here’s how this (long) piece is laid out:

  • Quickly covers what website terms of use are, and explains how they might restrict generative AI training.
    • Includes a widget where you can upload terms of use we didn’t cover and uses AI to show you relevant information.
  • Has a summary chart detailing findings.
  • Then presents relevant sections from the reviewed terms of use in greater detail.
  • Finishes by covering some relevant background information, like whether browsewrap contracts are valid.

Feel free to jump around. If you skip or skim the details of the underlying terms of use, you should be able to make it through the piece pretty quickly.

What Are Website Terms Of Use And How Might They Restrict Generative AI Training?

New York Times website footer
Insider website footer

Website terms of use set out the terms and conditions under which visitors can use a website. For example, a news website might generally say “Subscribers are welcome to come visit our website to read the news for personal use, but they can’t copy content from here and re-post it elsewhere.” Website terms of use might allow crawling for search engine link purposes, but might not allow use of their content for other purposes beyond personal use. Importantly for our purposes, terms of use might explicitly or implicitly restrict using website content as generative AI training data (though explicit AI-training restrictions are probably new).

Like to see generative AI training restrictions in a contract that’s more relevant to you than the ones we covered below? Upload your contract to the widget below, and our AI will automatically find these provisions in it for you.


What Website Terms of Use Say About Being Used For Things Like Generative AI Training

Summary

Below is the summary of the information that we found. All websites are listed under the category or categories that apply to their terms of use. Websites may be listed more than once, if multiple restrictions apply. Basic takeaway: nearly all top websites restrict activities that are integral to training generative AI systems.

Importantly, many of the specific restrictions that prevent generative AI training likely have been in place for a while. We suspect that explicit bans on using data for AI training (such as are now in place at The Economist or the Financial Times) are very new (and were likely not in place when a bunch of generative AI training data was initially scraped). However, most websites have layers of protection, and our best guess is that (1) “Personal Use Only” restrictions, (2) “No Commercial Harvesting” clauses, and (3) prohibitions on modifying or creating derivative works have long been in place. Illustrating this belt-and-suspenders approach, in addition to having an explicit ban on AI training, both The Economist and the Financial Times have (1), (2), and (3).

Summary of Website Terms - Bar graph

  • Explicitly Prevents AI Training: BBC, Condé Nast, The Economist, Financial Times, Scribd
  • Personal Use Only: The Atlantic, BBC, Booking.com, CBC, Chicago Tribune, Condé Nast, Coursera, The Economist, Financial Times, Forbes, Getty Images, The Guardian, Hearst, Insider, Jalopnik, Kickstarter, Los Angeles Times, The Motley Fool, PBS, Quartz, Reddit, Stack Overflow*, TechCrunch, Trip Advisor, The Washington Post, WebMD, WeddingWire, Yelp
  • No Commercial Harvesting: The Atlantic*, Booking.com, Chicago Tribune, Condé Nast, The Economist, Financial Times, Forbes, Getty Images, The Guardian, Hearst, HuffPost, Insider, Kickstarter, Law Insider, Los Angeles Times, Mayo Clinic, Medium, The Motley Fool, The New York Times*, Reddit*, Scribd, Shutterstock, Stack Overflow*, Substack, Trip Advisor, The Washington Post, WeddingWire, Yelp
  • Prohibits Modifying or Creating Derivative Works: The Atlantic, CBC, Chicago Tribune, The Economist, Financial Times, Hearst, HuffPost, Insider, Instructables, Kickstarter, Law Insider, Los Angeles Times, The Motley Fool, Law Insider, Shutterstock, Stack Overflow, TechCrunch, Trip Advisor, The Washington Post, WeddingWire, Yelp
  • No Restriction: Frontiers, IPFS, PLOS, SEC EDGAR*, Wikipedia
  • No Terms of Use: MacRumors

*While you can find more information about all websites below, in “The Details”, these particular websites include conditions or caveats as to whether they fall under the restriction.


Summary Chart of Website Terms of Use


Website | AI Training Allowed or Restricted | Details
The Atlantic | X | Prohibits obtaining, copying, monitoring, indexing or data mining through the use of a robot, spider, any automated device, or any manual process
BBC | X | Expressly prohibits use for developing or training AI, and prohibits plucking metadata from content or feeds
Booking.com | X | Expressly prohibits monitoring, copying, scraping/crawling, downloading, reproduction, or other use of the platform for any commercial purpose
CBC | X | Prohibits uses for anything other than for private purposes, except by agreement and prohibits uses of digital services for business or commercial purposes
Chicago Tribune | X | Expressly prohibits republishing content or incorporating content in any database, compilation, archive, cache or similar. Also expressly prohibits scraping or copying content, including prohibiting data mining, data gathering or extraction
Condé Nast | X | Expressly prohibits use for training & operating LLMs/AI
Coursera | X | Prohibits use for commercial purposes or for any other purpose than to complete online courses or for pedagogical purposes
The Economist | X | Expressly prohibits copying, collecting, scraping; prohibits data mining or application or training of ML or AI tools or models
Financial Times | X | Expressly prohibits use of FT content for any ML/AI purposes; also prohibits harvesting and keeping data in a database
Forbes | X | Prohibits data mining, robot, spider, cancelbot, Trojan horse, or any data gathering, scraping, indexing, or extraction method
Frontiers | ✔️ | Text and data mining expressly allowed
Getty Images | X | Expressly prohibits downloading or copying content without a license, using any data mining, robots or similar data gathering or extraction methods, and in any way commercializing the site/content
The Guardian | X | Scraping, reproducing, copying, altering, collecting and/or extracting content for text/data aggregation, analysis, mining or commercial purposes is prohibited
Hearst | X | Prohibits using any software robots, spider, crawlers, or other data gathering or extraction tools, to access, acquire, copy, monitor, scrape or aggregate content. Prohibits making commercial use of any portion of the site or content
HuffPost | X | Expressly prohibits any process to crawl or spider, or harvest or scrape any content
Insider | X | Expressly prohibits use of the sites to monitor, scrape, index, or otherwise copy any of the material on the Sites by means of any robot, spider, or other automatic device, process, or means, and prohibits accessing material for anything other than personal, non-commercial use
Instructables | X | Prohibits modification, translation, adapting, arranging, or creating derivative works
IPFS | ✔️ | Encourages taking, reusing, re-purposing, remixing content and documentation
Jalopnik | X | Restricts use to personal, noncommercial use only
Kickstarter | X | Expressly prohibits use of any software or device to crawl or spider any part of the site
Law Insider | X | Expressly prohibits scraping the services for any purpose
Los Angeles Times | X | Content may not be scraped or otherwise copied, and data mining, data gathering or extraction is prohibited
MacRumors | ✔️ | No terms of use
Mayo Clinic | X | Expressly prohibits use of any scraper, crawler, spider, robot or other automated means of any kind to access or copy data on the site
Medium | X | Expressly prohibits use of any software, script, robot, spider or other automatic device, process or means to access the services for any purpose, including to scrape or copy any of the data or content
The Motley Fool | X | Limits use to individual, non-commercial use only, and prohibits use of any automated means to access, monitor, copy or harvest data from the site
The New York Times | X | Automated data mining or scraping prohibited, as are caching and archiving (except for search engines creating search indexes)
PBS | X | Cannot use information for commercial purposes
PLOS | ✔️ | PLOS articles may be mined, reused, and shared by anyone, anywhere, for any purpose
Quartz | X | Restricts use to personal, noncommercial use only
Reddit | X | Expressly prohibits accessing, searching or collecting data from the services by any means, including scraping (although Reddit grants conditional permission to crawl the services in accordance with its robots.txt file)
Scribd | X | Expressly prohibits use of any robot, spider, scraper or other automated means to access, copy, print, store, transfer or share any content. Expressly prohibits training a large language model
SEC EDGAR | ✔️ | Information presented is considered public information and may be copied or further distributed by users of the web site without permission. Does not allow botnets or automated tools to crawl the site
Shutterstock | X | Expressly prohibits use of data mining, robots or similar data and/or image gathering and extraction methods
Stack Overflow | ~ | Exploitation of any content, software, materials or services is prohibited; downloading, copying or storing of any content for other than personal noncommercial use is prohibited. But content is sometimes made available in CC BY licensed data dumps. Possibly okay to use content made available via the Stack Overflow API
Substack | X | Expressly prohibits use that crawls, scrapes or spiders any page, data or portion of the site, or copying or storing any significant portion of the content
TechCrunch | X | Prohibits use to reproduce, modify, create derivative works based on, or exploit for any commercial purposes
Trip Advisor | X | Expressly prohibits use of any robot, spider, scraper or other automated means or manual process for any purpose, or incorporating any part of the services into other websites or services
The Washington Post | X | Prohibits exploitation of the services except for personal use. Prohibits use of the services to construct a database, unauthorized scraping, spidering, or harvesting of personal information or use of unauthorized automated means to compile information, or use of any engine, software, tool, agent or other device or mechanism to navigate or search the services
WebMD | X | Restricts authorized use of the site to viewing or downloading a single copy of the content for personal, noncommercial use
WeddingWire | X | Cannot harvest data using an automated software tool or manually. Can only use for non-commercial use. Cannot copy or create derivative works without authorization
Wikipedia | ✔️ | Requires that all users contributing to the projects or websites grant broad permissions to the general public to redistribute and reuse their contributions freely
Yelp | X | Expressly prohibits using the service for commercial purposes, and prohibits use of any robot, spider, service search/retrieval application or other automated device, process or means to access, retrieve, copy, scrape, or index any portion of the service/content
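
Several of the terms summarized here turn on automated crawling, and Reddit's terms tie their conditional crawl permission to the site's robots.txt file. For readers who haven't seen how a crawler might consult robots.txt, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the bot names (`ExampleSearchBot`, `ExampleTrainingBot`), and the URL are all hypothetical, invented for illustration; they are not taken from any website covered in this piece. Note also that passing a robots.txt check is not the same thing as complying with a website's terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for this example.
# It lets one (imaginary) search crawler in and turns every other bot away.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some/page"  # placeholder URL
    print(is_allowed(EXAMPLE_ROBOTS_TXT, "ExampleSearchBot", page))    # True
    print(is_allowed(EXAMPLE_ROBOTS_TXT, "ExampleTrainingBot", page))  # False
```

A check like this only tells a crawler what the robots.txt file says. Whether a given use (search indexing versus AI training, say) is permitted still depends on the terms of use themselves.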

The Details

The Atlantic

Version: April 4, 2023

Relevant Restrictions:

Access
  • The Sites are provided solely for your personal, non-commercial use. You may not develop or derive for commercial sale any data in machine-readable or other form that incorporates or uses any substantial part of the Sites nor transfer to or store any data residing or exchanged over the Sites in any electronic network for use by more than one user unless you obtain prior written permission from The Atlantic. Specifically, unless explicitly authorized in these Terms & Conditions or by the owner of the materials, you may not modify, copy, reproduce, republish, upload, post, transmit, translate, sell, create derivative works, exploit, or distribute in any manner or medium any material from the Sites.
Intellectual Property
  • Any use for a commercial or public purpose requires specific written permission from The Atlantic.

Acceptable Use Policy

You are solely responsible for any and all acts and omissions that occur during or relating to your use of the Sites, and you agree not to engage in unacceptable use of the Sites, which includes, without limitation, use of the Sites to:

  • Obtain, copy, monitor, index or data mine through the use of a robot, spider, any automated device, or any manual process, the Sites or the contents (except as expressly permitted by The Atlantic);

BBC

Version: v1.8, September 19, 2022

Relevant Restrictions:

  1. Terms for using our services and content

    d. Don’t pretend to be the BBC

    Except at fancy dress parties. That includes… Making money from our content or services

  2. Using BBC content

    a. When you need permission

    To use any of the following things… Anything plucked from our services to develop or train artificial intelligence or to do computer analysis

  3. Metadata and RSS feeds

    a. For people

    You’re not allowed to pluck metadata from our content or RSS feeds.

    b. For business

    You’ll need a licence to use our metadata (such as images, text, media and the links to them). Apply for a metadata licence.


Booking.com

Version: February 14, 2022

Relevant Restrictions:

A14. Intellectual property rights
  1. You’re not allowed to monitor, copy, scrape/crawl, download, reproduce, or otherwise use anything on our Platform for any commercial purpose without written permission of Booking.com or its licensors.

CBC

Version: January 2022

Relevant Restrictions:

  1. CBC/Radio-Canada Content

    b) Do these terms of use apply to news feeds (RSS) and podcasts?

    Yes. These terms also apply to the use of CBC/Radio-Canada news feeds. Any use other than for private purposes must be subject to an agreement with CBC/Radio-Canada specifying the conditions for use with due regard for the integrity of the content. You agree not to frame the news feed or its content, nor to use similar means to generate unauthorized benefits.

    d) What rights do I have to software and applications made available for my use by CBC/Radio-Canada digital services?

    Only for navigating. Some web activities require the use of software or applications—e.g., digital markers including cookies—specifically developed for such purpose. When such software and applications are made available to you, you have the right to download and use them for the sole purpose as intended by CBC/Radio-Canada, but you have no other rights to reproduce, modify or adapt them for any purposes whatsoever.

  2. Conducting business on CBC/Radio-Canada digital services

    a) May I use CBC/Radio-Canada digital services for business or commercial purposes?

    Only if specifically authorized by CBC/Radio-Canada. You may not use any CBC/Radio-Canada digital services for business or commercial purposes without prior written permission. You may, however, provide a link to a CBC/Radio-Canada digital page; this will ensure full, unmodified communication of content and respect the rights of any third parties involved.


Chicago Tribune

Version: February 1, 2019

Relevant Restrictions:

Copyright… You may use the Content online only, and solely for your personal, non-commercial use, and you may download or print a single copy of any portion of the Content solely for your personal, non-commercial use, provided you do not remove any trademark, copyright or other notice from such Content. If you operate a Web site and wish to link to the Site, you may do so provided you agree to cease such link upon request from us. No other use is permitted without prior written permission of Tribune Publishing…

Except where explicitly provided for herein or on the Site, you may not republish any portion of the Content on any Internet, Intranet, extranet site or any other online or offline publication, or incorporate the Content in any database, compilation, archive, cache, or similar medium. You may not distribute any Content to others, whether or not for payment or other consideration, and you may not archive, modify, copy, frame, cache, reproduce, sell, publish, transmit, display or otherwise use any portion of the Content. You may not scrape or otherwise copy our Content without our permission. You agree not to decompile, reverse engineer or disassemble any software or other products or processes accessible through the Site nor to insert any code or product or manipulate the Content or the Site in any way, and not to use any data mining, data gathering or extraction method.


Condé Nast

Version: April 27, 2023

Relevant Restrictions:

V. A. 4. Unless otherwise specified, the Service is intended for your personal, non-commercial use only. You may not access and/or store the Service or any of its Content except for personal, noncommercial use.

V. B. 1. Prohibitions on Use of the Service

Absent explicit prior written consent in certain situations, you may not, nor may you allow, enable, authorize, instruct, encourage, assist, suggest, inform, or promote that others, directly or indirectly, do any of the following for any reason:

  • copy, harvest, crawl, index, scrape, spider, mine, gather, extract, compile, obtain, aggregate, capture, access, store, or republish any Content on or through the Service, including by an automated or manual process or otherwise, for any and all purposes other than indexing Content for inclusion in a Search Engine, including but not limited to any purpose related to data mining and/or the training or operation of any software or service to the extent that it incorporates a large language model, foundation model, deep machine learning, generative artificial intelligence, or any other process of a nature commonly referred to as artificial intelligence;

Coursera

Version: January 1, 2023

Relevant Restrictions:

  1. Using Coursera

    Commercial Use

    Any use of our Services for commercial purposes is strictly prohibited…

    Acceptable Use Policy

  2. You also aren’t allowed to:

    Use our Services or any functionality of the Coursera platform for anything other than for completing online courses or for pedagogical purposes.


The Economist

Version: March 31, 2023

Relevant Restrictions:

Use of Economist Content

All Economist Content is strictly for personal, non-commercial use only…

Except as expressly permitted above, you may not reproduce, modify or in any way commercially exploit any Economist Content. In particular, but without limiting the general application of the restrictions in the previous sentence, you may not do any of the following without prior written permission from The Economist Group…

  • modify, publish, transmit, participate in the transfer or sale of, reproduce, create derivative works from, distribute, perform, display, or in any way exploit all or any part of the Economist Content (including as part of any library, archive or similar service);
  • use or permit the use, whether by automated means or otherwise, of any software, tool or other device (including, but not limited to robots, crawlers, spiders or scripts) on any Sites or Digital Applications, or otherwise on Economist Content, in order to copy, collect or scrape on the Sites, Digital Applications or Economist Content (other than any such use by a public search engine for the sole purpose of providing direct, non-amalgamated links to Economist Content that do not include any generative or derivative works based on or including Economist Content);
  • conduct data mining on, apply machine learning tools or models to, or train machine learning tools or models or any other artificial intelligence technology on, any Sites, Digital Applications or Economist Content; or
  • use artificial intelligence tools or models for the purposes of generating text, images or any other material, output or derivative works based on or using Economist Content, whether or not in the same or similar style as the Economist Content.

Financial Times

Version: Version date unclear. Accessed June 25, 2023

Relevant Restrictions:

3.5. To the fullest extent permitted by law, we expressly prohibit any use of our content or data (including any associated metadata) in any manner for any machine learning and/or artificial intelligence purposes, including without limitation for the purposes of training or development of artificial intelligence technologies or tools or machine learning language models, or otherwise for the purposes of using or in connection with the use of such technologies, tools or models to generate any data or content and/or to synthesise or combine with any other data or content. We reserve all rights to license any use of our content and data for any such purposes.

Copyright Policy

How may I use FT content?

You may do the following:

  • View our content for your personal use on any device that is compatible with FT.com (this might be your PC, laptop, smartphone, tablet or other mobile device) and store our content on that device for your personal use;

What am I not permitted to do with FT content? You cannot do anything other than make use of the content as set out above, unless you buy the appropriate licence (see below for details). By way of example only, this means that you cannot:

  • Copy, publish or redistribute full text articles, photographs, graphics, tables or images in any way (except as permitted by any sharing tools we make available).
  • Create derivative works from our content, unless you are creating summaries as described above.
  • Photocopy or scan copies of articles.
  • Remove the copyright or trade mark notice from any copies of FT content.
  • Use spidering technology or other datamining technologies to search and link to FT.com.
  • Create a database in electronic or structured manual form by systematically and/or regularly downloading, caching, printing and storing all or any FT content (by spidering or otherwise).
  • Use any of our content or data (including any associated metadata) in any manner for any machine learning and/or artificial intelligence purposes, including without limitation for the purposes of training or development of artificial intelligence technologies or tools or machine learning language models, or otherwise for the purposes of using or in connection with the use of such technologies, tools or models to generate any data or content and/or to synthesise or combine with any other data or content.
  • Frame, harvest or scrape FT content or otherwise access FT content for similar purposes.
  • Use or attempt to use FT content outside the parameters we set depending on what subscription you have.

Forbes

Version: April 5, 2023

Relevant Restrictions:

1.3 You may use the Website, Forbes Channels, and Content online and solely for personal, non-commercial, and informational/entertainment use, and you may download or print a single copy of any downloadable portion of the Content, where permitted, for your personal, non-commercial, and informational use, provided you do not remove any trademark, copyright or other notice contained in such Content. No other use is permitted without securing the prior written consent of Forbes.

2.1: By accessing or using the Website and/or Forbes Channels, including any Content, you agree to use them only as expressly permitted by these Terms. Unless you have Forbes’ prior written permission, you shall not:

  • Use any data mining, robot, spider, cancelbot, Trojan horse, or any data gathering, scraping, indexing, or extraction method on any part of the Website or Forbes Channels;

Frontiers

Version: Version date unclear. Accessed June 25, 2023

Relevant Restrictions:

  1. You may benefit from the CC-BY licence (or other licences as may be indicated for specific Journals or articles) over articles and other content, and from a CC0 licence for certain article metadata, all as described elsewhere in these Conditions.

  2. Text and Data Mining; Bulk Downloading

Text and data mining of public content are permitted for the legitimate purpose of enrichment of external collections of scientific literature. However, we may need to restrict this activity if it hinders the performance of one or more Websites.

Text and data mined, and content downloaded in bulk or individually from any Website, remain subject to the relevant licence (CC-BY, CC0 or other relevant licence), and all attribution and other conditions of such licences must be complied with. To the extent that any such text or data constitutes or includes Personal Data, all applicable data protection laws must also be complied with. Further information on text and data mining and bulk content downloads for research purposes and for repositories and databases can be found here [non-working link].


Getty Images

Version: August 2022

Relevant Restrictions:

Use of the Site: The Site and the Getty Images Content are intended for customers of Getty Images. You may not use the Site or the Getty Images Content for any purpose not related to your business with Getty Images. You are specifically prohibited from: (a) downloading, copying or re-transmitting any or all of the Site or the Getty Images Content without, or in violation of, a written licence or agreement with Getty Images; (b) using any data mining, robots or similar data gathering or extraction methods; … (g) selling, licensing, leasing or in any way commercialising the Site or the Getty Images Content without specific written authorisation from Getty Images; and (h) using the Site or the Getty Images Content other than for its intended purpose…


The Guardian

Version: March 10, 2023

Relevant Restrictions:

  1. Use of material appearing on the Guardian Site

Your use of the Guardian Site and Guardian Content is for your own personal and non-commercial use only. You may download and print extracts from the Guardian Content for your own personal and non-commercial use only, provided you maintain and abide by any author attribution, copyright or trademark notice or restriction in any material that you download or print.

You shall not use (and you shall also not facilitate, authorise or permit the use of) the Guardian Site and/or any Guardian Content for any other purpose without our prior written approval - this includes, without limitation, any scraping of the Guardian Content or reproduction, copying, alteration, collection and/or extraction of the Guardian Content, in each case, for text and data aggregation, analysis or mining purposes or for any commercial use.

Other than as expressly set out in these terms and conditions, or otherwise approved by the Guardian, you shall also not use, or facilitate, authorise or permit the use of, any “robot”, “bot”, “spider”, “scraper”, or other automated device, program, technique, tool, process or method, on or in relation to the Guardian Site and/or the Guardian Content for any purpose (such as, but not limited to, gathering or extraction).


Hearst

Version: September 27, 2019

Relevant Restrictions:

  1. Intellectual and Other Proprietary Rights
  • You acknowledge Hearst’s valid intellectual and proprietary property rights in the Site and content and that your use of the Site is limited to accessing, viewing and downloading of the Site and content, as authorized by Hearst…
  • You may not either directly or through the use of a Device or other means copy, download, stream, reproduce, duplicate, archive, distribute, upload, publish, modify, translate, broadcast, perform, display, sell, transmit or retransmit the Site or content unless expressly permitted by Hearst in writing. You may not incorporate the Site or content into, or stream or retransmit the Site or content via, any hardware or software application or make the Site or any content available via frames or in-line links, and you may not otherwise surround or obfuscate the Site or content with any third-party content, materials or branding. You may also not use any software robots, spider, crawlers, or other data gathering or extraction tools, whether automated or manual, to access, acquire, copy, monitor, scrape or aggregate the Site, content or any portion thereof…
  • You may not build a business, in whole or in part, resell, redistribute, recirculate or make any other commercial use of, or create derivative works or materials utilizing any portion of the Site (including any code used in any software) or content, whether or not for profit.

HuffPost

Version: February 16, 2021

Relevant Restrictions:

Rules of Conduct

You shall not:

(v) use manual or automated software, devices, or other processes to “crawl” or “spider” any page of the Site; (vi) harvest or scrape any Content from the Services;

You shall not (directly or indirectly):

(ii) modify, translate, or otherwise create derivative works of any part of the Services;


Insider

Version: February 22, 2023

Relevant Restrictions:

  1. INTELLECTUAL PROPERTY RIGHTS

You may access the material on the Sites only for your own personal, non-commercial use. You must not reproduce, distribute, modify, create derivative works of, publicly display, publicly perform, republish, download, store or transmit any of the material on our Sites, except as incidental to normal web browsing, such as the making of temporary copies in RAM or the cache of your Internet browser, and for features of the Sites that enable sharing via e-mail, social media, linking, and other platforms expressly enabled by the Sites.

  1. PROHIBITED USES OF THE SITES

You may use the Sites only for lawful purposes and in accordance with these Terms of Service. You agree not to use the Sites:

To monitor, scrape, index, or otherwise copy any of the material on the Sites by means of any robot, spider, or other automatic device, process, or means, regardless of whether such use may be considered a fair use under United States copyright law.


Instructables

Version: June 5, 2013

Relevant Restrictions:

  1. Your Right to Access or Use the Service. In accessing or using the Service, you agree … not to (or permit anyone else to) do or attempt any of the following…
  • modify, translate, adapt, arrange, or create derivative works of the Service, except as permitted in these Terms;
  • use the Service, any feature thereof or any Content in a way that could or does violate any law or the rights (including without limitation, the copyright, trademark, patent, trade secret other intellectual property, proprietary or other rights) of any person, firm or entity or expose us, any users or any of Our Parties to legal liability;

IPFS

Version: Unclear, accessed June 27, 2023

Relevant Restrictions:

We are pleased to license much of the content and documentation available on our sites under terms that explicitly encourage people to take, modify, reuse, re-purpose, and remix our work as they see fit.

You will find the following notice at the bottom of many pages on the Google Developers website: Except as noted, content licensed CC-BY 3.0, code licensed MIT.

When you see a page with this notice you are free to use nearly everything on the page in your own creations. That’s what open content licenses are all about. We just ask that you give us attribution when you reuse our work.

What is not licensed?

Trademarks and other brand features are not included in this license.

In some cases, a page may include content consisting of images, audio or video material, or a link to content on a different webpage (such as videos or slide decks). This content is not covered by the license, unless specifically noted.

Attribution Proper attribution is required when you reuse or create modified versions of content that appears on a page made available under the terms of the Creative Commons Attribution license. The complete requirements for attribution can be found in section 4 of the Creative Commons legal code.

In practice, we ask that you provide attribution to Protocol Labs to the best of the ability of the medium in which you are producing the work.


Jalopnik

Version: March 29, 2023

Relevant Restrictions:

Note: Same terms cover The A.V. Club, Deadspin, Gizmodo, The Inventory, Jalopnik, Jezebel, Kotaku, Quartz, The Onion, The Root and The Takeout.

  1. Licensing Agreements

Site License:

Subject to your acceptance of these Terms, G/O grants you a non-exclusive, limited, non-transferable, freely revocable license to use the Sites for your personal, noncommercial (i.e. you may not use the Sites to provide or serve or permit others to provide or serve ads or contests or sweepstakes) use only and as permitted by the features of the Sites.


Kickstarter

Version: November 30, 2022

Relevant Restrictions:

  1. Things you Definitely Shouldn’t Do
  • Don’t use any kind of software or device (whether it’s manual or automated) to “crawl” or “spider” any part of the Site
  1. Kickstarter’s Intellectual Property

Kickstarter grants you a license to reproduce content from the Services for personal use only. This license covers both Kickstarter’s own protected content and user-generated content on the Site. (This license is worldwide, non-exclusive, non-sublicensable, and non-transferable.) If you want to use, reproduce, modify, distribute, or store any of this content for a commercial purpose, you need prior written permission from Kickstarter or the relevant copyright holder. A “commercial purpose” means you intend to use, sell, license, rent, or otherwise exploit content for commercial use, in any way


Law Insider

Version: September 1, 2021

Relevant Restrictions:

Prohibited Uses.
  • You agree not to, and will not assist, encourage, or enable others to use the Services: To access or copy in bulk, retrieve, harvest, or index any portion of the Services (“Scrape”) or use, support, or develop any robot, spider, scripts, or other automatic device, process, or means (such as crawlers, browser plug-ins and add-ons, or other technology) to Scrape the Services for any purpose. IF YOU SCRAPE THE SERVICES OR ANY PORTION THEREOF, WE MAY SEEK LEGAL ACTION AGAINST YOU, INCLUDING SENDING NOTICE LETTERS TO YOU AND YOUR CUSTOMERS OR END USERS THAT YOU ARE UNLAWFULLY DISTRIBUTING DATA OBTAINED FROM THE SERVICES IN VIOLATION OF THESE TERMS.

  • To modify, adapt, appropriate, reproduce, distribute, translate, create derivative works or adaptations of, publicly display, sell, trade, or in any way exploit Company IP, except as expressly authorized by Company.


Los Angeles Times

Version: January 31, 2022

Relevant Restrictions:

Copyright… You may use the Content online only, and solely for your personal, non-commercial use, and you may download or print a single copy of any portion of the Content solely for your personal, non-commercial use, provided you do not remove any trademark, copyright or other notice from such Content…

Except where explicitly provided for herein or on the Site, you may not republish any portion of the Content on any Internet, Intranet, extranet site or any other online or offline publication, or incorporate the Content in any database, compilation, archive, cache, or similar medium. You may not distribute any Content to others, whether or not for payment or other consideration, and you may not archive, modify, copy, frame, cache, reproduce, sell, publish, transmit, display or otherwise use any portion of the Content. You may not scrape or otherwise copy our Content without our permission. You agree not to decompile, reverse engineer or disassemble any software or other products or processes accessible through the Site nor to insert any code or product or manipulate the Content or the Site in any way, and not to use any data mining, data gathering or extraction method.


MacRumors

Version: Visited July 18, 2023

Relevant Restrictions: No terms of use evident for the site.


Mayo Clinic

Version: January 27, 2022

Relevant Restrictions:

Acceptable use

You agree that you will not:

  • Use any scraper, crawler, spider, robot, or other automated means of any kind to access or copy data on the Site, deep-link to any feature or content on the Site, or bypass our robot exclusion headers or other measures we may use to prevent or restrict access to the Site.

Medium

Version: September 1, 2020; See also Rules (May 26, 2021)

Relevant Restrictions:

Rules

Prohibitions on Use of the Services

You agree not to do, try to do, or cause a third party to do any of the following, except without the express written consent of Medium:

(5) use any software, script, robot, spider or other automatic device, process or means (including crawlers, browser plugins and add-ons or any other technology) to access the Services for any purpose, including without limitation to scrape or otherwise copy any of the data or content on the Services;

(6) use any manual process to monitor or copy any of the data or content on the Services, or to engage in any other unauthorized purpose;


The Motley Fool

Version: Unclear, accessed June 27, 2023

Relevant Restrictions:

Restriction on Re-use

Our Content is intended for individual, non-commercial use only. Accordingly, The Motley Fool grants you a limited, non-exclusive and non-transferable licence to download, use and display the Content for your personal and non-commercial use only, provided that the material remains intact including without limitation all copyright and proprietary notices. Except as expressly provided herein, you agree not to modify, reproduce, make derivative works of, retransmit, sell, publish, communicate, broadcast or otherwise make available any of the Content by any means including without limitation by caching, framing, or deep-linking without the prior written consent of The Motley Fool.

Accessing and Using Fool.ca

You agree not to access this website by any other means other than through the interfaces we provide for use. You further agree not to use any automated means, including without limitation, agents, robots, scripts, or spiders, to access, monitor, copy or harvest data from any part of our website without our prior consent.


PBS

Version: Visited July 18, 2023

Relevant Restrictions:

  1. Personal Uses Permitted. You shall not post, publish, transmit, reproduce, distribute or in any way use or exploit any Information for commercial purposes or otherwise use the Information in a manner that is inconsistent with these rules and regulations.

  2. Podcasts. PBS-hosted podcast(s) are available for personal, noncommercial use only…


The New York Times

Version: December 19, 2022

Relevant Restrictions:

4. PROHIBITED USE OF THE SERVICES

Without NYT’s prior written consent, you shall not:

(ii) use robots, spiders, scripts, service, software or any manual or automatic device, tool, or process designed to data mine or scrape the Content, data or information from the Services, or otherwise access or collect the Content, data or information from the Services using automated means;

(iv) cache or archive the Content (except for a public search engine’s use of spiders for creating search indices);


PLOS

Version: February 8, 2023. See also Text & Data Mining

Relevant Restrictions:

PLOS articles may be mined, reused, and shared by anyone, anywhere, for any purpose


Quartz

Version: March 29, 2023

Relevant Restrictions:

Note: Same terms cover The A.V. Club, Deadspin, Gizmodo, The Inventory, Jalopnik, Jezebel, Kotaku, Quartz, The Onion, The Root and The Takeout.

  1. Licensing Agreements

Site License:

Subject to your acceptance of these Terms, G/O grants you a non-exclusive, limited, non-transferable, freely revocable license to use the Sites for your personal, noncommercial (i.e. you may not use the Sites to provide or serve or permit others to provide or serve ads or contests or sweepstakes) use only and as permitted by the features of the Sites.


Reddit

Version: June 19, 2023 (if you live outside the EEA, United Kingdom, or Switzerland)

Relevant Restrictions:

3. Your Use of the Services

Except and solely to the extent such a restriction is impermissible under applicable law, you may not, without our written agreement:

  • license, sell, transfer, assign, distribute, host, or otherwise commercially exploit the Services or Content;
  • modify, prepare derivative works of, disassemble, decompile, or reverse engineer any part of the Services or Content; or
  • access the Services or Content in order to build a similar or competitive website, product, or service, except as permitted under any Additional Terms (as defined below).

7. Things You Cannot Do

In addition to what is prohibited in the Content Policy, you may not do any of the following:

  • Access, search, or collect data from the Services by any means (automated or otherwise) except as permitted in these Terms or in a separate agreement with Reddit (we conditionally grant permission to crawl the Services in accordance with the parameters set forth in our robots.txt file, but scraping the Services without Reddit’s prior written consent is prohibited);

Scribd

Version: November 13, 2020

Relevant Restrictions:

9. Prohibited Conduct
BY USING SCRIBD YOU AGREE NOT TO:

9.1 use Scribd for any purpose other than to receive original or appropriately licensed content, to add Descriptive Information, and/or to access Scribd as such Services are offered by Scribd (“Descriptive Information” refers to the corresponding title and description of User Content posted by Users along with such content);

9.2 rent, lease, loan, sell, resell, sublicense, distribute, display or otherwise transfer the licenses granted herein or any Materials (as defined in section 13, below);

9.15 use any robot, spider, scraper, or other automated means to access Scribd, or copy, print, access, store, transfer, or share any content accessible through Scribd, for any purpose or to bypass any measures Scribd may use to prevent or restrict access, or the ability to copy, print, access, store, transfer, or share content;

9.20 use any portion of the content on Scribd for the purposes of training a large language model.

SEC EDGAR

Version: October 17, 2022, see also Privacy Information (June 7, 2023 version)

Relevant Restrictions:

Fair Access

To ensure everyone has equitable access to SEC EDGAR content, please use efficient scripting. Download only what you need and please moderate requests to minimize server load.

SEC reserves the right to limit request rates to preserve fair access for all users. See our Internet Security Policy for our current rate request limit.

The SEC does not allow botnets or automated tools to crawl the site. Any request that has been identified as part of a botnet or an automated tool outside of the acceptable policy will be managed to ensure fair access for all users.

Privacy Information

Website Dissemination

Information presented on www.sec.gov is considered public information and may be copied or further distributed by users of the web site without the SEC’s permission. Please consider appropriate citation to the SEC as the source.


Shutterstock

Version: Unclear, accessed June 27, 2023

Relevant Restrictions:

2.2 All content on this Site, including but not limited to Images, Footage, Music, and related metadata (collectively the “Shutterstock Content”), as well as the selection and arrangement of the Shutterstock Content, are protected by copyright, trademark, patent, trade secret and other intellectual property laws and treaties. Any unauthorized use of any Shutterstock Content violates such laws and this Terms of Use. Except as expressly provided herein or in a separate license agreement between you and Shutterstock, Shutterstock does not grant any express or implied permission to use the Site or any Shutterstock Content. You agree not to copy, republish, frame, link to, download, transmit, modify, adapt, create derivative works based on, rent, lease, loan, sell, assign, distribute, display, perform, license, sublicense or reverse engineer the Site or any Shutterstock Content. In addition, you agree not to use any data mining, robots or similar data and/or image gathering and extraction methods in connection with the Site or Shutterstock Content.


Stack Overflow

Version: January 12, 2022

Relevant Restrictions:

6. Content Permissions, Restrictions, and Creative Commons Licensing

Stack Overflow Content

Other than as expressly set forth in these Public Network Terms, you may not copy, modify, publish, transmit, upload, participate in the transfer or sale of, reproduce (except as provided in this Agreement), create derivative works based on, distribute, perform, display, or in any way exploit any of the Network Content, software, materials, or Services in whole or in part. You may download or copy the public Network Content, and other items displayed on the public Network for download or personal use provided that you maintain all copyright and other notices contained in such Public Content.

From time to time, Stack Overflow may make available compilations of all the Subscriber Content on the public Network (the “Creative Commons Data Dump”). The Creative Commons Data Dump is licensed under the CC BY-SA license. By downloading the Creative Commons Data Dump, you agree to be bound by the terms of that license.

Any other downloading, copying, or storing of any public Network Content (other than Subscriber Content or content made available via the Stack Overflow API) for other than personal, noncommercial use is expressly prohibited without prior written permission from Stack Overflow or from the copyright holder identified in the copyright notice per the Creative Commons License. In the event you download software from the public Network (other than Subscriber Content or content made available by the Stack Overflow API) the software including any files, images incorporated in or generated by the software, the data accompanying the software (collectively, the “Software”) is licensed to you by Stack Overflow or third party licensors for your personal, noncommercial use, and no title to the Software shall transfer to you. Stack Overflow or third party licensors retain full and complete title to the Software and all intellectual property rights therein.


Substack

Version: August 17, 2021

Relevant Restrictions:

Acceptable Use Policy

You also agree that you will not contribute any Post or otherwise use Substack in a manner that:

  • “Crawls,” “scrapes,” or “spiders” any page, data, or portion of Substack (through use of manual or automated means);
  • Copies or stores any significant portion of the content on Substack;
  • Decompiles, reverse engineers, or otherwise attempts to obtain the source code or underlying ideas or information of or relating to Substack.

TechCrunch

Version: April 27, 2023

Relevant Restrictions:

2. Using the Services
h. Ownership and Reuse. Using the Services does not give you ownership of any intellectual or other property rights or interests in the Services or the content you access. You must not use any branding or logos used in the Services unless we have given you separate explicit written permission. You may not remove, obscure, or alter any legal notices displayed in or along with the Services. Unless you have explicit written permission, you must not reproduce, modify, rent, lease, sell, trade, distribute, transmit, broadcast, publicly perform, create derivative works based on, or exploit for any commercial purposes, any portion or use of, or access to, the Services (including content, advertisements, APIs, and software).

Trip Advisor

Version: December 15, 2022

Relevant Restrictions:

Prohibited Activities

For all Content other than your Content, you agree not to otherwise modify, copy, distribute, transmit, display, perform, reproduce, publish, license, create derivative works from, transfer or sell or re-sell any information, software, products or services obtained from or through the Services. Additionally, you agree not to:

  • (i) use the Services or Content for any commercial purpose, outside the scope of those commercial purposes explicitly permitted under this Agreement and related guidelines as made available by the Tripadvisor Companies;
  • (ii) access, monitor, reproduce, distribute, transmit, broadcast, display, sell, license, copy or otherwise exploit any Content of the Services, including but not limited to, user profiles and photos, using any robot, spider, scraper or other automated means or any manual process for any purpose not in accordance with this Agreement or without our express written permission;
  • (vi) “frame”, “mirror” or otherwise incorporate any part of the Services into any other websites or service without our prior written authorisation;

The Washington Post

Version: July 1, 2014

Relevant Restrictions:

  1. Copyright… Except for content that you have posted on the Services, or unless expressly authorized by The Washington Post in writing, you are prohibited from publishing, reproducing, distributing, publishing, entering into a database, displaying, performing, modifying, creating derivative works, transmitting, or in any way exploiting any part of the Services, except that you may make use of the content for your own personal use as follows: you may make one machine readable copy and/or print copy that is limited to occasional articles of personal interest only.

  2. Prohibited Conduct… You may not access or use, or attempt to access or use, the Services to take any action that could harm us or any third party, interfere with the operation of the Services, or use the Services in a manner that violates any laws. For example, and without limitation, you may not:

  • Make use of the contents of the Services in any manner that constitutes an infringement of our rights or the rights of other users or third parties, including copyrights.
  • Copy, reproduce, distribute, publish, enter into a database, display, perform, modify, create derivative works, transmit, or in any way exploit any part of the Services, except for content you have posted on the Services, or unless expressly authorized. You may download material from the Services solely for your own personal use as follows: you may make one machine readable copy and/or one print copy that is limited to occasional articles of personal interest only.
  • Distribute any part of the Services over any network, including a local area network, nor sell or offer it for sale. See our Reprints & Permissions section for more information on distribution. In addition, these files may not be used to construct any kind of database.
  • Engage in unauthorized “scraping” or spidering, or harvesting of personal information, or use any unauthorized automated means to compile information.
  • Use or attempt to use any engine, software, tool, agent, or other device or mechanism (including, without limitation, browsers, spiders, robots, avatars, or intelligent agents) to navigate or search the Services other than the search engine and search agents available on the Services and other than generally available third-party web browsers.

WebMD

Version: July 31, 2020

Relevant Restrictions:

Use of the Content

The Content posted on this Site is protected by the copyright laws in the United States and in foreign countries. WebMD authorizes you to view or download a single copy of the Content solely for your personal, noncommercial use if you include the copyright notice located at the end of the material, for example: “©2016, WebMD, LLC. All rights reserved” and other copyright and proprietary rights notices that are contained in the Content…

Title to the Content remains with WebMD or its licensors. Any use of the Content not expressly permitted by these Terms and Conditions is a breach of these Terms and Conditions and may violate copyright, trademark, and other laws.


WeddingWire

Version: February 14, 2022 😆

Relevant Restrictions:

8. Rules for Using the Services

You must comply with all applicable laws and contractual obligations when you use the Services. In using the Services, you also agree to abide by the rules outlined below.

Users of the Services

As a User of the Services, you expressly agree not to:

  • “Harvest,” “scrape,” “stream catch” or collect information from the Services using an automated software tool (including but not limited to use of robots, spiders, or similar means), or manually on a mass basis (unless we have given you separate written permission to do so); This includes, for example, information about other Users of the Services and information about the offerings, products, services and promotions available on or through the Services;
  • Take any action that imposes an unreasonable or disproportionately large load on the infrastructure of the Services or our systems or networks, or any systems or networks connected to the Services, including by “flooding” the Services with requests;
  • Use the Services to gain competitive intelligence about us, the Services, or any product offered via the Services or to otherwise complete with us or our affiliates, or use information on the Services to create or sell a similar product or information;

9. Protection of Intellectual Property Content

Our Services contain copyrighted material, inventions, know-how, potentially patentable business method material, design logos, phrases, names, logos, HTML code and/or other computer code and/or scripts (collectively, “Intellectual Property Content”). Unless otherwise indicated and/or provided pursuant to a third-party license, our Intellectual Property Content is our sole property, and we retain all appurtenant rights, interests and title thereto. We also claim ownership rights under the copyright and trademark laws with regard to the “look,” “feel,” “appearance” and “graphic function” of this Services, including but not limited to its color combinations, sounds, layouts and designs.

You may use the Services (including any content and materials included on the Services) for your own personal, non-commercial use, but you may not use it for commercial purposes. You may not modify, copy, reproduce, republish, upload, post, transmit, translate, sell, create derivative works, exploit, or distribute in any manner or medium (including by email or other electronic means) any material from the Services unless explicitly authorized in these Terms. You may not frame or link to the Services without our prior written permission.


Wikipedia

Version: June 7, 2023

Relevant Restrictions:

  1. Licensing of Content To grow the commons of free knowledge and free culture, all users contributing to the Projects or Project Websites are required to grant broad permissions to the general public to redistribute and reuse their contributions freely, so long as that use is properly attributed and the same freedom to reuse and redistribute is granted to any derivative works. In keeping with our goal of providing free information to the widest possible audience, we require that when necessary all submitted content be licensed so that it is freely reusable by anyone who may access it.

    g. Re-use: Reuse of content that we host is welcome, though exceptions exist for content contributed under “fair use” or similar exemptions under applicable copyright law. Any reuse must comply with the underlying license(s).

    When you reuse or redistribute a text page developed by the Wikimedia community, you agree to attribute the authors in any of the following fashions:

    i. Through hyperlink (where possible) or URL to the page or pages that you are reusing (since each page has a history page that lists all contributors, authors and editors);

    ii. Through hyperlink (where possible) or URL to an alternative, stable online copy that is freely accessible, which conforms with the license, and which provides credit to the authors in a manner equivalent to the credit given on the Project Website; or

    iii. Through a list of all authors (but please note that any list of authors may be filtered to exclude very small or irrelevant contributions).

If the text content was imported from another source, it is possible that the content is licensed under a compatible CC BY-SA license but not GFDL (as described in “Importing text,” above). In that case, you agree to comply with the compatible CC BY-SA license and do not have the option to relicense it under GFDL. To determine the license that applies to the content that you seek to reuse or redistribute, you should review the page footer, page history, and discussion page. In addition, please be aware that text that originated from external sources and was imported into a Project may be under a license that attaches additional attribution requirements. Users agree to indicate these additional attribution requirements clearly. Depending on the Project, such requirements may appear, for example, in a banner or other notations pointing out that some or all of the content was originally published elsewhere. Where there are such visible notations, reusers should preserve them.

For any non-text media, you agree to comply with the applicable license under which the work has been made available (which can be discovered by clicking on the work and looking at the licensing section on its description page or reviewing an applicable source page for that work). When reusing any content that we host, you agree to comply with the relevant attribution requirements as they pertain to the underlying license or licenses.

h. Modifications or additions to material that you reuse: When modifying or making additions to text that you have obtained from a Project Website, you agree to license the modified or added content under CC BY-SA 4.0 or later (or, as explained above, another license when exceptionally required by the specific Project edition or feature).

When modifying or making additions to any non-text media that you have obtained from a Project website, you agree to license the modified or added content in accordance with whatever license under which the work has been made available.

With both text content and non-text media, you agree to clearly indicate that the original work has been modified. If you are reusing text content in a wiki, it is sufficient to indicate in the page history that you made a change to the imported text. For each copy or modified version that you distribute, you agree to include a licensing notice stating which license the work is released under, along with either a hyperlink or URL to the text of the license or a copy of the license itself.


Yelp

Version: December 13, 2019

Relevant Restrictions:

  1. CONTENT

    C. Ownership. As between you and Yelp, you own Your Content. We own the Yelp Content, including but not limited to visual interfaces, interactive features, graphics, design, compilation (including, but not limited to, our selection, coordination, aggregation, and arrangement of User Content and other Service Content), computer code, products, software, aggregate star ratings, and all other elements and components of the Service excluding Your Content, User Content and Third Party Content. We also own the copyrights, trademarks, service marks, trade names, trade secrets, and other intellectual and proprietary rights throughout the world associated with the Yelp Content and the Service, which are protected by copyright, trade dress, patent, trademark, and trade secret laws and all other applicable intellectual and proprietary rights and laws. As such, you may not sell, license, copy, publish, modify, reproduce, distribute, create derivative works or adaptations of, publicly display or in any way use or exploit any of the Yelp Content in whole or in part except as expressly authorized by us. Except as expressly and unambiguously provided herein, we do not grant you any express or implied rights, and all rights in and to the Service and the Yelp Content are retained by us.

  2. REPRESENTATIONS AND WARRANTIES

    B. You also represent and warrant that you will not, and will not assist, encourage, or enable others to use the Service to:

    v. Promote a business or other commercial venture or event, or otherwise use the Service for commercial purposes, except in connection with a Business Account in accordance with the Business Terms;

    ix. Modify, adapt, appropriate, reproduce, distribute, translate, create derivative works or adaptations of, publicly display, sell, trade, or in any way exploit the Service or Service Content (other than Your Content), except as expressly authorized by Yelp;

    x. Use any robot, spider, Service search/retrieval application, or other automated device, process or means to access, retrieve, copy, scrape, or index any portion of the Service or any Service Content, except as expressly permitted by Yelp (for example, as described at www.yelp.com/robots.txt);


Background Information

Are Browsewrap Contracts Valid?

Maybe. Browsewrap contracts (which apply to anyone who visits a website) don’t hold up as well as, say, clickwrap agreements (where you have to affirmatively agree to terms, e.g., by clicking “I agree” at the end of a set of terms). But browsewrap contracts can be upheld, and someone suing a generative AI builder can make sure to sue in a more favorable jurisdiction². In general, courts appear to consider whether someone accessing the site should have been on notice of the restrictions. In the case of generative AI training, the large scale of the scraping may make courts feel that generative AI trainers should have taken more than the usual care when pulling content from websites. Or not. We’ll see!

A related issue: if a site allows scraping via its robots.txt file but has terms of use that disallow scraping for purposes like gathering training data for a generative AI, courts will have to decide whether the terms of use or the robots.txt file controls.
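To make the robots.txt side of that tension concrete, here is a minimal Python sketch of the check a well-behaved crawler might run. The robots.txt contents, the `ExampleBot` user agent, the `example.com` URLs and the `robots_allows` helper are all hypothetical and not taken from any site reviewed above. The sketch only shows that robots.txt answers "may this bot fetch this path?" and says nothing about what the fetched content may be used for.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only
# (not copied from any site discussed in this piece).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def robots_allows(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The article path is not disallowed, so a crawler would treat it as fair game.
print(robots_allows(ROBOTS_TXT, "ExampleBot", "https://example.com/articles/1"))
# The /private/ path is disallowed for every user agent.
print(robots_allows(ROBOTS_TXT, "ExampleBot", "https://example.com/private/page"))
```

Nothing in this check reads the site's terms of use, which are written in prose for humans. A crawler can pass the robots.txt test and still be doing something the terms prohibit, which is exactly the gap a court would have to resolve.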

How We Picked Which Sites To Examine

We don’t know for sure which sites OpenAI, Google, Cohere, Midjourney and the like trained on³. However, the set we chose seems like a decent starting place to us. Our selections came via:

  • A Washington Post piece that analyzed Google’s C4 data set. C4 is “a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA. (OpenAI does not disclose what datasets it uses to train the models backing its popular chatbot, ChatGPT).”
  • Our own editorial judgment about which sites might be interesting.
    • We think lawsuits against generative AI builders from commercial websites are more likely, since they are perhaps more likely to desire compensation for using their content. So we perhaps oversampled these a bit.
    • Sites that we ourselves use.
    • While some of us speak additional languages beyond English, we decided to restrict our review to English-language terms of use.

Are These The Relevant Contracts?

We did this study in June and July 2023. The relevant terms of use for lawsuits are probably those in effect when the generative AI providers scraped the websites in question. This scraping will have happened at various times. For example, the main crawl for OpenAI’s GPT-4 seems to have happened in 2021⁴. For expediency, we based our study on the terms of use in effect at the time of our review. These may differ in important ways from the terms of use in effect when the sites were crawled. Notably, back in 2021, few were worried about generative AIs being trained on their content. Today, many are. We expect lawyers suing generative AI builders on terms of use grounds to pull the correct, then-in-effect terms of use. For this piece, we think it’s fine to be directionally correct. So, we feel okay about only looking at the current terms of use.


  1. There are other conditions too, but these don’t seem worth focusing on. E.g., the website owner would have to sue (some seem happy to have had their content used to train such cool tech), and the website owner can’t have given the LLM trainer permission.

  2. For a TON of high quality content on internet contracting, see Professor Eric Goldman’s Technology and Marketing Law Blog.

  3. Stability AI has been transparent about their dataset, probably making it easier to sue them. Note that the FTC has asked OpenAI to disclose what it has trained on.

  4. In response to the prompt “when is your knowledge current as of?,” GPT-4 responds “My training only includes knowledge up until September 2021.”