This is the best summary I could come up with:
In a blog post, OpenAI said website operators can specifically disallow its GPTBot crawler in their site’s robots.txt file or block its IP address.
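For anyone wanting to try the robots.txt route, here is a minimal sketch in Python. It assumes the two-line rule (`User-agent: GPTBot` followed by `Disallow: /`) is what you put in your robots.txt, and it uses the standard library's `urllib.robotparser` to check how a crawler that honors robots.txt would read it. The `is_allowed` helper and the `example.com` URLs are made up for illustration and are not from the article.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: block GPTBot from the whole site.
# No other crawler is mentioned, so other crawlers are unaffected.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text.

    Hypothetical helper for illustration; it only parses the text locally
    and never makes a network request.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "SomeOtherBot"):
        verdict = is_allowed(ROBOTS_TXT, agent, "https://example.com/page")
        print(f"{agent}: {'allowed' if verdict else 'blocked'}")
```

Keep in mind that robots.txt is only a request that well-behaved crawlers follow voluntarily. Blocking by IP address, the other option mentioned, would have to be done at the server or firewall level instead.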
“Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies,” OpenAI said in the blog post.
For sources that don’t fit the excluded criteria, “allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety.”
Blocking the GPTBot may be the first step in OpenAI allowing internet users to opt out of having their data used for training its large language models.
However, OpenAI won’t confirm whether its training data came from social media posts or copyrighted works, or which parts of the internet it scraped for information.
Sites, including Reddit and Twitter, have pushed to crack down on the free use of their users’ posts by AI companies, while authors and other creatives have sued over alleged unauthorized use of their works.