If you run a WordPress website or have a blog on Tumblr, you’ve probably produced and published a sizable amount of content there. While we all know the internet isn’t “private,” you probably posted those texts and images thinking they were yours, and wouldn’t be stolen by the very companies you relied on to host them.
As it happens, WordPress and Tumblr are preparing to do just that. As first reported by 404 Media, Automattic, the parent company of both sites, has entered into a deal to sell user data from Tumblr and WordPress to AI companies like Midjourney and OpenAI. The AI companies intend to use the data to train their systems.
As if that weren’t bad enough, preparations for the sale went poorly, and it seems large categories of Tumblr posts that weren’t supposed to be sold were added to the mix anyway. That data includes:
- Private posts from public accounts
- Posts on deleted or suspended accounts
- Unanswered asks
- Private answers
- Explicit posts
- Posts from partner accounts, like ad campaigns where Tumblr doesn’t own the rights. (Apple is specifically named here.)
It’s possible this data was not actually sent to OpenAI and Midjourney, and that it was simply identified and cleared for that use. However, 404 Media could not confirm this. They could confirm, however, that password-protected posts, direct messages, and media identified as CSAM were not in the bunch. So…that’s good.
It might not be all WordPress sites
Automattic specifies that only WordPress.com sites are affected by this data scraping, as opposed to content created on the WordPress CMS that you might use with a site hosted elsewhere. In theory, your WordPress CMS sites not hosted with Automattic should be safe from these actions.
That said, 404 Media could not confirm whether using Automattic plugins like Jetpack would bring a self-hosted site under Automattic’s scummy data-sharing policies.
You don’t need to be OK with Automattic selling your data
A source tells 404 Media that Automattic will be adding a new setting for its properties on Wednesday to allow users to opt out of having their data sold to or shared with third-party companies. The outlet received a copy of a new FAQ section, which details that this opt-out option will block crawlers from accessing your sites if you enable it “from the start.” If you choose to opt out later, Automattic will contact partners and “ask” that they remove your content from their datasets and training.
This wording is not particularly encouraging. However, whenever Automattic does release this opt-out option, I suggest you use it on your Tumblr and WordPress sites anyway.
Following the 404 Media piece, Automattic published a statement saying that it blocks major AI platform crawlers and updates its lists to add new ones; that it has features to block search engines from indexing your sites, which can also discourage AI crawling; and that it only shares public content hosted on WordPress and Tumblr from sites that haven’t opted out. That said, the company admits no law requires crawlers to abide by these preferences, and says it is working with certain AI companies, “as long as their plans align with what our community cares about: attribution, opt-outs, and control.”
What will AI companies do with this data?
Companies like Midjourney and OpenAI require huge datasets to train their AI systems. Programs like Midjourney and ChatGPT wouldn’t be possible without pushing enormous amounts of information their way: It’s how they “learn” how to do the things they do.
So your WordPress blog posts filled with your favorite recipes can be fed to generative AI models to train them on how to “talk” about food (or anything at all); your photo dumps on Tumblr can train models on how to recognize subjects like a car or a bird. The data from all your sites, plus the sites of millions more users, is invaluable to AI companies, which makes it extremely valuable to the companies that own those sites and can sell it. Automattic will likely make a ton of money on this deal, just as Reddit will likely make a ton of money on its own AI content licensing deal with Google.
It’s fun to post and share on the internet, but it might be about time to take back what’s yours: If you don’t own the platform you’re sharing your original ideas on, consider taking them to one that you do own, before your ideas become training wheels for artificial intelligence.