Last week we hosted a webinar with Jessica Roberson and Jeremiah Colonna-Romano from the University of Alabama on how they were experimenting with OpenAI's GPT models for their archival metadata workflows. In the Q&A Jessica and Jeremiah were asked:
Did you think of any privacy or copyright considerations when you were putting this content into ChatGPT?
I had to chime in because this is a frequent question when I’m talking about AI with library folks. It’s important to think about and discuss, as Jeremiah said, “what kind of exposure is your project and group willing to take.” I also think there is a lot of FUD – fear, uncertainty, and doubt – when it comes to understanding what data is being used for what purposes.
To overcome that fear, uncertainty, and doubt, I encourage everyone to read OpenAI's three privacy statements:
- Enterprise Privacy, which governs ChatGPT Enterprise, ChatGPT Team, and the API. If you’re writing programs using OpenAI’s models or using third-party software that uses OpenAI’s models, this covers you.
- Privacy Policy, which covers the services that are not their “business services” (the ones covered by the Enterprise Privacy above). If you’re using a free or individual paid version of ChatGPT, this is what covers you.
- EU Privacy Policy, which applies if you're using the free or individual paid version of ChatGPT and reside in the European Economic Area (EEA), Switzerland, or the UK.
If, like the University of Alabama, you're using their API in scripts, they explicitly say:
We do not use your ChatGPT Team, ChatGPT Enterprise, or API data, inputs, and outputs for training our models.
And
[Y]ou retain all rights to the inputs you provide to our services and you own any output you rightfully receive from our services to the extent permitted by law.
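To make "using the API in scripts" concrete, here's a minimal sketch of the kind of call those terms cover. The model name, prompt, and item text are placeholders I've made up for illustration, not what the Alabama team actually used; it assumes the openai Python package (v1.x) and an OPENAI_API_KEY in your environment.

```python
# Minimal sketch: drafting descriptive metadata through the OpenAI API.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

# Placeholder item text; in practice this would come from your repository or finding aid.
item_text = "Typescript letter, 1923, from the university archives regarding campus construction."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any chat-capable model works the same way
    messages=[
        {"role": "system",
         "content": "You are an archivist writing concise descriptive metadata."},
        {"role": "user",
         "content": f"Write a one-sentence abstract for this item:\n\n{item_text}"},
    ],
)

print(response.choices[0].message.content)
```

A request like this falls under the Enterprise Privacy commitments quoted above; the same prompt typed into the free ChatGPT interface would fall under the consumer Privacy Policy instead.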
If you're using ChatGPT itself, the web-based interface, then the rules are different: they keep that information and use it to help train their models. The Enterprise Privacy Policy does a better job of saying this than the end-user Privacy Policy:
What sources of data are used for training OpenAI models?
We also use data from versions of ChatGPT and DALL·E for individuals. Data from ChatGPT Team, ChatGPT Enterprise, and the API Platform (after March 1, 2023) isn't used for training our models.
OpenAI wants businesses to use their service, and businesses do not want their data stored elsewhere and incorporated into something anyone else can use. That means OpenAI's incentive is to keep API and enterprise data private so that people will continue to use their models. They aren't pulling that data and using it to train their models.
Many of you run digital library systems that publish metadata and text to the web. Those sites are being crawled for training large language models all the time. So if you're using ChatGPT to help you describe an item that you're just going to put on the web... Does it even matter?