Ask for forgiveness, not permission
All machine learning models are built on real-world data. The question being asked now is: what data have these new LLMs like ChatGPT been built on, and who gave permission to use it?
A guiding principle behind most startups is that it’s better to try the thing, push the boundary and achieve something great, then ask for forgiveness later if a line was crossed, rather than ask for permission to cross that line in the first place.
Famous examples of this approach are the ride-hailing company Uber and the short-term home-stay provider Airbnb.
Uber improved the accessibility and availability of cabs across the globe and lowered their cost, but it did so by ignoring many local labour and public transport laws.
Airbnb allowed homeowners to make additional money from spare rooms in their homes, whilst improving the availability of beds in many cities and offering unique experiences. It did this whilst ignoring local tourism laws, and it spawned a class of property investors focused on short-term holiday lets, ultimately raising property prices and pricing locals out of their own neighbourhoods. All of this whilst expecting guests to clean their rooms after charging them a cleaning fee.
Both organisations have used the glacial pace of government regulation to move into a market, grow market share, and only face censure once entrenched, with a position to negotiate from.
This behaviour is not new, though. Companies like Starbucks were testing the edges of legislation long before Airbnb or Uber existed.
It is a common strategy amongst startups looking to disrupt the status quo: asking for permission to do something that would upend a market is, in most cases, a non-starter. The approach can be pushed too far, though, and Uber and Airbnb are two of the more egregious examples of this strategy.
Schrödinger’s AI
How does this apply to LLMs and ChatGPT? Despite being called OpenAI, the creators of ChatGPT have been very closed about how they gathered the data used to build the model. Their privacy policy, by contrast, is very open: users effectively give consent for OpenAI to use any information they supply, whether explicitly (e.g. prompts) or implicitly (e.g. IP addresses and User-Agent data).
If you use ChatGPT to summarise confidential information, you are effectively leaking that information to OpenAI. If you work in a regulated sector, banking for example, or for a company listed on a public stock exchange, that leak is something that needs to be reported to a regulator.
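To make the leak concrete, here is a minimal sketch of a "summarise this document" call, assuming the pre-1.0 `openai` Python client (the exact API shape may differ in your version); the file name is hypothetical:

```python
# Minimal sketch: any text placed in a prompt is transmitted to
# OpenAI's servers as part of the API request.
import openai

openai.api_key = "sk-..."  # your API key

# Hypothetical confidential document.
confidential_text = open("q3_board_minutes.txt").read()

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": f"Summarise this:\n\n{confidential_text}"},
    ],
)

# The summary comes back, but the full confidential text has already
# left your network and, depending on the terms in force, may be used
# to improve future models.
print(response.choices[0].message.content)
```

The point is not the summary; it is that the request body carries the entire document to a third party, with consent granted by the privacy policy rather than by your compliance team.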
A recent example comes from Samsung, where three separate incidents of employees leaking internal data through ChatGPT had to be disclosed publicly. OpenAI’s answer to this is not to make the model open and available so that people can run their own instances of it, but rather to partner with their investor Microsoft to sell you a private instance of ChatGPT as a service, one that won’t use your inputs to train the core model.
OpenAI’s lack of transparency about what happens to user data, and where the training data was collected from, has led Italy to place a temporary ban on ChatGPT until the existing privacy policy is expanded to explain in more detail what is done with that data.
Cookie Monster
So where did the data for the original model come from? This is the chicken-and-egg problem for AI: the model can be refined by capturing user prompts and other information, but the original model needs external data. The recent change to Reddit’s commercial model around API access gives us a clue. LLMs are, as the name says, very large models, and they work partly by virtue of the sheer amount of data used to train them.
They need billions of parameters, and correspondingly vast training corpora, to be effective. That data has most likely come from publicly accessible websites that OpenAI can scrape, or access programmatically via an API. Whether they have the rights to use the source data is highly questionable, and it cannot be verified without the model, and its training set, being opened up for inspection.
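To illustrate how low the barrier to this kind of collection is, here is a hedged sketch of a scraper, assuming the `requests` and `beautifulsoup4` libraries; the URL is hypothetical, and real training pipelines (e.g. those built on Common Crawl) are far more elaborate:

```python
# Minimal sketch: fetch a public page and strip it down to plain text,
# the raw material of an LLM training corpus.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-public-article"  # hypothetical page

html = requests.get(url, timeout=10).text
text = BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)

# Nothing in this flow checks the page's copyright notice, licence,
# or terms of service; the text becomes "training data" unless someone
# deliberately adds those checks.
print(text[:500])
```

Run at the scale of billions of pages, the question of whether anyone had the right to use each page’s content becomes practically impossible to answer from the outside, which is exactly the verification problem described above.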
OpenAI and its investor Microsoft have already been tested on this boundary with the release of GitHub Copilot, a GPT-powered code assistant trained on all public code on GitHub, regardless of the licence attached to that code. The ongoing lawsuit will most likely set a precedent that applies to ChatGPT as well, at least until definitive legislation is drafted that provides better guidance.
Right now OpenAI are looking to push the use of ChatGPT so that it becomes entrenched in as many organisations as possible, which will give them leverage and a position to negotiate from when the time comes to define legislation that would potentially limit their ability to gather data to train their model.
OpenAI are pushing the boundary, knowing they will most likely be asking for forgiveness down the line. That point of forgiveness, when legislation is created to cater for ChatGPT, has the potential to redefine copyright law, and the definition of derivative works, in unintended ways.
Uber and Airbnb have had many unintended consequences. The unintended consequences of OpenAI are yet to be fully seen, but they have the potential to upend the broader creative industries in ways both good and bad.