Can I remove my personal data from GenAI training datasets? (knowingmachines.org)
103 points by randomlogin on Oct 30, 2023 | 117 comments


Not just personal information. Blogs, commercial content, news articles and more. Check out the Allen Institute for Artificial Intelligence's C4 dataset to see if anything you wrote was ingested:

https://c4-search.apps.allenai.org

Love the disclaimer: "The dataset is released under the terms of ODC-BY. By using this, you are also bound by the Common Crawl Terms of Use in respect of the content contained in the dataset."


The article weasels out at the end by claiming that companies “may be unable to comply” with requirements to delete personal data. It’s easy to comply - if there’s no other choice then you delete the model and all backups and derivative data that was trained in flagrant violation of the law.


For most companies “deleting the model” is equivalent to dissolving the company so that is equivalent to not being able to comply. More realistically what they would need to do is exit the market of the country that has such stupid laws.


>For most companies “deleting the model” is equivalent to dissolving the company

I'm okay with that. It's sad that you're not. If you're willing to start a business on such shady foundations, there's a really good chance your business will continue to make shady decisions in the future. It's better to find and remove the cancer early


I just disagree that it's shady. If you put your personal info into public circulation I think it's absolutely morally fine to have a model train on it.


There's a difference between being in public and being available for public use. Books in a book shop are "in public", but only those out of copyright are available for public use. Similarly, blog posts are in public, but not being given away for anyone to resell or include in their product.

For an image/photography example, I may have my photo taken at an event and sign a release or agree to the event's T&Cs that say they can put publicity photos on their social media or website, but I don't intend that my photo will be used by an unrelated party to generate images of my face (this may not actually be restricted, yet, but I'd prefer it to be and I suspect few would consider it a reasonable outcome).


> There's a difference between being in public and being available for public use. Books in a book shop are "in public", but only those out of copyright are available for public use

Copying books can be fair use if it is sufficiently transformative; copyright doesn't block all use of copyrighted material. A major case on this was Google building a search index for books:

https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

OpenAI is currently arguing that training a model is more transformative than building a search index; whether they win that argument remains to be seen.


Building and granting access to a generative model is far more damaging to the market prospects of the infringed works, which is also a factor. Like…the whole goal is for it to be able to replace what humans do, which is about as market-damaging as I can imagine. I’m not a lawyer, though.


This is true for some types of transformative work, but aggregating others' work to make money is not generally allowed. I think OpenAI is in for a hard time with this when people look at the results and see their work clearly.


Your data can be public without your consent. What then?


I may sue the person who made it public without my consent.


Which law is that?


The post that you're responding to doesn't make any claim about the law; it expresses that the defense that an AI company might be "unable to comply" with a command to remove user data is not an honest one.


"flagrant violation of the law" heavily implies that OP has some strong feelings around this, and I want to know what law they think is potentially being flagrantly violated.


There is currently no law in the US or EU that I'm aware of that considers models trained on PII or copyrighted data to include that data. This is the core issue everyone is discussing about copyright around models, and whether publicly scraped data is fair use for these purposes.

Only once this is decided, likely by a long series of trials in high courts or by passage of law through the appropriate legislative bodies (and then likely after many challenges), will GDPR and CCPA clearly apply to these language models.

So no, this is not currently a violation of any law.


I was daydreaming about other possible ways to comply, but I don't know if they'd really work in practice. Ideas like:

- Supply copyrighted keywords and approved responses to try to retrain the model.

- Train on the copyrighted content again with inverse weighting. I imagine this one wouldn't really work well while still keeping the model performance up...
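
A minimal sketch of that second idea, assuming a PyTorch model with a Hugging Face-style forward pass that returns a loss (the names are illustrative, not a proven unlearning method):

    def unlearn_step(model, optimizer, input_ids, labels):
        # Ascend (rather than descend) the loss on the content to forget,
        # nudging the model away from reproducing it. Naive ascent like this
        # tends to hurt general performance, as suspected above.
        optimizer.zero_grad()
        loss = model(input_ids=input_ids, labels=labels).loss
        (-loss).backward()  # negated loss => gradient ascent
        optimizer.step()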


I don't get it, you put that information on the internet, you have no expectation of privacy. But now maybe you learned a lesson, and you won't publicly share things you don't want people to see?


> I don't get it, you put that information on the internet, you have no expectation of privacy.

While that is somewhat true, there are multiple angles to it. The big one is expectations. It used to be that if you shared something on an obscure forum with like 100 readers, most people expected that only those 100 would see their posting. And it was true, so it reinforced the expectation for nontechnical people, which is basically everyone within a rounding error.

Then everything gets indexed and fed to AI, and suddenly billions of people have easy access to what was felt to be very limited in distribution.

You could file this under "People don't threat model correctly", but also (more importantly) under the fact that technology should consider human nature before breaking expectations.


You should consider the size of the training set as well. A blog post in a 30T-token dataset like RedPajama has less impact than one in a fine-tuning dataset of 1,000 examples. The gradients from all tokens are added up; they stack on top of each other, and a single document's influence is diluted in larger datasets.
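
As a back-of-the-envelope illustration, assuming updates that average per-example gradients, one document's share of any update shrinks with dataset size:

    # Toy illustration: one example's share of an averaged gradient update.
    def influence_share(dataset_size, my_examples=1):
        return my_examples / dataset_size

    print(influence_share(1_000))          # small fine-tuning set: 0.001
    print(influence_share(1_000_000_000))  # web-scale corpus: 1e-09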


Do you ever save things you see on the internet that have some "scandal" merit? Say, for example, you don't like Donald Trump and you come across a variety of material that's embarrassing to him. Or your school bully, or your gf's ex, or Macron in France, Trudeau in Canada, etc. Would you feel that other people should get to tell you to delete it because you're not allowed to use any aids to help you remember or prove why they are bad people (in your estimation)? Freedom to remember is part and parcel of free speech.


Did you read the article? There are cases where the data was explicitly not publicly shared. Why would this artist have no expectation of privacy? https://arstechnica.com/information-technology/2022/09/artis...

Even for publicly shared data, there is legislation in many parts of the world (notably the entire EU) that enforces an individual's right to have their data deleted.


Did you read what I said? Anything posted on the internet is public. Just because you think it's private does not make it so.

To that end, you should consider any file on your computer to be public as well if it's connected to the internet, as there are countless ways that data could be exfiltrated.


That may well be true, but it's not what the article is about. The article is about individuals removing personal data from GenAI products. GenAI companies in many places have a legal responsibility to facilitate this. Your concern about all internet-connected data being "public" is only a concern if you're dealing with malicious actors, which is not what's at hand here.


And like I have been saying, the argument is foolish. You posted something on the internet. That makes it public. It doesn't matter if you thought it was private. Generative AI companies in the US, afaik, do not have any legal responsibility to remove that training data.

There's no difference between an AI being trained on public data, or a human being trained on public data. Likewise there should be no expectation to "unsee" something someone willingly posted in public.


This is similar to saying that if girls don't want to be molested, they should not wear a mini skirt in public.

One may not have been aware of the future existence of AI when writing on a public platform.

Or one may simply want his blog/thoughts to be intended for human consumption, not to train the latest commercial product, which will reap from his content and not share any profit.


How is it similar to that strawman?

There is no expectation of privacy in public. That is pretty much the definition. It’s not victim blaming.


I'm free to take pictures of people in public. I'm not free to monetize / share them freely without consent. There is an argument to be made that this should be the case for stuff people post on the internet as well.


> There is no expectation of privacy in public

This is claimed a lot but absolutely not true. Stop repeating this very obviously false statement.


Obviously false? I’m sorry but “private” and “public” are exact opposites to one another. Being “in public” is quite exactly where you don’t have a general expectation of being “in private”.


Essentially what the sibling says. Does your expectation of privacy change when: you're taking the subway? When you're in an empty car? When you're at a concert packed with people? When you're tucked away trying to escape the crowd? When you're in a public restroom? What about if you're walking alone in the middle of the night and no one is around? What about in your house with your blinds open? Or in your front yard? What about in your front yard if you live 5 miles away from your neighbor?

If your expectation of privacy changes in any way with any of these options, then yes, you have an expectation of privacy in public. It is a spectrum, not a binary decision. But if you think it is a binary decision you're welcome to try to record people going to the bathroom in public restrooms and see what happens. I will not be covering your lawyer fees.


Yes, they are in opposition, but it is a non-sequitur to then associate it with privacy. And most critically, to consider privacy to be a binary attribute, rather than a multidimensional and graduated quality.

Do you believe you have the right not to have a satellite track or drone hover over you in public and share that feed, associated with your identity, with the whole world? I sure do. You should too.


And I want to add how quickly things have changed in this space. In 2005, only 2% of Americans had smartphones. By 2010 that was still under 30%; by 2016, more than 80%. Not to mention that these phones' capabilities have drastically changed as well. In 2010 you could walk around in public and be very surprised to find yourself in the background of someone's video posted online. Now I don't think anyone would be shocked, especially living in a big city (they definitely would be in a rural town). Let alone that the video would be high enough quality to recognize you. We can probably say the same about the number of CCTV cameras, or even how much data was being gathered on the web.

I don't think enough people are internalizing how radically the world has shifted in the last 10-20 years. 10 years ago generative image models couldn't generate a black and white 32x32 image of a face. Now I can trivially transform a photo of my friend into a dog and make him bark. For most people this technology came out of nowhere because of the lag. The same holds true for other things like the internet, cellphones, social media, etc. Is it really that much of a wonder why there's so much craziness going on right now when so much has changed so fast?

And fwiw, speaking of satellites, there are now multiple companies (not governments. Companies) whose goal is to get a daily snapshot of the entire surface of the earth. They already have cameras where you can clearly see the color of cars from space. There are some where you can see people (albeit hard to see, but you only need to identify a dot to track it). We should now expect that resolution to increase, not only in number of pixels but in frequency. This technology can do a lot of good, just like ML, but at the same time it can be used for a lot of destruction and oppression, just like ML. The question is whether these technologies are going to go the way of atomics or the way of the internet. But even simply saying this, I, a dumb human who's barely more advanced than our chimp cousins and barely capable of communicating, have a hard time internalizing what this future world will be like, despite it being only 10 or so years away.


> saying that if girls don't want to be molested, they should not wear a mini skirt in public

such a ridiculous example if you are trying to convince me that you care about these girls' safety as opposed to some abstract egg-headed right.


The “right to be forgotten” is an explicit right in the EU, which I think applies here.


I'm not saying that this right is a bad thing, but the reality is that it's rarely enforceable. As soon as you put something on the internet, you have every "right" to expect that it's there forever.

No amount of legislation can outweigh the technical reality.


The government has no problem getting rid of CSAM and terrorist material, for practical purposes. It's not a technical problem.


It is obviously a technical challenge. Imagine yourself facing 3 pictures: one of CSAM, another of terrorist material, and just some random photo of a random guy eating ice cream. What steps would you take to determine whether this last photo is "legal" or not?

YouTube has spent years attempting to determine whether uploaded content violates copyright, normally by fingerprinting content submitted by rights holders. YouTube still fails to catch all copyright infringement. That's why they are generally protected from prosecution while they demonstrate a reasonable attempt at preventing their services from being misused.

Imagine ALL IMAGES IN THE WORLD being submitted for fingerprinting. How about malicious or erroneous submissions that taint datasets?

Also: "right to be forgotten" is a horribly misleading name. At most you have the right to request that data be removed from datasets, and only if you have some legal basis demonstrating that your data is in that dataset. It is not a right to be "forgotten." Imagine being told yourself to "forget an image, sound, etc." How is this enforced? You may even be asked for proof that you know the content you are being told to forget. How do you do this?
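
For a concrete sense of how fingerprint matching works, and why it misses things, here is a minimal sketch using the open-source imagehash library as a stand-in for proprietary systems like Content ID or PhotoDNA (filenames are placeholders):

    from PIL import Image
    import imagehash

    # Hashes of known flagged content, as submitted by rights holders/authorities.
    known_hashes = {imagehash.phash(Image.open("flagged_content.png"))}

    def looks_like_known(path, max_distance=5):
        # Perceptual hashes survive re-encoding and small edits, but fail on
        # crops, heavy filters, or adversarial tweaks - hence the misses.
        h = imagehash.phash(Image.open(path))
        return any(h - known <= max_distance for known in known_hashes)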


The “right to be forgotten” works nicely in tandem with the “right to know” - where you get to ask companies to tell you what they have on you.


If you have the image and the person reaches out saying it's them and they want it removed, you have a pretty good idea of the legality. As stated, this isn't a technical problem; it's a bureaucratic one. Most companies didn't think they had to remove it, so they didn't have steps in place to do so. Now a lot of them do, with GenAI just being the latest group of companies who feel they have an argument in favor of not having to do so.


They only get rid of those materials that they can seize. I doubt they will be able to seize a properly hosted onion domain. It's just that most of the actors aren't good enough with technology.

That's the reason you mostly see the pretty dumb guys getting caught. As long as you are smarter and more tech-savvy than 80% of the criminals, you are pretty much out of reach for the feds.


The government has enormous trouble getting rid of CSAM. The FBI has databases of content that is decades old - basically images recognised as "classics" that are still shared amongst the consumers of such things.

It is illegal to possess, but despite the most aggressive enforcement on the planet it is still out there.


In the case of CSAM, the trouble is in identifying who has a copy of it.

In the case of the personal data in this article, it has already been identified which legal entity has it.


> One artist even found that her private medical records containing photos of her facial condition were in LAION, a foundational dataset for many GenAI products.

From the article I'm extremely skeptical people have actually identified anything, since LAION does not store content. LAION is a list of links to content hosted by others, publicly.


> As soon as you put something on the internet, you have every "right" to expect that it's there forever.

Maybe realistically any uploaded content can be found somewhere on the internet forever (I doubt that, tbh), but this doesn't mean we should throw our hands up and say "it's impossible to regulate".

We already have precedent for this in the form of copyright/DMCA takedown requests. The MPA/RIAA has a legal avenue to protect their content, why should it be any different for individuals?


Americans are not bound by EU laws. We appreciate forcing iPhone to USB-C, but you can keep the rest of your authoritarian laws.


EU rights do not apply in the US.


True, but globally operating corps are subject to all kinds of national laws, and as soon as you have any presence in the EU those regulations apply.

An online shop in Germany once removed Cuban cigars because of US trade sanctions.

Clearview just declared it will ignore the right to be forgotten, even for EU citizens. It will be interesting to see how this plays out. I guess it will take a few years…


No one is implying that. But companies have to comply with laws in jurisdictions where they operate. OpenAI and other generative AI companies certainly operate in the EU.


The 'right to be forgotten' isn't a right.


> you put that information on the internet, you have no expectation of privacy

That was the expectation until Zuckerberg came along. People were putting tons of stuff on BBSs, intranets, and elsewhere 20+ years beforehand because the idea of "consent" existed back then: using someone's information for a purpose they didn't intend was (and still is) wrong.


Using someone's information for a purpose they didn't intend is wrong? What? This just seems obviously mistaken. I don't agree with that at all. I'm not even sure how to argue about that.

Lots of art, science and technology can be considered as "using someone's information for a purpose they didn't intend." It is extremely normal for people to find new uses for things; information is not exempt from this.


> Using someone's information for a purpose they didn't intend is wrong? What? This just seems obviously mistaken.

Yet you would expect that people won't record your private conversations, much less play them back to third parties.

Exchanges of information, whether personal stories or administrative data, occur within some kind of trust about what uses you consent to. Respecting this consent is a social obligation on the receiving end.


> Using someone's information for a purpose they didn't intend is wrong? What? This just seems obviously mistaken.

How? It seems obviously correct. Suppose we trade nudes. You now sell my nudes online without my consent. You bet I have the right to sue you.


I think there's a fundamental difference between information passed in confidence for a specific purpose and information made publicly available with an intended range of purposes. I do not believe that the latter is morally protected from unintended use.


Sure, but as a big discussion in this thread shows, that public/private distinction is quite messy. We don't have just private chats and the old Twitter style where everyone can see everything. Facebook has posts which you can share with only friends, or with even more limited groups. This changes your perceived privacy levels and expectations. We have various group chats with widely different expectations of privacy.

But RTFA, because what it is discussing is not public facebook/twitter posts. I'm not sure why the discussion keeps moving there as if that's the only case. Besides, a public facebook post is still only visible to those on facebook. It is as public as a private country club, even if it's easy to get access. What about what you write in copilot? What about your code on github? In a private repo? The article mentions this is already protected under current law as personal information.

The fundamental problem here is about consent. What would a reasonable person expect? 10 years ago certainly no one expected this. Mass gathering of data was only for the biggest of tech and governments. Now it's in the hands of anyone that wants to train an LLM. Things changed very fast and I think a lot of people are forgetting that.


> The fundamental problem here is about consent. What would a reasonable person expect? 10 years ago certainly no one expected this.

I just don't think that's the right angle. I think if you're in the public sphere, the way that you anticipated your creations to be used just shouldn't matter at all. If you've given something to the public, the public should not be limited to whatever you thought they'd use it for.

And the conversation keeps coming back to that because people keep making that argument. I specifically agree that private Facebook posts are not covered under this argument! However, I also think that consent as a basis for analysis just doesn't apply to public posts.


Why does everything have to be binary these days? We aren't machines. The world is messy and so are we humans. Removing nuance from conversations and ideas does not help anything. The scenario you laid out, continued, leaves no person with privacy anywhere. I specifically don't want to explain because I specifically want more people thinking and thinking hard. To specifically attempt to break from their way of thinking and try to see things from a different point of view. This is important especially when there is disagreement. You don't have to be convinced by my arguments, but you certainly do have to understand them. There definitely isn't enough of that happening these days. Our interconnectedness has only made the world more complicated, not more simple. It has made communication more difficult too, as we talk with many who have far larger differences in priors than we would were we constrained to in person communication.

As for consent, you may disagree, but that's how our entire legal framework works, and I for one am happy about this. Consent is the foundation of any democracy. A world without consent isn't figuratively tyranny, it is literally tyranny. Agree or not, the law does not give you the right to use any information (words spoken, pictures taken, bits, or whatever) that is publicly available/viewable in whatever way you want. There are rules. It is why Taylor Swift can't even post a picture of herself on Instagram and why you can't follow someone around all day taking pictures of them (aka. stalking). There are limitations because we live in a society and I for one think this is a good thing.

Yes, you are right that when someone posts something publicly, people shouldn't be limited to using it only in ways the poster explicitly intended. But neither is that freedom unlimited. That's the point I'm making. The line between public and private is getting blurrier by the day. These laws are always built around a reasonable expectation. Certainly no one who posted on Facebook 10 years ago had a reasonable expectation that their post would be used to train an LLM.


Personally speaking: all the art I consume is remixes. Almost all the books I read are fanfics. If the repurposing of cultural content without the consent of the author becomes illegal, my cultural universe will vanish overnight. As such, I think I'm just on the other side of that fight.

If we have to lose privacy and control in order to keep our participatory culture, I'm sorry, I'll take culture. If it's a multidimensional topic, I'm for tightening privacy in private spaces and loosening privacy in public spaces. If there's only one axis, I'll push the lever towards loose.


> Personally speaking: all the art I consume is remixes. Almost all the books I read are fanfics. If the repurposing of cultural content without the consent of the author becomes illegal, my cultural universe will vanish overnight.

You'd be surprised, because these industries have figured these problems out: what is allowed without a license, what needs royalties, what needs permission, and what can't be done. This was a huge conversation in the 90's, and you can find many discussions about hip hop (around sampling) and copyright law.

Your culture is not in danger, even if you don't know its history. I'd suggest learning it though, because that's the best way to ensure your culture stays out of danger (a culture I am also part of fwiw).


To the best of my knowledge, all claims that fanfiction is in the clear, that it's okay to do derivative works if you aren't charging money, and so on, are all just fandom folklore. If any author wanted to legally shut down fanfic, to my knowledge they could. And I'm against that - hell, in my opinion fanfiction writers should be able to sell their work even without authorial consent. Look at the Touhou scene, look how much cultural production was achieved just by one popular creator being chill about copyright. The idea that you can own exclusive rights to a story or a setting is a historical aberration.


> if you aren't charging money,

Yeah, that's to the best of my knowledge true, but IANAL. I'm with you about the fanfic scenario too. I'm also very open about sampling. But I think these issues are different from the data that we're talking about in this thread. If data can be recovered (and in some ways it can be, in others it can't) then it's not really derivative. A derivative work also needs some distance and can't be too close. And importantly, these data are being used to create a product that is being sold, where the processing of the data is the thing of value.

My point is that the environment has changed and there's a lot of gray area here. Turning this into a binary distinction is unhelpful. There are new nuances here, just as there were new nuances when sampling became popular in hip hop. We need to have open and honest discussions about these things, and I think a lot of discussions we have or observe are rather dismissive of these nuances (coming from both sides of the debate, fwiw). I'm obviously very open to using data, but we must also be aware of our data privacy, how it can be used, our social contracts, and what a reasonable level of a priori consent is. If we overly simplify these conversations then they aren't actually discussions. My point is this: there is gray, and there are very clear cases where you do not have unlimited access to and usage of works that are publicly available. Private ownership is the root of capitalism after all, so it should be rather unsurprising that we have many laws and social contracts over ownership and the extent of what one may do with things they did not create. There certainly is a lot of anger and frustration in these conversations, and I don't expect an artist making their living off of their art to understand all these nuances, nor am I surprised that they are upset and possibly afraid. This is new territory, and pretending it isn't is just as obtuse as calling generators fuzzy copy-paste machines.

But I want to be clear that we can have both goals. We can protect data rights, privacy, fairness AND have this sampling and creativity culture, for lack of better words. We just need to be careful, nuanced, and thoughtful to determine how to do this though. We won't be perfect and won't make everyone happy, but we can maximize social agreement conditioned under fairness and privacy. I just want to ensure that we are not approaching this conversation as that there are clear answers in what can be done with data and what can't be. Hell, we don't even have that answer for music, sampling, or fan fiction. We have answers as to what laws say, but even as you point out, that's ambiguous in many cases without even considering that the environment is not only changing, but changing rapidly. I think we all understand that there is a difference between using the Akira slide compared to the "Ice Ice Baby v Under Pressure" scenario. No one has the answers, and that's why we need to talk. And unfortunately "edge cases" are the norm in topics like these.

Note: I am an ML researcher. I use publicly available data to train models that have images of people, their art, their animals, their property, and such that I'm sure many do not know exists in these datasets. Similarly I do not even know all the data within some of these datasets. But I can still recognize that there is a gray area that exists here and personally I see it as my ethical duty to ensure we have these discussions in an open and honest way to determine the limits of what I should and shouldn't be able to do. It isn't up to me, it is up to our society to create a social contract.


Yeah well, I'm part of society and I've said my opinion on the matter. :P

Sure, it's a continuum. I'd say it's more that, viewed as a continuum, my opinion about the ends of the spectrum is binary, in that I want one end to go up and the other to go down. That doesn't mean I want to split it into two cases so much as that I want to consider it among other things on an axis between two points, "private" and "public", and my opinion is a function of placement between those two points.

Artists are frustrated about fanfic too! They feel genuine ownership of these characters. I've seen people have borderline breakdowns at fanfics. The idea that people are doing things that they didn't intend to "their" characters can be genuinely traumatic to a writer. If you make that a question about consent then fanfic becomes comparable to abuse, and I just don't agree that those should be the terms. Any work placed before the public is inherently participatory. Copyright splits the world into "creators" and "consumers" and I think that's problematic, because there's no such thing as passive consumption of a work. The listener or reader recreates the experience of the work in their head, based on their own assumptions, preconceptions and personality, and if they have an experience or viewpoint that diverges from that of the author, they must be able to share that in turn. So I have a genuine values disagreement with some of what we currently call "rightsholders" in the matter.

What's that got to do with deep learning? Very little: I just see where it leads if you phrase the topic about consent, and it's not a world where unwashed readers can get their dirty mitts on their characters.

(My personal favorite is... closed-source Minecraft (Java) mods. "Do not decompile against my wishes!", the description said. Minecraft mods, meanwhile, are only possible in the first place by decompiling Minecraft's source code against Mojang/Microsoft's wishes. You were this close to learning something!)


Consider the thought experiment that you give your postal address to some business because you want to subscribe to regular grocery deliveries. Then you notice that each delivered package contains a small transparent bag with some powder in it.

Whoever processes others' information can only do so with a clear purpose and for a defined time period according to current EU laws.

If, however, you were referring to information that you published on the Internet for everyone's benefit then you would still need to consider intellectual property rights. In the open source software world we have the licenses that deal with this, and then there are copyright laws protecting content providers (not making a case here whether they are good or not).

I guess what makes a difference is if there is some business involved either in the production or in the consumption side of the equation and if we accept that "machine learning" is the same as "human learning".

EDIT: separated paragraphs, typo


Copyright protects against copying of the original text, but models take gradients. Are gradients protected as well? Even when data is copyrighted it still has legitimate value for training; pure ideas don't get copyright protection, only expression does.


Can the model use the text without making a copy (in memory) to process it?


Technical copies are different; there are exemptions.


Can a human?


Not relevant. What humans do isn't necessarily viewed the same way as what machines do.


Machines need to be prompted with the exact prefix to be able to retrieve any copyrighted fragment, and even that doesn't work most of the time. So the intent for copyright circumvention is in the prompt. Temperature settings also matter.
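
A toy version of such an extraction probe, assuming the Hugging Face transformers library and any small causal LM checkpoint; greedy decoding approximates temperature 0:

    from transformers import pipeline

    generate = pipeline("text-generation", model="gpt2")
    prefix = "It was the best of times, it was the worst of times,"
    out = generate(prefix, max_new_tokens=40, do_sample=False)  # greedy decoding
    # A long verbatim continuation of the source text suggests memorization.
    print(out[0]["generated_text"])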


That could be beside the point, as for training the machines will create a copy in their memory - or is that no longer the case?


I don't believe transient copies are the real issue here. Having a temporary copy stored locally in memory is more of a technical nuance than anything. That copy remains isolated and is not being distributed or shared. In contrast, using something like BitTorrent actively spreads copies across the internet.

There are also cases where loading copyrighted content is unavoidable in order to even view the license terms. For example, a website could deny the right to copy its content before you even see the licensing details. Temporary copies enabled through normal usage like this should be permissible.


I think this really comes down to the specifics of the situation.


For displaying/viewing, any digital device needs to create an in-memory 'copy'. If you extend copyright to that extreme, it becomes completely nonsensical imho.


Not a lawyer, but if you show a streamed movie to an audience without having any rights, how would that be covered? Only by considering the stream provider?


> That was the expectation until Zuckerberg came along.

This isn't my recollection of the pre-Facebook internet.


> That was the expectation until Zuckerberg came along.

The early days of the Internet were full of "information should be free" hackers, many of whom roam this very forum in their 40s, 50s, and 60s.

Well, information is free now. And this is what it looks like - consumed by LLMs. We got what we asked for.


Indeed. There was no expectation of privacy on the internet at the time and THAT is exactly why it was standard to not communicate personal information you didn’t want to share. Pseudonymity carried onto the internet until the Facebook days.

I am still firmly in the Information wants to be free camp, and I agree with you that LLMs scanning it is a logical continuation of it all.


This seems like a fantasy. Anything could end up on 4chan with a photoshopped penis on it since long before Zuck came along.


4Chan came online in 2003. Facebook was 2004. I'm not sure what your point is.


People didn't put their private medical records on the open internet. They may have uploaded them to a service they thought was respecting privacy but was either selling their data without disclosure or was just straight up incompetent and uploaded private-user-data.zip to a public Dropbox share.


> you put that information on the internet, you have no expectation of privacy.

But if Disney puts information on the internet, and you misuse it, lawyers will chase you to the ends of the earth.

We both have the same rights and copyright, but somehow it's one rule for them and another rule for me.


GDPR and many data rights laws exist today, and they are still applicable to AI.


> you put that information on the internet

That isn't necessarily true. Someone else might have put the information on the Internet. That could be someone else uploading a photo of you, or records of your home purchase in a city's public records, or an obituary or wedding announcement that listed you as family, etc., etc.


This is wildly disingenuous. I don't even know where to begin.

Services are advertised as private, so why wouldn't people act like they are private? Oh, you're not tech literate and so don't know something "obvious"? Guess it's your fault for listening to your financial advisor and losing all your money too, then. We live in a specialized world. You don't want to go down this route, and it is unreasonable because we can't have infinite knowledge and infinite time to gain it. There's absolutely no way you know in high detail: tech, medicine, economics, law, science, politics, or any other expertise. You don't even know one of these in full, but rather a small section of one.

Yes, things may be "public," but you can have an expectation of privacy in public. Also, your lack of expectation in public is not infinite. The lack of privacy in public is an astonishingly new phenomenon, realistically within the last 10-15 years, as we've seen cellphones become prolific. Things were way different just in the early 2000s, when there weren't nearly as many cameras, whether CCTV or in people's pockets. Let alone microphones or high-resolution images.

Most importantly, the environment on the internet has changed too. 10 years ago the only people scraping up the entire internet were big tech and governments. Now it's every startup and group of people that can. These are very different situations. In 2015, image generation was still a joke. Goodfellow's GAN paper was posted in late 2014. Less than 10 years ago we were struggling to generate black-and-white faces at 32x32 pixels.

Things are VERY different and if you think you could have predicted this I'm going to call bullshit. Only Captain Hindsight could do that. This kind of thinking is simply just victim blaming and lording your particular expertise over others while ignoring your own ignorance because you let post hoc thinking rewrite your past.


More often than not it's other people putting the information on the internet. Celebrities and politicians have already had to deal with this, but at least they had some compensation/resources to deal with it. We need some protections, or else existing in public spaces will become viable only for the rich. Everyone else will be 'caught' picking their nose on the sidewalk and then fired from their restaurant job or whatever. Not to mention the ramifications for public protest.


That is an assumption. No one has ever agreed on such a concept on paper.


There are a lot of aspects to this though. We write code for solar power plants, some of it is open source or at least publicly available. What if we wrote something that turned out to be bad and it was used by someone else through the AI and it broke their plant… would we be responsible?

Now, you’re probably thinking about this from a reasoning or technical perspective in which case it’ll appear to be a ridiculous concern… because it is a ridiculous concern. That’s not how our legal department sees it though. They see it as risk mitigation, and they actually take it rather serious.


I do have an expectation of privacy. It is possible to secure information if so desired.

Do you expect your online bank to secure your data? Your email provider? Your healthcare provider? So you do expect some privacy. I just expect some more. I believe that companies should not own my data just because they provide me with services off it. From that, consequences follow.

I have a better proposition: let's slap companies that don't respect privacy with fines until they too learn a lesson.


In general, we need a data bill of rights. Corporations are enslaving us digitally, and I feel it's reaching a boiling point.


There are zero incentives for them to comply and zero ways a person can hold them accountable. If your SSN can be leaked and nothing happens, why would they care about scrapable pictures?

The only time they care is when it can cost them money. SD didn't care about visual artists but could not do the same for generative music, since those rights are managed by deep pockets.


Frankly, and not directed personally, but that’s not really true.

Government can make the incentive — and has. Legislation can put the teeth in societal goods like this even if financial incentives don’t.

California has done exactly this with their right to be deleted law.

https://www.foley.com/en/insights/publications/2023/10/calif....


The two parties are the state and the data brokers. The individual has no way to do this on their own. But you're right, it doesn't have to be true. Maybe someone will set a precedent by suing a company. Until then I am convinced it is true.


Fair enough. I respect your logic.


AI might lead to more paywalled sites which would suck.


Depends, will it reduce the number of ad driven sites? Could be a good thing.


Your favorite forum/blog/app might be ad driven.

And it’s not like ads will disappear; if anything, we’ll get AI ads.


There's plenty of AI being trained by hobbyists, artists and enthusiasts too.


[flagged]


I didn't say anything remotely close to what you're implying. What a ridiculous statement. I used the word enslave and used it correctly in this context.

https://www.merriam-webster.com/dictionary/enslave

https://www.merriam-webster.com/dictionary/slavery

a situation or practice in which people are entrapped (as by debt) and exploited; submission to a dominating influence

From Oxford Dictionary: cause (someone) to lose their freedom of choice or action


Is this elementary school, where we ignore the usage and context of words and simply recite the dictionary definition? You were clearly aware of the emotional weight of the term and tried to exploit it to make your argument more impactful.

Can I call my mom a slave-driver now, because she made me clean my room when I was a child? I clearly had no choice but to submit to her demands, since I relied on her for food and shelter!


OP used the word correctly. It doesn’t HAVE to mean specifically “it’s like being a North American slave in that exact period of history” or something.

define:enslaving (2): cause (someone) to lose their freedom of choice or action.


When people hear slavery, they think of harsh, forced physical labor under inhumane conditions. Whether you think of American slavery, African slavery or Roman slavery, the common factor is clear.

I'm from Germany, I'm decidedly not US-centric, but everyone I know would be appalled if someone described any facet of their western life using that word, because they'd be implicitly reducing the impact said word has when it is used to describe the immense suffering of those people experiencing it in its (sadly still present) worst form.


> When people hear slavery (...)

When I hear "enslaving", especially in the context of the sentence "Corporations are enslaving us digitally", I don't think "oh my god this guy is trying to say that our plight is exactly like that of roman slavery", no.

If you hear that, then I submit that this is for you (or rather, dr_dshiv) to fix. "respond to the strongest plausible interpretation of what someone says", as the guidelines say.

Now you in this conversation quite clearly understood the usage of "enslaving" to be its second definition, yet you choose to still criticize it by reducing it to its first definition.

There is no synonym you'd like for "removing one's rights and liberties". If you find one, feel free to suggest it, but my guess is that the people living in slavery now don't really care about this nonsensical argument a couple of well-off dudes are having over language.


> When I hear "enslaving"...

Where have I ever used the word "exactly", friend? My point was that people intuitively recognize the emotional impact and importance of words like "slavery", and that the commenter was exploiting that well-earned impact to artificially inflate the importance of their comment.

> If you hear that...

The other side of good communication is expressing yourself in a clear, non-manipulative way instead of playing on people's emotions.

> Now you in this conversation...

Again there is a difference between understanding the meaning rationally and interpreting the seriousness of a statement intuitively. If I start calling every annoyance I face slavery, then the word will quickly lose most of its meaning and impact, since the dictionary definition is broad enough to apply to an entire range of issues, from tiny to massive.

> There is no synonym you'd like...

How about just using language that is specific to the issue being discussed? We're talking about a company retaining some personal information, how about we call that a violation of privacy? "Removing one's rights and liberties" as a general statement implies a high bar of seriousness, of significant violations of significant rights. I don't need to call myself a "survivor of human rights abuses" because Google stored some metadata about my shopping habits. I can simply say that my privacy was not being respected.

And again, I never made the point that enslaved people may feel offended by a post on Hackernews, I argued that this type of inflationary usage destroys our (as in well-off dudes who have the ability to help people) ability to intuitively judge the seriousness of an issue. It's the same with words like "racism", if we call every guy who sings along to rap music and uses the n-word a racist, then we implicitly support actual racists by compromising the seriousness of the term.


The complete lack of power or ownership of your work.

Do you think anyone consented to have AI companies scrape GitHub, etc.?


Eh, you borrowed from society to create 'your work' and now you want to lock it up like you've created it wholly.


By that standard no copyrighted works have any protection, except clearly that isn’t the world we live in.


How I wish it was


Absolutely, I am happy to give back to society if the AI overlords do the same - make the models and datasets open source.

Kinda hypocritical to infringe on copyright while expecting protection themselves.


No, it's the generative AI companies that are taking from society all the data and locking it up in their opaque models and selling that as if they created it wholly.


They consented if the data was public. Opt-out scanning practices are a whole other beast.

But unfortunately, unless you challenge the ToS in court, you consented the moment you didn't delete all of your data after the policy update. Is that fucked? Sure, and maybe we need regulation around that, but in the face of current legislation, consent was granted.


That’s factually incorrect. GitHub’s ToS don’t give them any rights to use your information in this way.

Their argument is they don’t need any such permission, but they can profit from this data by selectively allowing large scale scrapping by 3rd parties thus breaking their TOS by selling your data.


GitHub's ToS expressly allows them to change the terms of their service at will, and they do indeed publish these changes, and your continued use of their website is considered consent.

Furthermore, as [0] states:

> We may use your information to provide, administer, analyze, manage, and operate our Service. For example, we use your information for the following purposes:

> ...Improve and develop our products and services including to develop new services or features, and conduct research...

[0] https://docs.github.com/en/site-policy/privacy-policies/gith...


They didn’t limit copilot to active users who therefore have implied consent to ToS changes.

“Improving and developing products and services” doesn’t cover what they did. Which is why that line isn’t being brought up in the ongoing litigation.

Further, at scale they are well aware users didn't own the copyright to all uploaded data, making their ToS insufficient shielding for this use. Which is why they aren't even trying to suggest the ToS enabled this kind of use, as that suggests it would be required and opens them up to massive amounts of liability.


Failing to actively say "no" is not the same thing as saying "yes". If your point is that the law means nothing unless someone is willing to file lawsuits, your point is taken.


I just finished getting downvoted in another thread for writing these very words. Couldn’t agree more.


Legally they can't use your data without your consent, and that data must be deletable at your demand. So it's bad for GenAI to use personal information, and they should change its design. I'd advise obfuscating the data so it doesn't contain sensitive information if they intend to build AI around such data. Identifiable information is not a good thing.

GDPR and many other laws still apply to GenAI.
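
A minimal sketch of the kind of scrubbing being suggested, using regexes only; real pipelines add NER models on top and still miss things (the patterns here are illustrative):

    import re

    PATTERNS = {
        "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
        "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "PHONE": re.compile(r"\b(?:\+?\d{1,2}[\s.-])?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
    }

    def scrub(text):
        # Replace each match with a typed placeholder, e.g. "[EMAIL]".
        for label, pattern in PATTERNS.items():
            text = pattern.sub(f"[{label}]", text)
        return text

    print(scrub("Reach me at jane@example.com or 555-123-4567."))
    # -> Reach me at [EMAIL] or [PHONE].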


Now's a good time to scrub your online identity before things get worse.



