Digital Tar Pits - How to Fight Back Against A.I. - YouTube

Description

A new movement aimed at poisoning A.I. models like ChatGPT has gained traction as hackers attempt to trap these models in a never-ending ‘Tar Pit’ of nonsense. After reading an Ars Technica interview, I tracked down a hacker developing tools to poison AI training data. Tools such as ‘Nepenthes’ are designed to confuse and corrupt the models that scrape the internet for their training data. But can we really stop A.I. from turning the web into a mess of low-quality, regurgitated slop?

Ars Technica Interview: https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/
NEPENTHES: https://zadzmo.org/code/nepenthes/

Voice Over: Spencer Devlin Howard

💪 JOIN [THE FACILITY] for members-only live streams, behind-the-scenes posts, and the official Discord: https://www.patreon.com/kylehill

👕 NUCLEAR WASTE WARNING MERCH OUT NOW! https://shop.kylehill.net

🎥 SUB TO THE GAMING CHANNEL: https://www.youtube.com/channel/UCfTNPE8mXGBZPC1nfVtOJTw

✅ MANDATORY LIKE, SUBSCRIBE, AND TURN ON NOTIFICATIONS

📲 FOLLOW ME ON SOCIETY-RUINING SOCIAL MEDIA:
📷 https://www.instagram.com/sci_Phile/

😎: Kyle
🎬: Charles Shattuck
🎞: Kevin Onofreo
✂: Nate Berger
📝: @adef
🤖: @clairemax
🎨: Thorsten Denk https://www.z1mt.com/
🎼: @mey
🎹: bensound.com
🎨: Mr. Mass https://youtube.com/c/MysteryGiftMovie
🎵: freesound.org

Transcript

0:00

last year I said that the internet was

0:02

dead Before that I called what the

0:05

internet has become a dark forest a

0:07

siloed and dangerous place What was once

0:11

catalogs and travel blogs circa 1999 does

0:14

now seem to be both dark and dead I

0:18

pointed the finger at Generative AI the

0:20

revolutionary technology that was

0:22

supposed to make all of our lives easier

0:24

but instead heaped an incomprehensible

0:27

amount of slop upon us I'll admit I'm

0:31

pretty fatalistic about it However there

0:34

are now a small group of hackers raging

0:36

against the machine developing tools

0:38

intended to defund and depose the AI

0:41

systems that are sucking us dry The

0:44

internet as it exists now is a

0:46

panopticon trying to sell what it sees

0:48

And what it sees is you Human users are

0:52

a farmed product AI technology is

0:54

clearly in a bubble I want to pop the

0:57

bubble The watershed moment was the

1:00

public release of ChatGPT It was a

1:02

large language model trained on without

1:04

exaggeration most of the published text

1:07

that exists on the internet Millions of

1:09

books millions of hours of transcribed

1:12

YouTube videos millions of websites all

1:16

of Wikipedia The model's surprising

1:18

success at replicating humanlike speech

1:21

told every other AI company that if you

1:23

want to compete in this space you'll

1:24

need a lot of data too Once regular

1:27

users realized that this meant the rest

1:29

of the internet even copyrighted

1:31

material the backlash began Hundreds to

1:35

thousands of the most visited most

1:37

maintained websites started adding an

1:39

old technology to their pages

1:42

robots.txt a voluntary compliance

1:44

standard that would restrict access to

1:46

automated AI data scrapers if the

1:49

standard was followed All of this

1:51

happened very quickly in direct response

1:53

to AI scrapers and their voracious

1:56

appetites Of course almost just as

1:58

quickly the world's top two AI startups

2:01

started ignoring

2:03

robots.txt This apparently was the final

2:06

straw for a few anonymous hackers It's

2:08

time according to them to show the world

2:11

that generative AI is all just smoke and

2:14

mirrors I want to pop the bubble

2:16

sabotage to cost them money None of the

2:19

AI companies are profitable and

2:21

poisoning their models to hurt

2:22

performance and hopefully spook

2:24

investors is the goal I decided to make

2:27

this piece after reading this Ars

2:28

Technica interview with a one Aaron B

2:31

which I suggest you read in full link in

2:33

the description who used an old cyber

2:35

security tactic known as tarpitting to

2:38

trap AI data scrapers who weren't just

2:40

digesting everything in sight on

2:43

websites They were also pinging some

2:45

sites millions of times a day which

2:48

costs the website developers time money

2:51

and intellectual property The tool Aaron

2:53

created is called Nepenthes after the

2:56

carnivorous plant that digests anything

2:58

unlucky enough to slip inside it Having

3:00

been worried about this problem for a

3:02

while now I reached out to Aaron to

3:04

learn more about what they were doing

3:06

The quotes that follow are Aaron's words

3:08

via our email exchanges read aloud by an

3:10

actor as what they have created is

3:13

deliberately malicious software intended

3:15

to cause harm First what is a digital

3:19

tarpit the first of these programs was

3:22

made by Tom Liston who named it LaBrea

3:24

after the real tarpits in Los Angeles

3:27

Tom was trying to stop a worm from

3:29

scanning his IP non-stop Instead of

3:31

doing something illegal Tom decided to

3:34

try to slow down the worm enough such

3:36

that its spamming would be less

3:38

effective and therefore less desirable

3:40

as a tactic Quote "Now you have a chance

3:43

to make their life more difficult." Tom

3:45

explained Aaron B and Nepenthes have a

3:48

similar if a bit more focused goal

3:52

"Instead of rolling over and letting

3:53

these do what they want make

3:55

them have to work for it instead."

3:58

Like real tarpits the goal of Nepenthes

4:01

is both to slow down AI scrapers like

4:03

the LaBrea program and to trap them on a

4:06

website indefinitely like some helpless

4:08

prehistoric megafauna This costs the AI

4:11

company time It wastes their money and

4:13

it poisons their data sets not just with

4:16

raw data but with

4:18

nonsense This is how it works It's like

4:21

an infinite maze or a hall of mirrors

4:24

You dedicate part of your website's URL

4:26

space a directory or folder to

4:29

Nepenthes and make a link to it from

4:31

somewhere else on your site Then it

4:33

generates a randomized page for just

4:35

that URL that looks like a normal

4:37

website and not something malicious But

4:41

every link generated in one of those

4:43

pages links right back into the tarpit
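
The maze idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Nepenthes itself is written: the `/maze/` prefix, the word list, and the `page_for` helper are all hypothetical, and seeding the random generator from the URL is one guess at how a page could be "randomized for just that URL" while staying the same on every visit.

```python
import hashlib
import random

PREFIX = "/maze/"  # hypothetical URL space handed over to the tarpit
WORDS = ["amber", "basin", "cedar", "delta", "ember", "fjord", "grove", "heron"]


def page_for(path, n_links=5):
    """Build an HTML page for `path` whose links all stay under PREFIX."""
    # Assumption: seed from the URL so the same address always yields the same page.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    links = []
    for _ in range(n_links):
        # Each link target is a made-up slug that points back into the maze.
        slug = "-".join(rng.choice(WORDS) for _ in range(3))
        links.append(f'<a href="{PREFIX}{slug}">{slug}</a>')
    body = "\n".join(f"<p>{link}</p>" for link in links)
    return f"<html><body>\n{body}\n</body></html>"


print(page_for("/maze/start"))
```

Because every generated link begins with the same prefix, a crawler that follows any of them only ever receives more generated pages.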

4:46

This is a demonstration of Nepenthes

4:48

working in real time According to Aaron

4:50

having the page load with a speed of a

4:52

Roadrunner modem in 1997 is on

4:56

purpose The deliberate slowdown has two

4:58

purposes First to prevent the crawler

5:01

from overloading your server The delay

5:03

also hurts the crawler It takes a

5:06

computer's resources to listen for the

5:08

response which while the AI companies

5:10

have massive amounts they still don't

5:12

have an infinite amount And now they are

5:15

wasting some of that waiting to hear

5:17

from the target You'll also notice the

5:19

page slowly fills in Nepenthes is

5:22

trying to send the crawler just barely

5:25

enough to stay on the line instead of

5:28

disconnecting So an AI comes to your

5:31

website not knowing that it's tiptoeing

5:33

around the rim of a carnivorous plant

5:36

Eventually it will find the link you've

5:37

made to the tarpit Once inside there's

5:41

no way out Every single link sinks it

5:44

deeper into the sticky muck Aaron says

5:47

that eventually the download queue for

5:49

the crawler is almost entirely web pages

5:51

from the pit Pages that outnumber the

5:54

legitimate pages on your entire website
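
The deliberate slowdown described earlier, where the page fills in a little at a time, might look roughly like the sketch below. The chunk size and delay are placeholder values, and `drip` is a hypothetical helper rather than anything taken from Nepenthes; a real server would write each chunk to the client connection instead of discarding it.

```python
import time


def drip(payload, chunk_size=16, delay=0.5, sleep=time.sleep):
    """Yield `payload` a few bytes at a time, pausing before each chunk."""
    for i in range(0, len(payload), chunk_size):
        sleep(delay)  # keep the crawler waiting on an open connection
        yield payload[i:i + chunk_size]


page = b"<html><body>" + b"babble " * 20 + b"</body></html>"
for chunk in drip(page, chunk_size=32, delay=0.01):
    pass  # a real server would send `chunk` to the client here
```

The `sleep` parameter is there only so the pause can be swapped out when experimenting; the point is that the total transfer time grows with the number of chunks while the server itself does very little work.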

5:57

Nepenthes isn't just an infinitely deep

5:59

well for AI to fall into It's a poisoned

6:02

well tainted with so-called Markov babble

6:06

The backlash against AI scrapers isn't

6:08

so much about the web crawling in

6:10

general That's how we search the

6:12

internet in the first place It's about

6:14

knowing that your data is being fed into

6:16

something that will turn some of your

6:17

creativity into slop So Aaron added an

6:21

old program to Nepenthes that makes the

6:23

tar pit bubble up babble instead of

6:26

useful data A Markov babbler is a very

6:29

simple text generation algorithm often

6:31

used as a computer science or

6:33

programming lesson The goal is

6:35

generating new text that is

6:37

statistically identical to the source

6:39

material it's trained on Aaron gave me

6:42

this example Start with a sentence like

6:44

and this and that The Markov babbler

6:47

wants to generate text based on this

6:49

training data that is statistically

6:50

identical 50% of the time the word and

6:53

is followed by the word this and 50% of

6:56

the time it is followed by the word that

6:58

100% of the time this is followed by and

7:01

100% of the time that is the end of the

7:04

sentence So give the babbler the word

7:06

this and it's very likely to spit out

7:08

something like this and that and this

7:10

and that Aaron tells me now imagine that

7:13

instead of giving it a single sentence

7:16

very similar to large language models

7:18

you give it a lot of text to reference

7:20

then you prompt it and if it's only

7:22

trying to output similar sentences and

7:24

not answer questions like ChatGPT what

7:27

you get back is nonsense
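
Aaron's "and this and that" example can be reproduced with a minimal word-level Markov babbler. This is a sketch for illustration, not the generator bundled with Nepenthes; the function names and the `END` marker are made up here.

```python
import random
from collections import defaultdict

END = None  # hypothetical marker meaning "the sentence stops here"


def build_table(text):
    """Map each word to the list of words observed right after it."""
    words = text.split()
    table = defaultdict(list)
    for current, following in zip(words, words[1:]):
        table[current].append(following)
    if words:
        table[words[-1]].append(END)  # the last word is followed by the end
    return dict(table)


def babble(table, start, max_words=20, rng=random):
    """Walk the table from `start`, picking each next word at random."""
    out = [start]
    word = start
    while len(out) < max_words:
        choices = table.get(word)
        if not choices:
            break
        word = rng.choice(choices)
        if word is END:
            break
        out.append(word)
    return " ".join(out)


table = build_table("and this and that")
print(table)
print(babble(table, "this"))
```

The table built from that one sentence matches the odds in the example: "and" is followed by "this" or "that" with equal chance, "this" is always followed by "and", and "that" always ends the sentence. Feed it a large body of text instead and the output keeps the local word statistics while losing any overall meaning.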

7:30

This is the tar that Nepenthes feeds to

7:33

any AI ignoring robots.txt on your

7:36

website There isn't much new in

7:38

Nepenthes Honestly dropping the Markov

7:41

generator in was a whim for the fun of

7:43

it I'm unsure how effective it is as a

7:45

poison but who cares let's go I told

7:49

Aaron in our email exchanges that

7:51

frankly it feels like there's some Fight

7:53

Club style anarchy to these tools Why

7:56

poison them all why burn it all down

7:59

surely there is a less malicious way to

8:01

do this They responded like many

8:04

observers of the tech space are doing

8:05

right now that something has to show

8:08

these giant companies that this space is

8:10

a bubble and it's about to burst AI

8:13

technology is clearly in a bubble It

8:16

doesn't work It's spicy autocomplete

8:18

There's no cognition there It's just

8:21

attempting to guess the most probable

8:22

next word That's where these

8:24

hallucinations and other they

8:26

spit out comes from Humans have an

8:28

innate tendency to assume anything

8:30

showing enough entropy to have free will

8:32

or consciousness LLMs are pretty much

8:35

optimized to trigger it and that's all

8:37

that's happening Do you think a Markov

8:39

babbler will ever be sentient why would

8:41

an LLM ever reach general AI when a

8:43

Markov model cannot it's all just smoke and

8:46

mirrors Aaron isn't the only one making

8:49

these tarpits In their interview Ars

8:52

Technica also spoke to others who are

8:53

cross-pollinating their ideas with the

8:55

digital Nepenthes plant Another is

8:58

called Iocaine after the famous poison

9:00

from The Princess Bride But ultimately

9:03

right now Nepenthes Iocaine and other

9:05

tarpits are old tools being used to

9:08

fight a new giant enemy It's inevitable

9:11

that billion-dollar efforts to scrape data

9:14

from every corner of the internet will

9:16

adapt to and/or avoid these pits entirely

9:19

But even after his program gained decent

9:21

traction on social media Ars Technica and

9:24

other spaces for Aaron full-scale

9:27

digital warfare isn't the goal This was

9:30

less sounding an alarm and more of a

9:32

whim that took on a life of its own It's

9:35

kind of art honestly just an expression

9:38

of rage at what the internet has turned

9:40

into Lashing out because I can

9:43

regardless of effectiveness I will

9:45

definitely continue to maintain

9:46

Nepenthes as long as it takes I'm

9:48

guesstimating via downloads somewhere

9:50

around 50 to 250 instances of Nepenthes

9:54

are currently out there in the wild I

9:56

think it's pretty cool People like it

9:58

and are using it I have included a link

10:00

to Nepenthes in the comments below It

10:02

comes with many warnings and it doesn't

10:05

mince words Its goal is to accelerate AI

10:08

model collapse and see them burn Do not

10:11

deploy Nepenthes if you aren't fully

10:13

comfortable with what you're doing My

10:15

choice in highlighting this work could

10:16

also be dangerous But I wanted to speak

10:18

with Aaron because I'm worried about the

10:20

same things The dead internet

10:23

enshittification AI slop brain rot Like

10:27

social media before it in its current

10:29

iteration AI is a grand social

10:31

experiment being performed on us without

10:33

our consent I don't want to move fast

10:36

and break things when what's broken is

10:38

the social contract Tools like

10:40

Nepenthes are an opening salvo in the

10:43

coming battle for our digital lives

10:46

We can fight back even if we don't

10:49

succeed Be indigestible Grow spikes

10:54

Until next time

11:01

[Music]
