Whilst spidering is nothing to worry about (and only to be expected on a public site), the way the association fires off legal threats based on the spider's results alone seems wrong. Since the spider does not actually look at the whole title of a file, or even its content, I figured I could have some fun at their expense:
What if I could write a `tarpit' script that creates a large number of interlinked, automatically generated web sites? If their spider tried to scan my server, it would be fooled into thinking that it had found a treasure trove of MP3 sites. Anybody who took the time to look at the site could see that it contains no pirate content at all.
How might the RIAA react to such a thing?
- They could upgrade their spider so that it only recognises valid track names that are in fact real MP3s (e.g. it would know that `elephant_wiggle-Madonna.mp3' is not a real Madonna song). This would limit them to detecting only correctly named MP3 files, and force them to use their spider responsibly.
- They could hand-check every single suspect site to verify that a genuine breach of copyright has taken place. This would substantially decrease the return on investment for their spidering project because it would be labour-intensive, again forcing a more responsible approach to detecting offenders.
- They could blacklist my server to prevent their spider from looking at it in future. That would be at least a small victory. If they blacklisted enough servers it would be the same as giving up!
- They could send me a legal nastygram instructing me to disable my tarpit. Since I do not live in the USA, this might not be enforceable.
How it works
The Pit of Confusion is a pure PHP script that can automatically generate a very large number of web-sites with links to MP3s. It includes a settings file containing lists of famous artist names and random words that can be used to make silly song titles. There is also a download manager component, designed to deliver MP3 files in the most inefficient way possible.
As with any web-site, the action starts with a URL. Normally, the first part of the URL just signifies the server on which the site runs; however, I have used a Dynamic DNS service to encode the two key site parameters into the hostname. I learnt that trick from this website. The first two parts of the domain name tell the script how to build the page. If you visit:
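To give a feel for how word lists like these can turn into silly song titles, here is a minimal sketch in Python (the real script is PHP, and this is not its code). The lists `ARTISTS` and `WORDS` and the functions `silly_title` and `track_list` are made-up names for illustration; the `word_word-Artist.mp3` pattern is only guessed from the `elephant_wiggle-Madonna.mp3' example.

```python
import random

# Hypothetical settings, standing in for the script's settings file.
ARTISTS = ["Madonna", "Elvis Presley"]
WORDS = ["elephant", "wiggle", "custard", "tractor", "sideways"]


def silly_title(artist, rng):
    """Build a nonsense filename in the style 'word_word-Artist.mp3'."""
    first, second = rng.sample(WORDS, 2)  # two distinct random words
    return "{}_{}-{}.mp3".format(first, second, artist.replace(" ", "_"))


def track_list(artist, count, seed):
    """Return `count` silly titles; the same seed gives the same list."""
    rng = random.Random(seed)
    return [silly_title(artist, rng) for _ in range(count)]


if __name__ == "__main__":
    for title in track_list("Madonna", 5, seed="ricky"):
        print(title)
```

Seeding the generator per site is one possible design choice: it would make a given page show the same track list on every visit, which looks more like a real site than a page that changes on each reload.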
It will show you `Ricky's' Madonna page. The script does not know anything about Madonna or any of her songs - it just uses information provided at run-time to set up the basic variables. Anything in the form of a.b.music.stodge.org will get handled by the same server.
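The hostname trick can be sketched roughly as follows, in Python rather than the script's PHP. The function name `parse_host` is invented for this example, and the example hostname assumes the owner comes first and the artist second, which is only a guess at the real ordering.

```python
def parse_host(host, base="music.stodge.org"):
    """Split a hostname of the form a.b.<base> into its two leading labels.

    Returns (a, b), or None if the hostname does not match that form.
    """
    host = host.split(":", 1)[0]    # drop an optional port such as ':80'
    host = host.lower().rstrip(".")  # normalise case and a trailing dot
    suffix = "." + base
    if not host.endswith(suffix):
        return None
    labels = host[: -len(suffix)].split(".")
    if len(labels) != 2 or not all(labels):
        return None
    return labels[0], labels[1]


if __name__ == "__main__":
    # Hypothetical hostname; the label order is an assumption.
    print(parse_host("ricky.madonna.music.stodge.org"))
```

In a real deployment the value would come from the request's Host header, with a wildcard DNS record pointing every such name at the same server.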
Notice how slowly the page loads - that is because there is a configurable `annoying delay' built into each transaction. Assuming that the spider system has a fixed maximum number of threads, it makes sense to tie these up for as long as possible - but not so long as to deter a person wishing to verify that there are no pirated files on the site.
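A configurable delay of this kind might look something like the sketch below (Python, not the original PHP). The constant `ANNOYING_DELAY_SECONDS`, its value, and the function `serve_with_delay` are all assumptions made for illustration.

```python
import time

# Hypothetical default; the real script's setting and value are not known.
ANNOYING_DELAY_SECONDS = 5.0


def serve_with_delay(build_page, delay=ANNOYING_DELAY_SECONDS, sleep=time.sleep):
    """Wait `delay` seconds, then build and return the page.

    `sleep` is injectable so the delay can be skipped or recorded in tests.
    """
    sleep(delay)
    return build_page()


if __name__ == "__main__":
    print(serve_with_delay(lambda: "<html>...</html>", delay=1.0))
```

The delay is the one knob worth tuning by hand: long enough to hold a spider thread, short enough that a human checking the site does not give up.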
Next it builds up a list of randomly named MP3 links that include the chosen artist's name in the title. If you try to click on the link, instead of delivering a pirated file it sends a non-copyrighted music file via a download manager that ensures that the download will take a very long time. The idea is to tie up as many threads as possible on whatever system is doing the spidering.
Finally it makes some links to a selection of other random sites produced by the same system. The idea is to keep the spider in the tarpit for as long as possible.
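One way a deliberately slow download could work is to drip the file out in tiny chunks with a pause before each one. The sketch below is in Python, not the script's PHP, and the name `drip`, the chunk size and the pause length are all assumptions rather than the real download manager's settings.

```python
import time


def drip(data, chunk_size=64, pause=1.0, sleep=time.sleep):
    """Yield `data` in small chunks, pausing before each chunk is sent."""
    for start in range(0, len(data), chunk_size):
        sleep(pause)
        yield data[start:start + chunk_size]


if __name__ == "__main__":
    payload = b"\x00" * 200  # stand-in for a non-copyrighted music file
    for chunk in drip(payload, chunk_size=64, pause=0.1):
        print(len(chunk), "bytes sent")
```

With these example numbers a 3 MB file would take over 13 hours to arrive, so in practice the chunk size and pause would need tuning against whatever timeout the spider uses.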
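The cross-linking step could be approximated like this (a Python sketch, not the PHP original). The function `neighbour_links`, the owner and artist lists, and the owner-then-artist label order are hypothetical; only the a.b.music.stodge.org shape comes from the description above.

```python
import random


def neighbour_links(owners, artists, count, rng, base="music.stodge.org"):
    """Return `count` URLs pointing at other generated sites on the same server."""
    links = []
    for _ in range(count):
        owner = rng.choice(owners)
        artist = rng.choice(artists)
        links.append("http://{}.{}.{}/".format(owner, artist, base))
    return links


if __name__ == "__main__":
    rng = random.Random()
    for url in neighbour_links(["ricky", "bob"], ["madonna", "elvis"], 3, rng):
        print(url)
```

Because every combination of labels resolves to the same server, even short lists give the spider a large web of pages to crawl.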
This is just my first attempt. No doubt, by now more talented scripters can see weaknesses in my plan - this is why I intend to share the source code of my project with anybody who wants it. If you want to help out, please leave a message on this board and I will get back to ya!