Kuro5hin.org: technology and culture, from the trenches

Hacking Google Print

By isometrick in Internet
Tue Mar 08, 2005 at 05:13:53 AM EST
Tags: Software

Many people are curious about the inner workings of Google, but Google is mostly interested in keeping them a secret. So, any information we can glean comes from "black box" analysis. Recently, I wrote a short article that explains how I wrote some simple code that can instantly create PDFs of entire books from Google Print. Maybe some k5ers have more ideas on how to expand the concept, but, as you'll soon find out, I probably can't help in the efforts. Read on ...


Introduction

Many people are interested in how Google works, and Google is mostly interested in keeping it a secret. I'm going to tell you a few things I've learned about Google by playing around with their software. It's not terribly advanced, but I think it's interesting nonetheless. The first thing I will cover is Google's cookie, and then I will explain how I used this information to exploit Google Print.

Google's Cookie

Most web browsers allow small text files, called cookies, to be stored on behalf of web servers ... this allows a persistent state to be associated with a user. After a cookie is created, it will be sent back to the web server every time you request a page (but only when you request a page from the server that originally set the cookie). For example, when you set your SafeSearch preferences on the Google web site, it stores your choice in the Google cookie. Then, whenever you request a page from Google, it can see what you set your preference to earlier and use it without having to ask you again. If you delete your cookie, you'll just get a new cookie the next time you visit ... but you'll have to set your preference again. Pretty useful, huh?

Google does some more interesting things with its cookie, though. Some of them are hard to figure out. The first thing to notice is that your cookie will store some preferences locally, like SafeSearch, because Google probably doesn't care if you see that information (it won't bother you to see that they are storing your preferences). Otherwise, they probably have a server side system that uses the following characteristics of the cookie to store more .. ahem .. personalized information about you.

Here is an example of a Google cookie:

GPREF=ID=26b2149fe108b391:TM=1109736400:LM=1109736400:S=pbbDWyL8tVmJrILc

You can see that after "GPREF=", there are name-value pairs ID, TM, LM, and S (separated by colons). In this case, our ID is 26b2149fe108b391. This is a (hopefully) unique ID, and it is most likely generated randomly. Google probably doesn't worry about "collisions" (two users getting the same ID) because this is a 16-digit hexadecimal number, and there are 16^16 = 2^64 = 18446744073709551616, or about 1.8 x 10^19, possible IDs that could be assigned. Even if everyone on the planet used Google, the chance of any particular user's ID colliding with someone else's would be very low. Google's cookie has an expiration date of January 17, 2038. Essentially, unless you purposely clear your cookies, format your hard drive, etc., this means it will be with you for a very long time.
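To put rough numbers on the collision argument, here is a small Python sketch. The six-billion user count is an assumption picked for illustration, and the birthday-style estimate assumes IDs are drawn uniformly at random, which is only a guess about how Google generates them.

```python
import math

ID_SPACE = 16 ** 16          # 16 hex digits -> 2**64 possible IDs
USERS = 6_000_000_000        # assumed user count (illustrative only)

# Chance that one particular user's ID matches any other user's ID.
per_user = (USERS - 1) / ID_SPACE

# Birthday-style approximation of at least one collision anywhere:
# p ~= 1 - exp(-n*(n-1) / (2*N)), assuming uniformly random IDs.
any_collision = 1 - math.exp(-USERS * (USERS - 1) / (2 * ID_SPACE))

print(ID_SPACE)              # 18446744073709551616
print(f"per-user collision chance: {per_user:.2e}")
print(f"any collision at all:      {any_collision:.2f}")
```

Under these assumptions an individual user would essentially never share an ID with anyone, even though at that scale a collision somewhere among all users is not out of the question.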

The TM value is a timestamp of the moment (to the second) that Google generated your cookie. Here it is 1109736400, measured in seconds since January 1, 1970 (UTC), which corresponds to March 1, 2005 at 10:06:40 PM (CST).
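The conversion is easy to check with Python's standard library. This is a minimal sketch; `decode_tm` is a made-up helper name, and CST is modeled as a fixed UTC-6 offset with no daylight-saving handling.

```python
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=-6), "CST")   # fixed offset, no DST handling

def decode_tm(tm: int) -> datetime:
    """Interpret a TM/LM value as seconds since the Unix epoch."""
    return datetime.fromtimestamp(tm, tz=CST)

print(decode_tm(1109736400).isoformat())   # 2005-03-01T22:06:40-06:00
```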

LM seems unimportant because it is a timestamp of when the user last changed their preferences. Many other name-value pairs can appear, but the only others that I have seen represent more preferences. Having the unique ID means they are most definitely storing *something* on the server side, but don't worry, it's probably only analyzed in aggregate unless you are one of Sergey's ex-girlfriends :-p.

Now, S is the most interesting value in the cookie. Some have hypothesized that it is a checksum of some sort. It could be a hash, for instance. In my experience, the signature only varies with different ID and/or TM values. Thus, Google is assured that THEY generated the cookie at a given time by doing a simple calculation of the hash. But relying on a pure hash would be security through obscurity, i.e. Google would basically be relying on the secrecy of the hash function. Instead, I think that Google probably uses a digital signature algorithm of some kind to generate it. So, maybe S stands for signature. It appears that the signature is 16 characters long, case-sensitive, and alpha-numeric only, giving (10+26+26)^16 possibilities, or roughly the equivalent of a 95-bit hash (not incredibly strong by today's standards, but definitely a good chunk of hash). I tried my luck at guessing a hash function and mapping parts to base 62 numbers, but I just don't think that they are stupid enough to do it that way. Sucks for me, because I'm no Bruce Schneier when it comes to cryptography. My instinct is that an attack against the signature would be futile.
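The size of that signature space is a one-line calculation. The sketch below assumes every character is chosen uniformly at random, which a real signature scheme need not do; the 64-symbol variant is included because the cookie-parsing regex later in the article also allows "-" and "_" in S.

```python
import math

def bits(alphabet_size: int, length: int) -> float:
    """Bits of information in a uniformly random string."""
    return length * math.log2(alphabet_size)

alnum = bits(10 + 26 + 26, 16)      # digits + lowercase + uppercase
with_dash = bits(64, 16)            # if '-' and '_' are also allowed
print(round(alnum, 1), with_dash)   # 95.3 96.0
```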

Now, the payoff. Well, after another explanation that is :) What are some reasons that Google needs to know you by an ID and when your cookie was created?

Google Print

Google Print URLs are of the form:

http://print.google.com/print?id=VvBRboW2icUC&pg=1&sig=hoLj_9Ot12vG6mSjZvK547vbP3E

Anything look familiar here? Another signature! Maybe this one is generated in a similar manner (then again, maybe not ... they are probably different teams). The ID in the URL points to the book that you are viewing, and PG points to the page number. Now click the "Next Page" arrow. You'll get a URL like:

http://print.google.com/print?id=VvBRboW2icUC&lpg=1&pg=2&sig=gBBbI6T0FzHxgVeJJQKQqmZ_MNk

The signature changes when you change pages, and LPG points to the page you started from! Eventually, you will not be able to advance through the pages any more. Google wants to limit the number of pages you can scroll through so that you can't read an entire book for free (that would make the publishers very unhappy, and here's my sad face for it :(). Try removing LPG and going to the resulting URL. You'll get a "page not found". So, apparently, the signature depends on the page you entered on, the page you are at, and the book you are viewing. This allows Google to impose their "page lookahead/lookbehind" limit.
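Pulling the two example URLs apart makes the comparison concrete. This is a small sketch using only `urllib.parse`; `print_params` is a made-up helper name, and the sig values are written as single unbroken tokens.

```python
from urllib.parse import parse_qs, urlsplit

def print_params(url: str) -> dict:
    """Return the single-valued query parameters of a Google Print URL."""
    query = parse_qs(urlsplit(url).query)
    return {name: values[0] for name, values in query.items()}

first = print_params(
    "http://print.google.com/print?id=VvBRboW2icUC&pg=1"
    "&sig=hoLj_9Ot12vG6mSjZvK547vbP3E")
second = print_params(
    "http://print.google.com/print?id=VvBRboW2icUC&lpg=1&pg=2"
    "&sig=gBBbI6T0FzHxgVeJJQKQqmZ_MNk")

print(first["id"] == second["id"])       # True: same book
print(first["sig"] == second["sig"])     # False: signature differs per page
print(second.get("lpg"), second["pg"])   # 1 2
```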

You may see a search box on the side of the Google Print page to search within the book. Funny enough, you can use this to search for page numbers and skip through parts of the book. However, it will eventually hit a hard limit based on your unique ID ... i.e. you've viewed too many pages overall in this book; nothing to see here, please move along. Google probably already knows you can skip around pages like this because the search box doesn't appear unless your cookie is 24 hours old or more! Try it: if you have the search box now, delete your cookie and refresh the page. The box will disappear! If you saved some of the URLs for the search results of the search-in-book feature, they will also not work! Wait 24 hours or so and try again. Now it works. Here's where the timestamped cookie comes into play. This way, if a user hits the hard limit, they cannot clear their cookie and come back instantaneously to leech more pages.

So recently I wrote some software to grab and store up a bunch of cookies, keep them for more than 24 hours, and then automate searching for pages by this method. If I wanted to view page 100, the software would search for it and attempt to extract the image with a regular expression. If that didn't work, it would search for page 99 and extract the "next page" link to get to page 100. It would continue doing this for pages 101, 98, and 102 until it found the correct page. Whenever a cookie hit the hard limit, I'd replace it with a new cookie from the queue. By grabbing the "next" and "previous" links automatically in this "inductive" fashion and using the search for skipping, I could view an entire book on Google Print with one click every time. I later modified the software to spit out a PDF of the book.

I used simple components like GoogleCookie (cookie with accessible properties), GoogleCookieOven (queue with "baking time", i.e. it only pops when the head of the queue is old enough to get the ability to search), and GoogleCookieBaker (thread that keeps the oven full of baking cookies by querying Google for new ones when the number drops below a certain threshold). Theoretically, if you set the cookie limit to a high enough number, the new cookies being fed in will have aged enough by the time you need them. This is a lot simpler than breaking an unknown digital signature algorithm, but of course that solution *would* be a lot more elegant. Oh well.
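The "oven" idea is just a FIFO queue with an age check at the head. Here is a minimal, offline sketch of that data structure, not the original code: the class and field names are hypothetical, the clock is injectable so the behaviour can be checked without waiting a day, and the baker thread that would fetch fresh cookies is left out entirely.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Optional

BAKING_SECONDS = 24 * 60 * 60   # cookies seem to need about a day

@dataclass(frozen=True)
class Cookie:
    id: str
    tm: int   # creation time from the TM field, seconds since the epoch

class CookieOven:
    """FIFO queue that only releases cookies older than the baking time."""

    def __init__(self, now: Callable[[], float] = time.time,
                 baking_seconds: int = BAKING_SECONDS) -> None:
        self._queue: Deque[Cookie] = deque()
        self._now = now
        self._baking_seconds = baking_seconds

    def put(self, cookie: Cookie) -> None:
        self._queue.append(cookie)

    def pop_baked(self) -> Optional[Cookie]:
        """Pop the head if it is old enough, otherwise return None."""
        if self._queue and self._now() - self._queue[0].tm >= self._baking_seconds:
            return self._queue.popleft()
        return None

    def __len__(self) -> int:
        return len(self._queue)

# Pretend "now" is exactly one day after the first cookie was created.
oven = CookieOven(now=lambda: 1109736400 + BAKING_SECONDS)
oven.put(Cookie("26b2149fe108b391", 1109736400))
oven.put(Cookie("0000000000000000", 1109736400 + 3600))
first_popped = oven.pop_baked()
second_popped = oven.pop_baked()
print(first_popped.id)   # 26b2149fe108b391
print(second_popped)     # None: the next cookie is an hour too young
```

Because cookies are queued in creation order, checking only the head is enough: if the oldest cookie is not ready, none of the others are either.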

I sent a link to this software (web-based) to Google, and I actually got a response! I think they actually changed up some things because the software broke now and then, but it seemed to work consistently after many algorithm improvements. I'll be honest, I was trying to impress them to get in on some of that rumored Google goodness (a.k.a. a job), but they ended up just sending me a Google t-shirt and some pens. Oh, and a note reminding me that my software violated their terms of service.

Instead of giving you the software, I'll give you a general overview of the algorithm and the regular expressions used to extract information:

(read code snippets at the page due to formatting issues.)
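As a stand-in for the general overview, the search order described earlier (try page 100, then 99, 101, 98, 102) can be sketched as a small generator. This is a reconstruction from the prose, not the original code: the function name, the `max_distance` cutoff, and the `hops` convention are all assumptions, and nothing here talks to a server.

```python
from typing import Iterator, Tuple

def search_candidates(target: int, max_distance: int,
                      first_page: int = 1) -> Iterator[Tuple[int, int]]:
    """Yield (page_to_search, hops) pairs in order of preference.

    hops > 0 means follow that many "next" links from the found page;
    hops < 0 means follow that many "previous" links.
    """
    yield target, 0
    for distance in range(1, max_distance + 1):
        if target - distance >= first_page:
            yield target - distance, distance      # walk forward via "next"
        yield target + distance, -distance         # walk back via "previous"

order = [page for page, _ in search_candidates(100, 2)]
print(order)   # [100, 99, 101, 98, 102]
```

The sketch does not know the length of the book, so pages past the end would simply be candidates that never turn up in a search.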

The regular expressions are not too advanced, but here they are (I think these are POSIX compliant, but don't quote me on that, and they may be really poorly written since I'm still learning the ropes on regexes):

Individual search-in-book results (captures URL only): <a href="(http://print.google.com/print\?.*?)">
Next page link on book viewing page (captures URL only): <a href="([^<>"]*)"><img align=middle alt="Next
Previous page link on book viewing page (captures URL only): <a href="([^<>"]*)"><img align=middle alt="Previous
The image URL on the book viewing page (it's hidden in CSS): \.theimg\s*\{\s*background-image:\s*url\(\s*"(.*)"\s*\)
Parsing Google's Cookie: \APREF=ID=([a-f0-9]{16}):TM=(\d{10}):LM=(\d{10}):S=([a-zA-Z0-9-_]{16})\z
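For anyone trying these out: lazy quantifiers (.*?) and the \s, \d, \A and \z escapes are Perl-style extensions rather than POSIX. The sketch below adapts three of the patterns to Python's `re` module and runs them against a made-up HTML fragment. The adaptations are assumptions of this sketch: `fullmatch` stands in for \A...\z, the image capture is made lazy, the "-" is moved to the end of the character class, and an optional leading "G" is accepted because the example cookie starts with GPREF while the pattern starts with PREF.

```python
import re

# Perl-style patterns from the list above, adapted to Python's re module.
NEXT_LINK = re.compile(r'<a href="([^<>"]*)"><img align=middle alt="Next')
IMAGE_URL = re.compile(
    r'\.theimg\s*\{\s*background-image:\s*url\(\s*"(.*?)"\s*\)')
COOKIE = re.compile(
    r'G?PREF=ID=([a-f0-9]{16}):TM=(\d{10}):LM=(\d{10}):S=([A-Za-z0-9_-]{16})')

# Made-up HTML fragment, shaped only to exercise the patterns.
html = ('<style>.theimg { background-image: '
        'url("http://example.invalid/p2.png") }</style>'
        '<a href="/print?id=X&pg=3"><img align=middle alt="Next Page">')

next_url = NEXT_LINK.search(html).group(1)
image_url = IMAGE_URL.search(html).group(1)
print(next_url)    # /print?id=X&pg=3
print(image_url)   # http://example.invalid/p2.png

match = COOKIE.fullmatch(
    "GPREF=ID=26b2149fe108b391:TM=1109736400:LM=1109736400:S=pbbDWyL8tVmJrILc")
cookie_id, tm, lm, sig = match.groups()
print(cookie_id, tm, lm, sig)
```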

Conclusion

Sorry, I'd distribute the software (open source, of course), but I don't want to incur "teh wrath of teh google." They have billions of dollars, and I'm a college student. Oh well. Back to the drawing board. Hope you enjoyed this little write up!

Hacking Google Print | 101 comments (68 topical, 33 editorial, 0 hidden)
-1 DNPIB (1.80 / 10) (#7)
by WetherMan on Mon Mar 07, 2005 at 02:15:39 PM EST

Do Not Post Intro In Body.
---
fluorescent lights make me look like old hot dogs
Can we actually use google print consistently? (none / 0) (#17)
by ultimai on Mon Mar 07, 2005 at 07:10:18 PM EST

A way to actually use google print consistently would be nice too. All my searches (including the ones given as examples on the http://print.google.com/ page) have come up with nothing.

Yes. (none / 0) (#25)
by isometrick on Mon Mar 07, 2005 at 09:05:20 PM EST

The main Google print site gives examples of this. I quote:

For example, when you search on "Books about Ecuador Trekking" or "Romeo and Juliet," and we find a book that contains content that matches your search terms, we'll show links to that book at the top of your search results.

[ Parent ]

US only? (none / 0) (#47)
by basj on Tue Mar 08, 2005 at 09:13:34 AM EST

The example searches only worked for me when I used a US-based proxy.
--
Complete the Three Year Plan in five years!
[ Parent ]
Interesting ... (none / 0) (#50)
by isometrick on Tue Mar 08, 2005 at 10:01:02 AM EST

Interesting find, basj. Where are you from (general area)? I had a friend in India and a friend in Australia who were able to use the links successfully.

Google has been known to restrict things by location, so it doesn't surprise me. Thanks for pointing this out!

[ Parent ]

the Netherlands (none / 0) (#55)
by basj on Tue Mar 08, 2005 at 10:48:39 AM EST

The fact that I couldn't use google print from the default http://www.google.nl page didn't really surprise me though: things like the translation tools and google news aren't available by default either. But if I change my default language to 'English' those services DO become available (still from a Dutch server, AFAIK). So I expected the same thing to happen with google print, but nothing came up.

Perhaps it is just a matter of time before the new software/data gets propagated to the Netherlands/Europe? That's sort of interesting even. I'll tell you when google print results start to show up. If they do.
--
Complete the Three Year Plan in five years!
[ Parent ]

Me too, but it works for me... (none / 0) (#57)
by Anonymous Cowpart on Tue Mar 08, 2005 at 02:08:02 PM EST

I tried it from the Netherlands, too (xs4all), using www.google.com, and I searched for "oliver twist" (without the quotation marks) and I did get a link to google print. Maybe you only set your language preference to English. Instead, choose 'go to google.com', then try again.

[ Parent ]
You're right! Thanks! [n/t] (none / 0) (#66)
by basj on Tue Mar 08, 2005 at 08:44:16 PM EST


--
Complete the Three Year Plan in five years!
[ Parent ]
Tried too. (none / 0) (#58)
by caine on Tue Mar 08, 2005 at 02:39:21 PM EST

Sweden here. Not only did I have to use a US-proxy, I had to tell Firefox to report US-en as my primary language. Guess they have it like that since they haven't checked the legal situation in all countries. But publishing book excerpts is legal in Sweden so shouldn't be a problem.

--

[ Parent ]

A secret (none / 0) (#79)
by merrymissfuckingpoppins on Wed Mar 09, 2005 at 11:30:03 PM EST

The Google print service isn't available to the public yet- and won't be until their database of books is large enough to warrant the service.

[ Parent ]
Not quite (none / 0) (#80)
by isometrick on Thu Mar 10, 2005 at 12:11:28 AM EST

The Google print service is quite available. Search the posts here for examples of how to access it. However, it has been noted that SOME localities cannot access it without a) changing their default language to English and/or b) using a US proxy.

Thanks.

[ Parent ]

I have no idea what the fuck you're talking about. (1.04 / 23) (#18)
by kitten on Mon Mar 07, 2005 at 07:52:26 PM EST

And nowhere in the article do I see anything that would make me care to know, either.

-1 for that. Another -1 for "teh wrath of teh google". And a final -1 for basically saying "I know something cool but I'm not telling you".

A combined total of -3 for this miserable, dreary disaster.
mirrorshades radio - darkwave, synthpop, industrial, futurepop.
OK (none / 1) (#26)
by isometrick on Mon Mar 07, 2005 at 09:07:55 PM EST

I'm sorry if you did not like the article. Many people think it is interesting. I'm sorry if you do not appreciate the humor, it may be a little too lowbrow for you.

However, I described the algorithm in detail. What more could you want to know? Any software developer worth his weight in beans can now implement this.

I just don't want to get sued. I do, however, want to share the general method with everyone.

Thanks.

[ Parent ]

meow (1.33 / 3) (#37)
by ciremrut on Tue Mar 08, 2005 at 12:08:02 AM EST

no litter box

[ Parent ]
re: kitten (none / 0) (#73)
by C Montgomery Burns on Wed Mar 09, 2005 at 01:18:01 PM EST

how long have you been around here?  and you still don't know how to make an editorial comment?

--
ALL GLORY TO THE HYPNOTOAD
Intelligent design
[ Parent ]
disturbingly obnoxious Google Print bug (none / 0) (#33)
by Blarney on Mon Mar 07, 2005 at 10:33:33 PM EST

I'm seeing quite a few comments in this thread where people dispute the actual existence of Google Print. Probably this relates to a certain bug which seems to be present with the service.

If you do a Google search for "Pride and prejudice", WITHOUT the quotes, you will get a Google Print link at the top of your search results. But if you enclose "Pride and Prejudice" in quotes, as you would ordinarily to request that Google provide an exact phrase match, there will be no Google Print link.

So if it's your habit to search this way, you won't see Google Print at all. I did just report this bug via the Google Print feedback form; hopefully it will not be present forever, perennially confusing people into believing the service to be a bizarrely prankish rumor.

Thanks (none / 0) (#34)
by isometrick on Mon Mar 07, 2005 at 10:38:57 PM EST

Thanks Blarney, I did not think of that. Hopefully that is the problem that people are having (although most seem to be able to find their own way).

I'm glad you took the time to figure this out instead of flying off the handle.

[ Parent ]

Region (none / 0) (#70)
by ffrinch on Wed Mar 09, 2005 at 04:10:05 AM EST

Neither one works for me, unless I use a US proxy. (I'm in Australia.) It's not surprising they haven't bothered checking the copyright status of every book in every country, but it's still bloody annoying.

-◊-
"I learned the hard way that rock music ... is a powerful demonic force controlled by Satan." — Jack Chick
[ Parent ]
Writing -1 + Content +2 = +1 (1.00 / 4) (#35)
by quincunx on Mon Mar 07, 2005 at 11:48:21 PM EST



-1, Computer Hacking. (1.03 / 28) (#39)
by the ghost of rmg on Tue Mar 08, 2005 at 12:21:34 AM EST

If it's not liberal propaganda it's some sort of illegal computer related activity. Reverse engineering, port scanning, and all the rest of the hacker tools are just the sort of thing Kuro5hin wants to put in the hands of whatever yahoo with internet access decides to show up and take them.

Frankly, I'm tired of worms, tired of increasing music prices, and tired of my second favorite search engine getting screwed with by a bunch of anarchists whose idea of a good time is bilking people out of money, then bitching when they try to recover it.

This is a -1. There are many others like it, but this one is mine.


rmg: comments better than yours.

I waited ... (none / 1) (#40)
by isometrick on Tue Mar 08, 2005 at 12:27:20 AM EST

If it's any consolation, I waited a few months before posting this on my blog and k5. They had plenty of time to fix the issue after I notified them.

Please also note that, at Google's request, I am not distributing the software. If someone has the necessary skill to implement this using the descriptions of the algorithms, then they would have thought of it on their own eventually anyway.

Thanks.

[ Parent ]

Oi, fuckwit... (none / 1) (#44)
by mirleid on Tue Mar 08, 2005 at 06:11:46 AM EST

From the google print FAQ:
I think I found a bug - who can consign it to oblivion?

Since we're still testing the product, you may indeed find bugs ('glitches' that haven't yet been worked out). If you find any good ones, or see anything else you think we could improve, please let us know about it. We welcome user feedback. In fact, at this stage of a product's development, we rely on it.

And lose the sig: using a quote from "Full Metal Jacket" in that context is just plain offensive...


Chickens don't give milk
[ Parent ]
Reverse engineering (none / 0) (#56)
by p3d0 on Tue Mar 08, 2005 at 12:38:51 PM EST

Where did everyone get the idea that reverse engineering is illegal or unethical? The only reason you aren't allowed to do it sometimes is that you have agreed to license terms that forbid it.
--
Patrick Doyle
My comments do not reflect the opinions of my employer.
[ Parent ]
This is a 0. There are many like it... (1.60 / 5) (#71)
by pwhysall on Wed Mar 09, 2005 at 07:12:32 AM EST

...but this one is yours.

You're played out.

Time to move on.
--
Peter
K5 Editors
I'm going to wager that the story keeps getting dumped because it is a steaming pile of badly formatted fool-meme.
CheeseBurgerBrown
[ Parent ]

CDs are expensive (none / 0) (#76)
by Nosf3ratu on Wed Mar 09, 2005 at 02:25:01 PM EST

because record companies are greedy.

Your troll is dull and so is this obligatory reply.


Woo!
[ Parent ]

if it's not fundie propaganda, it's some sort of (none / 0) (#93)
by elpapa on Mon Mar 21, 2005 at 06:55:23 PM EST

Ignorant comments from the ignorant side of the middle class, usually self-proclaimed "professionals", petty criminals, scammers and other wannabes with a little more money than the average.

People bashing, social security syphoning, God praying while scamming their own simpleton neighbours...that's the bunch of people who think they deserve all the goods of freedom without any cost or risk.

They scam hard-working, honest, simple people out of money with their people-friendly attitude, explaining how copyright is good for people when it only benefits very, very few people, and explaining how drugs are bad when they sell legal, but no less dangerous, drugs in their shops.

The most interesting thing, they think Republicans support  small business like them ! WAKE UP that's history there are no Reps in charge now..the remaining "Reps" are being cornered by a new wave of politicians that show how HARD they are on corporate by forcing them not to say "dirty words" on radio or TV.

Next, take every dime from Social Security. How curious, always singing how good the old times were...they quickly forget Social Security made the old times A LOT better..because the past sucked a lot, especially if you were a little worker with very little money.

[ Parent ]

Did I miss something? (none / 0) (#94)
by freemumia on Mon Mar 28, 2005 at 08:33:25 AM EST

"Liberal"!? Did I miss something! As a pristine Latino, I notice I am saddened!!! Wouldn't you agree, it's like 1931 all over again. UNLIKE you and Ann Coulter, I am not in love with theft. Say no to lies AND censorship!!! Say no to our government of the pigs, by the fundies, and for the DEA!! U.S. get out of developing nations immediately! I'm sure, I am not one of Mel Gibson's society of rabid yes-men!! You know, George W(armonger) Bush only wants IRAQ for the oil. LMAO! Unless I'm crazy, when the sadists say "saving marriage," they really mean "capitalism"!!!! When they say "Federal Marriage Amendment," it goes without saying it is just a code word for "fear"!!!

[ Parent ]
Nice. (none / 0) (#42)
by Polverone on Tue Mar 08, 2005 at 01:47:02 AM EST

Ideally, though, you would have kept vewwwy quiet (hunting wabbits) about this until you had 50 GB or so of complete books accumulated, then upload them to Usenet. There are lots of web services with imperfect access control measures, but this service has (or will have) a lot of stuff worth taking. It would be a shame if they improved the controls thanks to public discussion of their weaknesses.

For your next trick, can you show how to use cookies and regexes to turn Google Groups into the nice Usenet archive it used to be and not the abomination it has become?
--
It's not a just, good idea; it's the law.

For now, thank god for Canada (none / 0) (#46)
by duffbeer703 on Tue Mar 08, 2005 at 08:58:44 AM EST

http://groups.google.ca/ is still the improved 1.0 version of the service.

[ Parent ]
Nice article (3.00 / 3) (#45)
by m a r c on Tue Mar 08, 2005 at 06:43:56 AM EST

Very interesting read. Now that you have informed google, I wonder what they can do about it? How can they differentiate between one user with 'baked' cookies (like yr use of language on this one) and several different users? A heavy-handed way would be to record the IP of the user; consecutive accesses to the same book from the same IP but with different cookies might indicate use of a program such as yours.
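A rough sketch of that server-side heuristic, purely hypothetical: nothing here reflects what Google actually does, and the class name and the `max_cookies` threshold are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

class CookieChurnDetector:
    """Count distinct cookie IDs seen per (ip, book) pair."""

    def __init__(self, max_cookies: int = 3) -> None:
        self._seen: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
        self._max_cookies = max_cookies

    def record(self, ip: str, book_id: str, cookie_id: str) -> bool:
        """Record one page view; return True if the pair looks suspicious."""
        ids = self._seen[(ip, book_id)]
        ids.add(cookie_id)
        return len(ids) > self._max_cookies

detector = CookieChurnDetector(max_cookies=2)
flags = [detector.record("10.0.0.1", "VvBRboW2icUC", cid)
         for cid in ("a", "b", "c")]
print(flags)   # [False, False, True]
```

A counter like this would also flag many legitimate readers who happen to share one IP address, so the threshold would need care.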
I got a dog and named him "Stay". Now, I go "Come here, Stay!". After a while, the dog went insane and wouldn't move at all.
by IP would probably help. (3.00 / 2) (#48)
by delmoi on Tue Mar 08, 2005 at 09:23:11 AM EST

I'm sure this guy was using the same IP address for each request. Blocking by IP does have some downsides, like screwing over users who are using proxies, though. I'm sure they could think up some good heuristics to prevent people from doing this based on IP filtering.
--
"'argumentation' is not a word, idiot." -- thelizman
[ Parent ]
Good ideas (none / 1) (#52)
by isometrick on Tue Mar 08, 2005 at 10:14:36 AM EST

The idea of heuristics and IP based restriction came to me also. The problem with IP based restriction, as delmoi noted, is that users with proxies are screwed. At many schools (including mine) thousands of people use the same IP address. There are probably many other institutions where this is also the case.

Google more than likely doesn't want to restrict a potentially large part of their customer base (especially academia, because I think Google Print is largely targeted at professors and students).

I also thought a little bit further, and I decided that this IP based restriction might be easily thwarted with P2P. Imagine that you and 9 of your friends are running a client copy of this software. You decide you want the first 150 pages of Pride and Prejudice. With the possibility of caching ignored for simplicity, you grab pages (n), ..., (n+14); your first friend grabs (n+15), ..., (n+29); and so on. Problem solved (I think).

This is exactly the kind of discussion that I want to see, though. Thanks for your feedback!
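The page split in that example is plain range partitioning. A tiny sketch, with a made-up function name and no networking at all:

```python
from typing import List

def split_pages(first: int, count: int, peers: int) -> List[range]:
    """Divide `count` consecutive pages starting at `first` among `peers`."""
    base, extra = divmod(count, peers)
    chunks, start = [], first
    for i in range(peers):
        size = base + (1 if i < extra else 0)   # spread any remainder
        chunks.append(range(start, start + size))
        start += size
    return chunks

chunks = split_pages(first=1, count=150, peers=10)
print(chunks[0], chunks[1])   # range(1, 16) range(16, 31)
```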

[ Parent ]

Tor (none / 0) (#65)
by Aero Leviathan on Tue Mar 08, 2005 at 08:18:11 PM EST

Or if you pipe your software through something like Tor, you get... lots ;P
~ Aero
[ Parent ]
Code images (none / 0) (#74)
by Aimaz on Wed Mar 09, 2005 at 01:35:49 PM EST

You know those little images that some sites use to make sure you're not a bot? They have some letters or numbers in a box, with something drawn over the top to make it harder for OCR to read. After an IP has accessed a large number of pages, they could use that to check you're not a bot and, if you're not, re-enable access for another batch of pages and keep checking like that.
Aimaz -----
[ Parent ]
Yep (none / 0) (#75)
by isometrick on Wed Mar 09, 2005 at 01:40:08 PM EST

It's called a captcha. They use them on gmail ... but in general it's quite a hassle for the user. I wondered about this too. Good call!

[ Parent ]
Defeating captcha (3.00 / 4) (#83)
by rusty on Thu Mar 10, 2005 at 11:23:04 AM EST

My absolutely most favorite internet-related story of the year: defeating CAPTCHA with free porn.

____
Not the real rusty
[ Parent ]
I don't use the acronym often, but ... (none / 0) (#84)
by isometrick on Thu Mar 10, 2005 at 12:05:30 PM EST

ROFL. :)

That is quite possibly the funniest workaround ever. It seems like the porn/spam industry is always a couple of steps ahead of the rest of us ... how devious.

Heh.

[ Parent ]

Where is the google print... (none / 1) (#51)
by anmo on Tue Mar 08, 2005 at 10:08:34 AM EST

... search page? I can't find a link in the FAQ and I tried everything.

Some notes (none / 0) (#53)
by isometrick on Tue Mar 08, 2005 at 10:21:07 AM EST

Anmo,

Some users have complained that Google Print access is not enabled in their country. I checked with a few friends from different countries, but (obviously) I couldn't check every country.

You may try any of the following links:

Another user was able to use a U.S. proxy to bypass this restriction. Hope this helps!



[ Parent ]
Simple fix? (none / 0) (#59)
by StangDriver on Tue Mar 08, 2005 at 03:00:21 PM EST

If they want to restrict the pages you can read, why not just cut off the book after page 15? Meaning, no matter how many pages they have you tagged for, you can't read past page 15.

That sounds overly simplistic so what am I overlooking?

How? (none / 0) (#60)
by isometrick on Tue Mar 08, 2005 at 03:32:15 PM EST

How would they cut off the book after page 15?

Remember, their goal is to allow the entire book to be viewable and searchable, not just to give a preview to encourage purchases like Amazon.

The only way they have to identify you individually in order to restrict viewable pages is by IP and the ID/timestamp pair in your cookie. The inadequacy of these is discussed in the article and other comments.

[ Parent ]

smaller database... (none / 0) (#78)
by merrymissfuckingpoppins on Wed Mar 09, 2005 at 11:26:24 PM EST

Because that would limit the number of pages available, which would limit the amount of searches...

[ Parent ]
And in related news .... (3.00 / 2) (#62)
by isometrick on Tue Mar 08, 2005 at 06:19:29 PM EST

I don't want to jump the gun too soon, but it looks like my site has been blacklisted on Google!

Searching for "greg duffy" returned my site first yesterday, now it is nowhere to be found ... if I can't be Googled, do I exist?

Wonderful. I hope this isn't a precursor of things to come ...

I just want to clarify ... (none / 0) (#63)
by isometrick on Tue Mar 08, 2005 at 06:32:31 PM EST

That I don't know this for sure ... it just seems like insanely convenient timing.

[ Parent ]
You're probably right... (none / 0) (#64)
by astopy on Tue Mar 08, 2005 at 07:08:28 PM EST

Searching for gregduffy.com gives the message "Sorry, no information is available for the URL gregduffy.com". I guess Google don't appreciate you sharing this.

[ Parent ]
Nahhhh (none / 0) (#67)
by Maserati on Wed Mar 09, 2005 at 02:12:37 AM EST

"gregduffy.+com" gives about 605 results. It can't be that bad.

--

For the wise a hint, for the fool a stick.
[ Parent ]

Not quite (none / 0) (#68)
by isometrick on Wed Mar 09, 2005 at 02:18:55 AM EST

It seems that I'm still in the index (for instance, you can see my site's results by site:gregduffy.com), but for most keywords of any kind (a few worked even yesterday) you don't see my site.

Try searching for Greg Duffy on MSN, Yahoo, or just about any other search engine. Then try Google. Yesterday I was at the top of Google for that exact search.

Seems suspicious to me that it just changed today ...

[ Parent ]

Guess you ticked off the beast ... (none / 0) (#72)
by Chakotay on Wed Mar 09, 2005 at 09:57:38 AM EST

... and now it's come to get you the only way it knows, or can.

And you know, if I were Google, I would do the same. Why would I let my own search engine serve up content allowing to hack me? Sure, pure white and bright fluffy bunny flower power make love not war, that's a no-no, but the world is not pure white and bright fluffy bunny fl... ahem, you get the picture :)

--
Linux like wigwam. No windows, no gates, Apache inside.

[ Parent ]

just the other day they were caught (none / 0) (#82)
by kpaul on Thu Mar 10, 2005 at 09:58:35 AM EST

cloaking...

i hope it is just a coincidence on your part because i had that pagejacking article a while back.... my traffic from Goog dropped about a month or so before i wrote that, tho...


2014 Halloween Costumes
[ Parent ]

Nope. (none / 1) (#86)
by GoogleGuy on Thu Mar 10, 2005 at 04:52:56 PM EST

Hi, I'm an engineer at Google. I thought I'd stop by and clear up this "my site has been blacklisted on Google" misconception, because it hasn't. Earlier today I emailed someone from our crawl team, just to verify that Google didn't do anything on our side. Here's what I got back as a reply:

"His site has been unreachable from time to time. The most common case appears to be failure to resolve DNS for his site. The last time we tried to crawl his homepage (and many other pages on his site) was on the 6th of March and were unable to get access on multiple tries. Since we are unable to get his homepage, we were unable to determine outgoing links and gradually lost coverage. We have had the homepage in the index in the past (as recent as Feb 24th)."

Hope that helps. Just to reiterate, Google didn't take action of any kind against your site.
GoogleGuy

[ Parent ]

Thanks, here's an update ... (none / 0) (#87)
by isometrick on Thu Mar 10, 2005 at 05:44:34 PM EST

Update

[ Parent ]
I saw that too (none / 0) (#88)
by rusty on Fri Mar 11, 2005 at 07:03:56 AM EST

It actually didn't resolve for me when I tried to look at it after reading the thread parent. Thanks for the info.

____
Not the real rusty
[ Parent ]
Yeah, (none / 0) (#89)
by isometrick on Fri Mar 11, 2005 at 12:29:42 PM EST

I wish people would put the conspiracy theories to rest. I guess it's OK to talk about the general "what if" issue, but I'd just like to say that Google is NOT sitting in their Mountain View Lair of Doom trying to find ways to make me disappear.

It is, of course, quite possible that there was some error between them and my DNS (I don't know where the error was, but I'll be talking to Affinity more in the next few days). The only thing that I've asserted is that I don't know how or what ... and that it just seemed weird.

GoogleGuy is a legitimate poster from Google, as you've noted (I saw his info on my website too). I think he is telling the truth as he knows it, and I think the most likely answer is that this was a technical problem.

[ Parent ]

2600 (none / 0) (#69)
by Jebediah on Wed Mar 09, 2005 at 02:53:27 AM EST

Thanks. I have no clue about some of this but it reminds me of the old 2600 mags I used to flip through. Good stuff.

create a p2p to retrieve entire book? (none / 0) (#77)
by chucklarge on Wed Mar 09, 2005 at 05:24:28 PM EST

seems like you could expand your idea and create a p2p that farms out the page requests. by equally distributing them, you can help to avoid the hard limits. just an idea...
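
The "equally distributing" part is just a round-robin split of page numbers across peers. A minimal sketch of that split only, with no networking; `assign_pages` and the peer names are hypothetical and nothing here comes from the original code:

```python
from typing import Dict, List


def assign_pages(num_pages: int, peers: List[str]) -> Dict[str, List[int]]:
    """Deal page numbers 1..num_pages out to peers in round-robin order."""
    if not peers:
        raise ValueError("need at least one peer")
    plan: Dict[str, List[int]] = {peer: [] for peer in peers}
    for page in range(1, num_pages + 1):
        # Page 1 goes to the first peer, page 2 to the second, and so on.
        plan[peers[(page - 1) % len(peers)]].append(page)
    return plan


plan = assign_pages(10, ["peer-a", "peer-b", "peer-c"])
for peer, pages in plan.items():
    print(peer, pages)
```

With this split no peer ends up with more than one page beyond any other, which is the whole point of the suggestion.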

Google print is gone? (none / 0) (#81)
by flute on Thu Mar 10, 2005 at 07:47:44 AM EST

Looks like someone at Google has read this story. Google print no longer seems to be working for me.

Here's a followup on some of the issues ... (none / 1) (#85)
by isometrick on Thu Mar 10, 2005 at 01:32:41 PM EST

Followup

the first rule of hacking google (none / 0) (#90)
by soundproofing on Fri Mar 11, 2005 at 06:38:31 PM EST

is you do NOT talk about hacking google.
soundproofing, noise control, vibration damping, and acoustics consultant and engineer. http://soundproof.mine.nu/
Anonymous use of Google Print (none / 0) (#91)
by Milly on Sun Mar 13, 2005 at 01:45:11 AM EST

"Having the unique ID means they are most definitely storing *something* on the server side, but don't worry it's probably only analyzed in aggregate unless you are one of Sergey's ex-girlfriends :-p."
As a rule, Sergey's ex-girlfriends take precautions ;)
"It appears that the signature is 16 characters long, case-sensitive, and alpha-numeric only, giving (10+26+26)^16 possibilities"
Both hyphens and underscores are definitely allowed too. Er, I'm no Greg Duffy, so you do the math ...
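
To put rough numbers on that: if the alphabet really is the 62 alphanumeric characters plus hyphen and underscore (64 symbols in total, which is an assumption based on the observation above), the count grows from 62^16 to 64^16. A quick sketch of the arithmetic:

```python
import math


def signature_space(alphabet_size: int, length: int = 16) -> int:
    """Number of distinct strings of the given length over the alphabet."""
    return alphabet_size ** length


alnum_only = signature_space(10 + 26 + 26)       # digits + lowercase + uppercase
with_extras = signature_space(10 + 26 + 26 + 2)  # plus '-' and '_' (assumed)

print(f"alphanumeric only: {alnum_only:.3e} ({math.log2(alnum_only):.1f} bits)")
print(f"with '-' and '_':  {with_extras:.3e} ({math.log2(with_extras):.1f} bits)")
```

With 64 symbols each character carries exactly 6 bits, so 16 characters give 2^96 possibilities, roughly 1.7 times the alphanumeric-only figure.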

Thanks for the analysis, from which I've added a section on the potential drawbacks of becoming largely anonymous to Google.

Using a zeroed (or otherwise anonymized) cookie GUID doesn't preclude using Google Print, but it seems that if you browse a book at Google Print, and if someone else has been browsing the same book in the last 24 hours (quite a coincidence), you might find that the content viewing limits have already been reached, or will be reached sooner than otherwise, and you might have to wait 24 hours to read some more.

Also, using the GoogleAnon bookmarklet will trigger a reset of the cookie's date the next time Google reads it (because the signature/hash will be incorrect, the server will generate a new date and signature/hash). The anonymous ID will not be overwritten, but if you've used GoogleAnon and visited Google within the last 24 hours, the 'Search within this book' feature may be missing, even if another zeroed cookie user hasn't browsed the same book (but you'll still be able to browse pages/snippets yourself, until you reach the normal content viewing limits).
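
The reset behaviour described above could be modelled roughly like this. It is only a sketch of the observed effect: the HMAC scheme, the secret key, and the field names `id`, `tm`, and `sig` are all assumptions, since the real signing method is not public.

```python
import hashlib
import hmac

# Hypothetical server-side secret; the real key and algorithm are unknown.
SECRET = b"not-the-real-key"


def sign(guid: str, timestamp: int) -> str:
    """Return a 16-character signature over the ID and date (assumed scheme)."""
    msg = f"{guid}:{timestamp}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()[:16]


def refresh_cookie(cookie: dict, now: int) -> dict:
    """Keep the ID; reissue the date and signature if the signature is wrong."""
    expected = sign(cookie["id"], cookie["tm"])
    if hmac.compare_digest(expected, cookie["sig"]):
        return cookie  # signature verifies, nothing changes
    # Signature mismatch (e.g. a zeroed ID): the ID is left alone,
    # but the date and signature are regenerated.
    return {"id": cookie["id"], "tm": now, "sig": sign(cookie["id"], now)}


zeroed = {"id": "0" * 16, "tm": 1110000000, "sig": "stale-signature!"}
print(refresh_cookie(zeroed, now=1110700000))
```

Under these assumptions the zeroed ID survives the round trip while the date jumps to the time of the visit, which matches the behaviour described above.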

I guess very heavy users of Google Print might want to save a few old cookies themselves. And of course it's entirely incompatible with your clever hack ;)

That wrinkle apart, I think most people could use an anonymized cookie and Google Print without difficulty.

In a nutshell... (none / 0) (#92)
by Vesperto on Sun Mar 13, 2005 at 03:22:28 PM EST

"I'll be honest, I was trying to impress them to get in on some of that rumored Google goodness (a.k.a. a job), but they ended up just sending me a Google t-shirt and some pens. Oh, and a note reminding me that my software violated their terms of service."
I find the last sentence particularly amusing.
_____________________________
If you disagree post, don't moderate.
Not a Premium User.
Also Skeptical about Google (none / 0) (#95)
by bulk sms on Thu Apr 28, 2005 at 08:24:58 PM EST

I too am sceptical of the inner workings of Google. Although I have yet to experience it myself, I have heard from others that Google is placing adverts on people's sites without their permission, even adverts for their competitors. As Google likes to keep their work a secret, this kind of research IS useful, but I doubt we'll ever know the truth unless one of their employees gets overcome with guilt and spills the beans. Let's see.
Save me please from my life of sms online!
Google makes everyone better (none / 0) (#96)
by dogeye on Fri Jun 10, 2005 at 01:21:50 AM EST

Has anyone noticed that anytime google touches anything, their product is always superior to everything else on the market and it immediately causes the competition to improve themselves?

They've recently invested in keyhole.com, and now keyhole has maps of the entire world! The faster google expands, the happier I'll be.

-Ryan the Infrared Sauna guy

Golden Google (none / 0) (#97)
by merrymissfuckingpoppins on Fri Jun 10, 2005 at 02:52:08 AM EST

I doubt their midas touch can last forever...

[ Parent ]
good print (none / 0) (#99)
by soart on Mon Jun 20, 2005 at 11:01:11 PM EST

A way to actually use google print consistently would be nice too.
THIS IS VERY FUNNY (none / 0) (#100)
by prima1 on Tue Jun 21, 2005 at 12:11:25 PM EST

Man, this is very funny, because just yesterday I found a way to browse through whole books and save their pages, and I am no computer expert. I found out that if I type the number 200 in the search box inside the book, it will show all the pages of the book, not missing one single page. Every time my account doesn't allow me to view more pages, I simply create a new account with a fake email and password; no need to verify. I have tons of accounts now, and I just enter a new one every time and continue browsing. But some pages of the books are restricted no matter what: the restricted pages run from the middle of the book to the end, whereas the first half of the book is always viewable. Anyway, it doesn't seem to work anymore. I am happy I saved everything that I needed. They do not have everything anyway: many books do not exist and many are old editions.

Low text resolution:( (none / 0) (#101)
by azerty on Thu Jun 30, 2005 at 04:17:45 AM EST

In fact you don't need to know anything about cookies to download the whole book. I did this by entering the next page number in the search box and logging in and out from time to time. But the real problem is that when I tried to print out the book, I realized that the quality is too bad for me to read from the paper. Oops :( I spent some 5 hours downloading... The problem is that the resolution is too low for the text to be recognised. At least, I have used ABBYY FineReader, and it failed to recognize the text. Does anybody know how to solve this problem?

tonmi (none / 0) (#102)
by tonmi on Tue Jan 03, 2006 at 11:14:01 AM EST

They've recently invested in keyhole.com, and now keyhole has maps of the entire world! The faster google expands, the happier I'll be.

ocr (none / 0) (#103)
by Seifer on Mon Jan 16, 2006 at 10:21:12 AM EST

well i can't ocr with finereader either. does anyone know why? the picture seems ok, but when i try to ocr it the result gets messy.

Hacking Google Print | 101 comments (68 topical, 33 editorial, 0 hidden)
