We knew the web was big...
How do we find all those pages? We start at a set of well-connected initial pages and follow each of their links to new pages. Then we follow the links on those new pages to even more pages and so on, until we have a huge list of links. In fact, we found even more than 1 trillion individual links, but not all of them lead to unique web pages. Many pages have multiple URLs with exactly the same content or URLs that are auto-generated copies of each other. Even after removing those exact duplicates, we saw a trillion unique URLs, and the number of individual web pages out there is growing by several billion pages per day.
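The discovery process described above (start from seed pages, follow links outward, drop exact duplicates) can be sketched as a breadth-first traversal. This is only an illustrative sketch, not Google's crawler: the toy in-memory web `TOY_WEB`, the `crawl` function, and the `max_pages` cap are all assumptions made for the example, and content is compared by a simple SHA-256 digest.

```python
from collections import deque
import hashlib

# Hypothetical toy "web": URL -> (page content, outgoing links).
TOY_WEB = {
    "a.example/": ("home", ["a.example/news", "a.example/index.html"]),
    "a.example/index.html": ("home", ["a.example/news"]),  # same content as "a.example/"
    "a.example/news": ("news", ["b.example/"]),
    "b.example/": ("other site", ["a.example/"]),
}

def crawl(seeds, web, max_pages=1000):
    """Breadth-first crawl; returns (unique pages, number of URLs discovered)."""
    queue = deque(seeds)
    seen_urls = set(seeds)      # every URL discovered so far
    seen_digests = set()        # content digests of pages already kept
    unique_pages = []
    while queue and len(unique_pages) < max_pages:
        url = queue.popleft()
        if url not in web:
            continue            # dead link in the toy web
        content, links = web[url]
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen_digests:
            continue            # exact duplicate under a different URL
        seen_digests.add(digest)
        unique_pages.append(url)
        for link in links:
            if link not in seen_urls:
                seen_urls.add(link)
                queue.append(link)
    return unique_pages, len(seen_urls)

unique, discovered = crawl(["a.example/"], TOY_WEB)
print(unique, discovered)
```

The cap on `max_pages` matters because auto-generated links can otherwise keep the queue growing without end.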
So how many unique pages does the web really contain? We don't know; we don't have time to look at them all! :-) Strictly speaking, the number of pages out there is infinite -- for example, web calendars may have a "next day" link, and we could follow that link forever, each time finding a "new" page. We're not doing that, obviously, since there would be little benefit to you. But this example shows that the size of the web really depends on your definition of what's a useful page, and there is no exact answer.
We don't index every one of those trillion pages -- many of them are similar to each other, or represent auto-generated content similar to the calendar example that isn't very useful to searchers. But we're proud to have the most comprehensive index of any search engine, and our goal always has been to index all the world's data.
To keep up with this volume of information, our systems have come a long way since the first set of web data Google processed to answer queries. Back then, we did everything in batches: one workstation could compute the PageRank graph on 26 million pages in a couple of hours, and that set of pages would be used as Google's index for a fixed period of time. Today, Google downloads the web continuously, collecting updated page information and re-processing the entire web-link graph several times per day. This graph of one trillion URLs is similar to a map made up of one trillion intersections. So multiple times every day, we do the computational equivalent of fully exploring every intersection of every road in the United States. Except it'd be a map about 50,000 times as big as the U.S., with 50,000 times as many roads and intersections.
As you can see, our distributed infrastructure allows applications to efficiently traverse a link graph with many trillions of connections, or quickly sort petabytes of data, just to prepare to answer the most important question: your next Google search.
From the Google blog
http://googleblog.blogspot.com/2008/...b-was-big.html
Google gives GMail always-on encryption
By Dan Goodin in San Francisco
http://www.theregister.co.uk/2008/07/25/gmail_adds_https_only/
Google is adding a much-demanded feature to its email service that offers improved security by ensuring users get an encrypted connection each time they access their account via a web connection.
The new option means email sessions are automatically protected from start to finish with the secure sockets layer protocol even if a user accesses the account by typing http://gmail.com, rather than https://gmail.com/ (notice the presence of "https" in the latter).
The move helps protect users against a vulnerability known as sidejacking, which researcher Rob Graham of Errata Security warned against last year. It turns out the vast majority of websites drop the SSL protection as soon as a user has logged in. This allows attackers to snoop on web sessions over unsecured Wi-Fi connections even when a password was typed into a page during an encrypted session.
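On the server side, one general defence against this kind of session snooping is to mark the session cookie so the browser only ever sends it over HTTPS. The sketch below uses Python's standard `http.cookies` module; it illustrates the general mechanism and is not a description of how Gmail implements its setting. The function name `session_cookie_header` and the cookie name `session` are hypothetical.

```python
from http.cookies import SimpleCookie

def session_cookie_header(session_id):
    """Build a Set-Cookie header for a session cookie restricted to HTTPS."""
    cookie = SimpleCookie()
    cookie["session"] = session_id
    # Secure: the browser will not send this cookie over plain HTTP.
    cookie["session"]["secure"] = True
    # HttpOnly: page scripts cannot read the cookie.
    cookie["session"]["httponly"] = True
    return cookie.output()

print(session_cookie_header("abc123"))
```

A cookie set this way is withheld from any unencrypted request, so dropping back to plain HTTP after login no longer exposes it on an open Wi-Fi network.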
Google is one of the only services we know of that guards against this threat by offering start-to-finish SSL protection. But up to now, users ran the risk that a connection might inadvertently be unprotected, either because they forgot to type in the correct URL or the connection was reset.
To turn on the feature, open your GMail account, choose settings and scroll to the bottom of the page. In the section labeled "Browser Connection," choose the radio button that says "Always use https." Google warns the protection could slow down connections, so if you don't use insecure networks you may not want to bother. The offering doesn't appear to be available yet for Google Apps.
If only eBay, Yahoo Mail, MySpace, Facebook and the rest of the gang would follow suit.
Microsoft Challenges Google's PageRank Technology
By Mark Long
http://www.crm-daily.com/story.xhtml?story_id=60984
Google's PageRank Web site-ranking method is being challenged by Microsoft. Microsoft's new tool, BrowseRank, aims to add a human factor to the site-ranking process. Microsoft claims PageRank does not take into account frequency and staying time of Web site visits, while BrowseRank monitors user behavior data to calculate page importance.
Microsoft engineers, in collaboration with researchers at several Asian institutions, have proposed a new method for improving upon the Web page rankings produced by today's search engine requests. Called BrowseRank, the new approach adds a human factor to the process by weighing how people actually use the Internet, the collaborators reported in a paper recently presented before the Special Interest Group on Information Retrieval.
"The more visits [to] the page made by the users, and the longer time periods spent by the users on the page, the more likely the page is important," the paper's authors noted. The goal is to "leverage hundreds of millions of users' 'implicit voting' on page importance," they said, "in accordance with the concept of Web 2.0."
Missing the Mark
Google's trademarked PageRank method measures the relative importance of Web pages through the use of a sequence of data-processing instructions -- called a link analysis algorithm -- that assigns a numerical weighting to each element within any given set of hyperlinked documents.
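A minimal sketch of a link analysis algorithm in the PageRank style is shown here, using plain power iteration over a small link graph. It follows the textbook formulation rather than anything Google has confirmed about its production system; the damping factor of 0.85, the iteration count, and the example page names are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with equal weight
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its weight evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its weight over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical graph: three pages all link to "hub", which links back to "a".
example = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
ranks = pagerank(example)
print(max(ranks, key=ranks.get))
```

In this toy graph the page with the most incoming links ends up with the largest weight, which is the behaviour the Adobe.com example in the article turns on.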
"Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results," Google said. "We have always taken a pragmatic approach to help improve search quality and create useful products, and our technology uses the collective intelligence of the Web to determine a page's importance."
Gauging the relevance of Internet searches is extremely important to Google, Yahoo and Microsoft because it allows the search engine leaders to more precisely target their placement of ads on behalf of clients. But Microsoft and its collaborators claim that PageRank misses the mark because it allows the importance of pages to become artificially inflated.
For example, Web sites such as Adobe.com are ranked very high by PageRank because Adobe.com has millions of sites linking to it for Acrobat Reader and Flash Player downloads. "However, Web users do not really visit such Web sites very frequently, and they should not be regarded [as] more important than the Web sites on which users spend much more time, like MySpace.com and Facebook.com," they explained.
Giving Users a Vote
Microsoft and its academic collaborators say their new method is superior because it is based on a user-browsing graph that is generated from data that reflects actual human behavior. "User-behavior data can be recorded by Internet browsers at Web clients and collected at a Web server," they said.
BrowseRank's user-browsing graph can more precisely represent the Web surfer's random walk process, and thus is more useful for calculating page importance, the collaborators claim. Furthermore, the amount of time spent on the pages by users is also included under the BrowseRank method.
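To make the idea concrete, here is a heavily simplified sketch of scoring pages from browsing data: transitions observed in user sessions drive a random walk, and each page's visit probability is then weighted by the average time users stayed on it. This is an illustration only and not the algorithm from the BrowseRank paper; the function name `browse_importance`, the session format, and the damping value are all assumptions.

```python
from collections import defaultdict

def browse_importance(sessions, damping=0.85, iterations=100):
    """sessions: list of sessions, each a list of (page, seconds_on_page)."""
    counts = defaultdict(lambda: defaultdict(int))  # observed page -> next page clicks
    total_time = defaultdict(float)
    visits = defaultdict(int)
    for session in sessions:
        for i, (page, seconds) in enumerate(session):
            total_time[page] += seconds
            visits[page] += 1
            if i + 1 < len(session):
                counts[page][session[i + 1][0]] += 1
    pages = sorted(visits)
    n = len(pages)
    visit_prob = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            out = counts.get(p)
            total = sum(out.values()) if out else 0
            if total:
                for q, c in out.items():
                    nxt[q] += damping * visit_prob[p] * c / total
            else:
                # No observed exits from this page: jump to any page.
                for q in pages:
                    nxt[q] += damping * visit_prob[p] / n
        visit_prob = nxt
    # Weight how often a page is reached by how long users stay on it.
    weighted = {p: visit_prob[p] * total_time[p] / visits[p] for p in pages}
    norm = sum(weighted.values())
    return {p: w / norm for p, w in weighted.items()}

# Hypothetical sessions: both pages are reached equally often from "portal",
# but users stay far longer on "social" than on "downloads".
sessions = [
    [("portal", 10), ("downloads", 5), ("portal", 10), ("social", 600)],
    [("portal", 10), ("social", 300), ("portal", 10), ("downloads", 5)],
]
scores = browse_importance(sessions)
print(sorted(scores, key=scores.get, reverse=True))
```

With these made-up sessions the long-stay page outranks the short-stay one even though both are visited equally often, which is the effect the researchers describe.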
"In this way, we can leverage hundreds of millions of users' implicit voting on page importance," researchers explained. "Experimental results show that BrowseRank indeed outperforms the baseline methods, such as PageRank and TrustRank, in several tasks."
For its part, Google notes that PageRank, which is based on a Stanford University patent, is not the only method it employs to rank search engine results. Instead, Google said it relies on more than 200 different signals to examine the entire link structure of the Web and determine which pages are most important.
"We then conduct hypertext-matching analysis to determine which pages are relevant to the specific search being conducted," Google explained. "By combining overall importance and query-specific relevance, we're able to put the most relevant and reliable results first."
Advice to employees on proper use of the System Administrator's valuable time
(In following examples, we will substitute the name "Ted" as the System Administrator)
- Make sure to save all your MP3 files on your network drive. No sense in wasting valuable space on your local drive! Plus, Ted loves browsing through 100+ GB of music files while he backs up the servers.
- Play with all the wires you can find. If you can't find enough, open something up to expose them. After you have finished, and nothing works anymore, put it all back together and call Ted. Deny that you touched anything and that it was working perfectly only five minutes ago. Ted just loves a good mystery. For added effect you can keep looking over his shoulder and ask what each wire is for.
- Never write down error messages. Just click OK, or restart your computer. Ted likes to guess what the error message was.
- When talking about your computer, use terms like "Thingy" and "Big Connector."
- If you get an EXE file in an email attachment, open it immediately. Ted likes to make sure the anti-virus software is working properly.
- When Ted says he's coming right over, log out and go for coffee. It's no problem for him to remember your password.
- When you call Ted to have your computer moved, be sure to leave it buried under a year-old pile of postcards, baby pictures, stuffed animals, dried flowers, unpaid bills, bowling trophies and Popsicle sticks. Ted doesn't have a life, and he finds it deeply moving to catch a glimpse of yours.
- When Ted sends you an email marked as "Highly Important" or "Action Required", delete it at once. He's probably just testing some new-fangled email software.
- When Ted's eating lunch at his desk or in the lunchroom, walk right in, grab a few of his fries, then spill your guts and expect him to respond immediately. Ted lives to serve, and he's always ready to think about fixing computers, especially yours.
- When Ted's at the water cooler or outside taking a breath of fresh air, find him and ask him a computer question. The only reason he takes breaks at all is to ferret out all those employees who don't have email or a telephone.
- Send urgent email ALL IN UPPERCASE. The mail server picks it up and flags it as a rush delivery.
- When the photocopier doesn't work, call Ted. There's electronics in it, so it should be right up his alley.
- When you're getting a NO DIAL TONE message at your home computer, call Ted. He enjoys fixing telephone problems from remote locations. Especially on weekends.
- When something goes wrong with your home PC, dump it on Ted's chair the next morning with no name, no phone number, and no description of the problem. Ted just loves a good mystery.
- When you have Ted on the phone walking you through changing a setting on your PC, read the newspaper. Ted doesn't actually mean for you to DO anything. He just loves to hear himself talk.
- When your company offers training on an upcoming OS upgrade, don't bother to sign up. Ted will be there to hold your hand when the time comes.
- When the printer won't print, re-send the job 20 times in rapid succession. That should do the trick.
- When the printer still won't print after 20 tries, send the job to all the printers in the office. One of them is bound to work.
- Don't use online help. Online help is for wimps.
- Don't read the operator's manual. Manuals are for wussies.
- If you're taking night classes in computer science, feel free to demonstrate your fledgling expertise by updating the network drivers for you and all your co-workers. Ted will be grateful for the overtime when he has to stay until 2:30am fixing all of them.
- When Ted's fixing your computer at a quarter past one, eat your Whopper with cheese in his face. He functions better when he's slightly dizzy from hunger.
- When Ted asks you whether you've installed any new software on your computer, LIE. It's no one else's business what you've got on your computer.
- If the mouse cable keeps knocking down the framed picture of your dog, lift the monitor and stuff the cable under it. Those skinny mouse cables were designed to have 55 lbs. of computer monitor crushing them.
- If the space bar on your keyboard doesn't work, blame Ted for not upgrading it sooner. Hell, it's not your fault there's a half pound of pizza crust crumbs, nail clippings, and big sticky drops of Mountain Dew under the keys.
- When you get the message saying "Are you sure?", click the "Yes" button as fast as you can. Hell, if you weren't sure, you wouldn't be doing it, would you?
- Feel perfectly free to say things like "I don't know nothing about that boneheaded computer crap." It never bothers Ted to hear his area of professional expertise referred to as boneheaded crap.
- Don't even think of breaking large print jobs down into smaller chunks. God forbid somebody else should sneak a one-page job in between your 500-page Word document.
- When you send that 500-page document to the printer, don't bother to check if the printer has enough paper. That's Ted's job.
- When Ted calls you 30 minutes later and tells you that the printer printed 24 pages of your 500-page document before it ran out of paper, and there are now nine other jobs in the queue behind yours, ask him why he didn't bother to add more paper.
- When you receive a 130 MB movie file, send it to everyone as a high-priority mail attachment. Ted's provided plenty of disk space and processor capacity on the new mail server for just those kinds of important things.
- When you bump into Ted in the grocery store on a Sunday afternoon, ask him computer questions. He works 24/7, and is always thinking about computers, even when he's at the supermarket buying toilet paper and doggie treats.
- If your son is a student in computer science, have him come in on the weekends and do his projects on your office computer. Ted will be there for you when your son's illegal copy of Visual Basic 6.0 makes the Access database keel over and die.
- When you bring Ted your own "no-name" brand PC to repair for free at the office, tell him how urgently he needs to fix it so you can get back to playing EverQuest. He'll get on it right away, because everyone knows he doesn't do anything all day except surf the Internet.
- Don't ever thank Ted. He loves fixing everything AND getting paid for it!
Rogers Looks For New Ways To Annoy Customers, Hijacks Failed DNS Lookups
http://techdirt.com/articles/20080720/1055151734.shtml
Rogers -- a Canadian telco -- has been attracting a lot of negative attention lately between deliberately disabling notifications for cellular roaming charges, setting ridiculous iPhone pricing plans and injecting its own content into Google's home page. As if that wasn't enough, Rogers has started hijacking failed DNS lookups. This means that when a user types in a web address that doesn't exist, instead of getting a "page not found" error, the user is redirected to a search page filled with banner ads and sponsored links. Michael Geist notes that there's an "opt-out" feature, but it doesn't take long to see that it's pretty pathetic. The "opt-out" sends a cookie which just redirects the user to a different Rogers page instead -- a fake "Internet Explorer" error page hosted on the same server. It does essentially the exact same thing, only pretending (poorly, for non-IE users) to revert back to expected behavior. And the option is reset whenever the browser's cookies are cleared. The comments on Geist's post are evidence that many Rogers customers are not pleased (myself included).
This isn't just annoying, it's also a security threat. It breaks how the internet was designed to work; a lot of software is written with the expectation that a DNS lookup for a non-existent domain name will return an error. For example, Kevin Dean notes in the comments on Geist's post how this has caused problems for him accessing his VPN. At first, he thought his computer had been compromised, since Rogers' new "feature" ends up resembling a hostile attempt to redirect traffic to an unknown server.
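Software that depends on failed lookups returning an error can at least check whether the resolver in use honours that expectation. The sketch below asks for a random name under the reserved `.invalid` top-level domain, which should never resolve; if an address comes back anyway, something between the program and the DNS is rewriting failures. The function name `nxdomain_is_hijacked` and the injectable `resolve` parameter are assumptions for illustration, and a particular ISP's redirection may not cover every kind of name.

```python
import socket
import uuid

def nxdomain_is_hijacked(resolve=socket.gethostbyname):
    """Return True if a name that cannot exist still resolves to an address."""
    # ".invalid" is a reserved top-level domain, so this name should never exist.
    probe = "probe-" + uuid.uuid4().hex + ".invalid"
    try:
        resolve(probe)
    except OSError:
        # Expected behaviour: the lookup fails (socket.gaierror is an OSError).
        return False
    # An address came back for a name that does not exist.
    return True

if __name__ == "__main__":
    print("lookups rewritten:", nxdomain_is_hijacked())
```

Passing the resolver in as a parameter keeps the check testable without a network connection, for example by substituting a function that always returns a fixed address.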
Some American ISPs already do this, such as Earthlink (which was used to demonstrate the security risk), though it seems to have a slightly better opt-out process, instructing users to configure alternate DNS servers instead of setting a browser cookie. VeriSign had originally tried to do something similar with SiteFinder back in 2003 (though not at the ISP level), but it didn't exactly go over too well. VeriSign reluctantly backed off, though it just recently obtained a patent on the concept. Rogers is the first Canadian ISP to implement the practice and it seems to think it won't meet much resistance. In another comment on Geist's post, Ian relates a telling quote from the FAQs page for Paxfire (the American company handling this for Rogers): "What feedback you do receive typically will come from a small group of highly technical users. Even that feedback tends to fall away after just a few weeks -- as they get used to the new behavior."
Rogers thinks it can just brush off complaints from its users, especially since there really isn't a lot of choice in the Canadian ISP market. However, Rogers should be careful in treading so brazenly into what some consider "net neutrality" territory. Bell Canada (one of Rogers' few competitors) has landed itself in front of a national regulatory body over its throttling practices. Rogers wants to have complete control over its network, but by continually pushing the line it only spurs on the debate about net neutrality and government regulation. We haven't heard the last of this.