Archive for May, 2013

You Can’t Reverse Engineer The Google Algorithm

As most SEOs are aware, Google just launched the latest Penguin update. Amid the mass panic that follows every algorithm update, several SEOs started discussing the Google algorithm and their theories about it. Don’t worry, I’m not going to tell you how to recover from it or anything like that. Instead, I want to focus on some of the discussion points I saw flying around the web. It’s become clear to me that an overwhelming majority of SEOs have very little computer science training or understanding of computer algorithms. I posted a rant a few weeks ago that briefly touched on this, but now that I can actually type (no more elbow cast!) I’d like to delve a bit deeper into some misconceptions about the Google algorithm.

It’s always been my belief that SEOs should know how to program and now I’d like to give a few examples about how programming knowledge shapes SEO thought processes. I’d also like to add the disclaimer that I don’t work at Google (although I was a quality rater many years ago) and I don’t actually know the Google algorithm. I do have a computer science background though, and still consider myself a pretty good programmer. I’ll also argue (as you will see in this post) that nobody really knows the Google algorithm – at least not in the sense you’re probably accustomed to thinking of.

SEOs who’ve been at it for a while remember the days of reverse engineering the algorithm. Back in the late 90s, it was still pretty easy. Search engines weren’t that complex and could easily be manipulated. Unfortunately, that’s no longer the case. We need to evolve our thinking beyond the typical static formula. There’s just no way the algorithm is as simple as a set of weights and variables.

You can’t reverse engineer a dynamic algorithm unless you have the same crawl data.

The algorithm isn’t static. As I mentioned in my rant, many theories in information retrieval involve dynamic factor weights based on the corpus of results. Quite simply, that means search results aren’t ranked on a flat scale; they’re ranked relative to the other sites that are relevant to that query. Example: if every site for a given query has the same two-word phrase in its title tag, then having that phrase in the title won’t contribute much to the ranking weights. For a different search, though, where only 20% of the results have that term in the title, it would be a heavy ranking factor.
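To make that concrete, here’s a toy sketch (my own illustration, not anything from Google’s actual code) of a factor weight that depends on the result set itself:

```python
# Hypothetical sketch: a factor's weight shrinks as more candidates in THIS
# query's result set share it, so the same signal can matter a lot for one
# query and almost nothing for another.
def factor_weight(pages_with_factor, total_pages_in_result_set):
    if total_pages_in_result_set == 0:
        return 0.0
    share = pages_with_factor / total_pages_in_result_set
    return 1.0 - share  # everyone has it -> 0; only a few do -> close to 1

print(factor_weight(100, 100))  # 0.0 -- every result has the phrase in its title
print(factor_weight(20, 100))   # 0.8 -- only 20% do, so it's a strong differentiator
```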

What we do know is that there are three main parts to a Google search: indexing (which happens before you search, so we won’t cover it here), result fetching, and result ranking. Result fetching is pretty simple at a high level: it goes through the index and looks for all documents that match your query. (There’s probably some vector-type stuff going on with multiple vectors for relevancy, authority, and whatnot, but that’s way out of scope here.) Then, once all the pages are returned, they’re ranked based on factors. When those factors are evaluated, they’re most likely evaluated based only on the corpus of sites returned.
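Here’s a tiny, purely hypothetical fetch-then-rank sketch (the index structure and scoring function are my own stand-ins, not Google’s):

```python
# Stage 1: fetch every indexed document that matches the query terms.
def fetch(index, query):
    terms = query.lower().split()
    return [doc for doc in index if all(t in doc["text"].lower() for t in terms)]

# Stage 2: rank only the fetched candidates; whatever weighting is used gets
# evaluated against that candidate set, not against the whole web.
def rank(candidates, query):
    def score(doc):
        return sum(1 for t in query.lower().split() if t in doc["title"].lower())
    return sorted(candidates, key=score, reverse=True)

index = [
    {"title": "Apple pie recipe", "text": "how to bake an apple pie"},
    {"title": "Fruit guide", "text": "apple pie and pear tart ideas"},
]
print([d["title"] for d in rank(fetch(index, "apple pie"), "apple pie")])
```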

I want to talk about T-trees and vector intersections and such; however, I’m going to use an analogy here instead. In my earlier rant I used the example of car shopping and how first you sort by class, then color, etc. – but if all the cars are red SUVs, you then sort by different factors.

Perhaps a better way is to think of applying ranking factors like we alphabetize words. Assume each letter in a word is a ranking factor. For example, in the word “apple” the “a” might be keyword in title tag, the “p” might be number of links, and the “e” might be something less important like page speed. (Remember when Cutts said “all else being equal, we’ll return the faster result”? That fits here.) Using this method, ranking some queries would be easy. We don’t need many factors to see that apple comes before avocado. But what about pear and pearl? In the apple/avocado example, the most significant (and important) ranking factor is the 2nd letter. In the pear example, though, the first four factors are less important than the “l” at the end of the word. Ranking factors are the same way: they change based on the set of sites being ranked! (And they get more complicated when you factor in location, personalization, etc. – but we’ll tackle all that in another post.)
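If you want to see the alphabetizing analogy in code, here’s a minimal sketch (my own illustration): Python compares tuples lexicographically, exactly like alphabetizing, so which “factor” decides the order depends on where the candidates first differ.

```python
# Think of each position as a ranking factor; comparison stops at the first
# position where the two "pages" differ.
page_a = tuple("apple")     # e.g. title match, links, ..., page speed
page_b = tuple("avocado")
print(sorted([page_b, page_a])[0])  # decided at position 2: 'p' < 'v'

page_c = tuple("pear")
page_d = tuple("pearl")
print(sorted([page_d, page_c])[0])  # first four "factors" tie; the trailing 'l' decides
```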

It’s not just dynamic, it’s constantly learning too!

For a few years now I’ve had the suspicion that Google is really just one large-scale neural network. When I read things like this and then see the features they just released for Google+ images, I know they’ve got large-scale neural nets mastered.

What’s a neural network? Well, you can go read about it on Wikipedia if you want, but quite simply a neural network is a different type of algorithm. It’s one where you give it the inputs and the desired outputs, and it uses some very sophisticated math to calculate the best and most reliable way to get from those inputs to those outputs. Once it does that, you can give it a larger set of inputs and it can use the same logic to expand the set of outputs. In my college artificial intelligence class (back in 2003) I used a rudimentary one to play simple games like Nim and even to ask smart questions to determine which type of sandwich you were eating. (I fed it a list of known ingredients and sandwich definitions, and it came up with the shortest batch of questions to ask to determine what you had. Pretty cool.) The point is that if I could code a basic neural net in Lisp on a Pentium 1 laptop 10 years ago, I’m pretty sure Google can use way more advanced types of learning algorithms to do way cooler things. Also, ranking link signals is WAY less complicated than finding faces and cats in photos.
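To show how little magic there is in the basic idea, here’s a bare-bones perceptron (the simplest ancestor of a neural network). The “link signal” features and labels here are completely made up for illustration:

```python
# Train on example (inputs -> desired output) pairs, then apply the learned
# rule to an input it has never seen.
def train(examples, epochs=20, lr=0.1):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), label in examples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = label - pred              # nudge weights toward the desired output
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

# Made-up training set: (natural_link_ratio, spammy_anchor_ratio) -> 1 = good page
examples = [((0.9, 0.1), 1), ((0.8, 0.2), 1), ((0.2, 0.9), 0), ((0.1, 0.8), 0)]
w, b = train(examples)
print(1 if w[0] * 0.85 + w[1] * 0.15 + b > 0 else 0)  # classifies an unseen page as "good"
```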

Anyway, when I think of Penguin and Panda and hear that they have to be run independently of the main search ranking algorithm, my gut instantly screams that these are neural nets or similar technology. Here’s some more evidence: from leaked documents we know that Google uses human quality raters and that some of their tasks involve rating documents as relevant, vital, useful, spam, etc. Many SEOs instantly thought, “OMG, actual humans are rating my site and hurting my rankings.” The clever SEOs, though, saw this as a perfect way to create a training set for a neural network-type algorithm.

By the way, there’s no brand bias either

Here’s another example. Some time ago @mattcutts said “We actually came up with a qualifier to say OK NYT or Wikipedia or IRS on this side, low quality sites over on this side.” Many SEOs took that to mean that Google has a brand bias. I don’t think that’s the case. I think what Matt was talking about here was using these brand sites as part of the algorithm training set for what an “authoritative” site is. They probably looked at what sites people were clicking on most or what quality raters chose as most vital and then fed them in as a training set.

There’s nothing in the algorithm that says “rank brands higher” (I mean, how does an algorithm know what a brand is? Wouldn’t it be very easy to fake?) – it’s most likely though that the types of signals that brand sites have were also the types of signals Google wants to reward. You’ve heard me say at countless conferences: “Google doesn’t prefer brands, people searching Google do.” That’s still true and that’s why brand sites make a good training set for authority signals. When people stop preferring brands over small sites, Google will most likely stop ranking them above smaller sites.

We need to change our thought process

We really need to stop reacting literally to everything Google tells us and start thinking about it critically. I keep thinking of Danny Sullivan’s epic rant about directories. When Matt said “get directory links,” he meant get links from sites people actually visit. Instead, we falsely took that as “Google has a flag that says this site is a directory and gives links on it more weight, so we need to create millions of directories.” We focused on the what, not the why.

We can use our knowledge of computer science here. It’s crucial. We need to stop thinking of the algorithm as a static formula and start thinking bigger. We need to stop trying to reverse engineer it and focus more on the intent and logic behind it. When Google announces they’re addressing a problem we should think about how we’d also solve that problem in a robust and scalable way. We shouldn’t concern ourselves so much with exactly what they’re doing to solve it but instead look at the why. That’s the only true way to stay ahead of the algorithm.

Ok, that’s a lot of technical stuff. What should I take away?

  1. You can’t reverse engineer the algorithm. Neither could most Googlers.
  2. The algorithm, ranking factors, and their importance change based on the query and the result set.
  3. The algorithm learns based on training data.
  4. There’s no coded-in “brand” variable.
  5. Human raters are probably (a) creating training sets and (b) evaluating result sets of the neural network-style algorithm.

18 comments May 23rd, 2013

Rant: SEO Tests, Cutts Statements, & The Algorithm


I’m going to channel my inner @alanbleiweiss and rant for a minute about some things I saw over the last few days in the SEO world. I also want to apologize in advance for any spelling mistakes, as my right arm is in a cast and I’m typing this entirely left-handed until I can find an intern. (If you’re curious how I broke my arm, it was with a softball; there’s a video here.)

There’s been lots of SEO chatter lately about a recent SEL post called “More Proof Google Counts Press Release Links,” and I want to address a couple of issues that came up both in that thread and on Twitter.

First point: what works for one small made-up keyword may not scale or be indicative of search as a whole. Scientists see this in the real world when they notice that Newton’s laws don’t really work at the subatomic level. In SEO algorithms, we have the same phenomenon – and it’s covered in depth by many computer science classes. (Note: I have a computer science degree and used to be a software engineer, but I haven’t studied much in the information retrieval field. There are more in-depth and sophisticated techniques than the examples I’m about to provide.)

A long time ago the Google algorithm was probably just a couple of orders of magnitude more complex than an SQL statement like “SELECT * FROM sites WHERE content LIKE '%term%' ORDER BY pagerank DESC.”

It’s not that simple anymore. Most people think of the algorithm as a static equation – something like PageRank + KeywordInTitle – ExactMatchDomain – Penguin – Panda + LinkDiversity – LoadTime. I’m pretty sure it’s not.

When I think of the Google Algorithm, (especially with things like Panda and Penguin) I instantly think of a neural network where the algorithm is fed a training set of data and it builds connections to constantly learn and improve what good results are. I’ll refrain from talking more about neural nets because that’s not my main point.

I also want to talk about the branch of computer science called information retrieval. Most of the basic theories in IR (on which the more complicated ones are built) talk about dynamic weighting based on the corpus. (Corpus being Latin for “body,” and referring here to all of the sites that Google could possibly return for a query.)

Here’s an example that talks about one such theory (which uses everybody’s favorite @mattcutts over-reaction from two years ago: inverse document frequency).

Basically, what this says is that if every document in the result set has the same term on it, that term becomes less important. That makes sense. The real lesson here, though, is that the weighting of terms is dynamic based on the result set. If term weights can be dynamic for each result set, why can’t anchor text, links, page speed, social signals, or whatever other crazy thing is correlated to rankings? They can be!
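Here’s the textbook version of that idea in a few lines of Python (a standard IDF formula, not anything specific to Google):

```python
import math

# A term that appears in every candidate document gets a low weight;
# a term that appears in only a few gets a high one.
def idf(term, documents):
    containing = sum(1 for doc in documents if term in doc.lower().split())
    return math.log(len(documents) / (1 + containing))

docs = [
    "cheap red widgets for sale",
    "red widgets review and pricing",
    "buy red widgets online",
]
print(idf("red", docs))      # appears everywhere -> weight at or below zero
print(idf("review", docs))   # appears once -> noticeably higher weight
```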

So let’s look at the made up keyword example. In the case of a made up term, the corpus is very very small. In the SEL example, it’s also very very small.

Now, in this instance, what should Google do? It has pages that contain that word, but they don’t have any of the traditionally heavily weighted ranking signals. Rather than return no results, the ranking factor weights are changed and the page is returned. That one link actually helps when there are no other factors to consider. Get it?

Think of it as kind of a breadth-first search for ranking factors. Given a tree of all the factors Google knows about, it first looks at the main ones. If they aren’t present, it goes further down the tree to the less important ones and keeps traversing until it finds something it can use to sort the documents.
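A toy version of that traversal might look like this (factor names and priorities are invented for illustration; the real tree is obviously far bigger):

```python
# Walk the factors in priority order and sort on the first one that actually
# varies across this result set; if the big signals all tie, a normally-minor
# signal ends up deciding.
FACTORS = ["links", "title_match", "content_depth", "page_speed"]

def deciding_factor(candidates):
    for factor in FACTORS:
        if len({page[factor] for page in candidates}) > 1:  # this factor can tell them apart
            return factor
    return FACTORS[-1]  # everything ties; fall back to the least important factor

pages = [
    {"url": "a.com", "links": 5, "title_match": 1, "content_depth": 3, "page_speed": 0.9},
    {"url": "b.com", "links": 5, "title_match": 1, "content_depth": 3, "page_speed": 0.4},
]
factor = deciding_factor(pages)
print(factor, sorted(pages, key=lambda p: p[factor], reverse=True)[0]["url"])  # page_speed a.com
```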

It’s like choosing a car. First you decide SUV or car, then brand, then manual or automatic, then maybe the color, and finally it’s down to the interface of the radio. But what if the entire car lot only had red automatic SUVs? That radio interface would be a LOT more important now, wouldn’t it? Google is doing the same thing.

OK, point number 2. Still with me?

We need to stop analyzing every word @mattcutts says like it’s some lost scripture and start paying attention to the meaning of what he says. In this example, Matt was right. Press releases aren’t helping your site – because your site is probably going after keywords that exist on other sites, and since there are other sites, the press release link factor is so far down the tree of factors that it’s probably not being used.

Remember when Matt said that page speed was an “all else being equal, we’ll return the faster site” type of factor? That fits perfectly with the tree and dynamic weights I just talked about.

Instead of looking at the big picture, the meaning, and the reasoning behind what Matt says, we get too caught up on the literal definitions. It’s the equivalent of thinking David and Goliath is a story about how there are giants in the world rather than a story about how man’s use of technology helps him overcome challenges and sets him apart from beasts. We keep taking the wrong message because we’re too literal.

That’s all I want to say. Feel free to leave feedback in the comments.

15 comments May 7th, 2013


