Data exploration from: An experiment in trying to predict Google rankings Blog

Data exploration from: An experiment in trying to predict Google rankings

by 7wData
August 8, 2016

Over the last few months, we have been working with a company named Statec (a data science company from Brazil) to engineer features for predictive algorithms. One of the initial considerations in working with predictive algorithms is picking relevant data to train them on.

We set out quite naively to put together a list of webpage features that we thought may offer some value. Our goal was simply to see if from available features, we could get close to predicting the rank of a webpage in Google. We learned soon into this process that we had to put blinders on to data that was unreachable and hope for the best with what we had.

The following is an analysis of the data we collected, how we collected it and useful correlations derived from the data.

One initial problem was that we needed to gain access to ranking data for enough search engine results page (SERP) results to provide a useful training set. Luckily, GetStat made this very easy. With GetStat, we simply loaded up keyword combinations across the top 25 service industries with the location of the top 200 cities (by size) in the US. This resulted in 5,000 unique search terms (e.g., “Charlotte Accountant” taken from Charlotte, NC).

My company, Consultwebs, is focused on legal marketing, but we wanted the model to be more universal. After loading up the 5,000 terms and waiting a day, we then had roughly 500,000 search results we could use to construct our data set.

After finding this so easy, I collected the rest of the data. I had built several crawlers with Node.js, so I decided to build a feature extraction mechanism on top of pre-existing work. Luckily, Node.js is an excellent ecosystem for this type of job. Below I list several libraries that make Node wonderful for data collection:

The crawling process was very slow, due mainly to hit limits by API providers and our proxy service. We would have created a cluster, but the expense limited us to hitting a couple of the APIs about once per second.

Slowly, we gained a complete crawl of the full 500,000 URLs. Below are a few notes on my experience with crawling URLs for data collection:

We have reported our results from the ranking predictions in a separate post, but I wanted to review some of the interesting insights in the data collected.

For this data, we reduced the entire data set to only include rankings in the top 20 and also removed the top four percent of observations based on referring domains. The goal in removing the top four percent of referring domains was to keep URLs such as Google, Yelp and other large websites from having undue influence on the averages. Since we were focusing on service industry results, we wanted to make sure that local business websites would likely be compared, and not major directories.

In the chart below, we assume that the web designer category is the largest due to the practice of footer links from website work. The second two highest are no surprise to those of us who work in the legal niche.

Again we filtered to the top 20 ranking results across all observations and also removed the top four percent of observations based on referring domains to remove URLs from Google, Yelp and other large websites. Feel free to use this in proposals when qualifying needs for clients in particular cities.

The top results here are no surprise to those of us who have had clients in these cities. New York, in particular, is a daunting task for many niches.

For this data, we kept the full rank data at 100 results per search term, but we removed observations with referring domains over the top four percent threshold and over 5,000 Facebook shares. This was a minimal reduction to the overall size, yet it made the data plot much cleaner.

The plot reminds me of when I go out to the shooting range, in that there is really no order to shots. The Pearson correlation of average shares to rank is 0.

Do You Want to Share Your Story?

Bring your insights on Data, Visualization, Innovation or Business Agility to our community. Let them learn from your experience.

Data exploration from: An experiment in trying to predict Google rankings

Leave a Reply Cancel reply

Upcoming Events

The Role of Taxonomy and Ontology in Semantic Layers

Evolving Your Data Architecture for Trustworthy Generative AI

World Wide Data Vault Consortium 2024

Shift Difficult Problems Left with Graph Analysis on Streaming Data

Categories

Tags

You Might Be Interested In

How will smart security and the IoT transform our homes?

Including ModelOps in your AI strategy

Continuous Delivery for platform owners

Recent Jobs

Associate Director for Impact and Analytics

Data Scientist: Support NYS Attorney General Investigations

Judiciary Research Manager (Court Executive 2B)

Cyber Security Engineer – P2

Do You Want to Share Your Story?

Join our community

Our Services

Company

Work With Us

Follow Us

Get the 3 STEPS

To Drive Analytics Adoption
And manage change

Get Access to Event Discounts

Switch your 7wData account from Subscriber to Event Discount Member by clicking the button below and get access to event discounts. Learn & Grow together with us in a more profitable way!

Get Access to Event Discounts

Create a 7wData account and get access to event discounts. Learn & Grow together with us in a more profitable way!

Don't miss Out!

Stay in touch and receive in depth articles, guides, news & commentary of all things data.

Data exploration from: An experiment in trying to predict Google rankings

Leave a Reply Cancel reply

Upcoming Events

Categories

Tags

You Might Be Interested In

Recent Jobs

Do You Want to Share Your Story?

Join our community

Our Services

Company

Work With Us

Follow Us

Get the 3 STEPS

To Drive Analytics Adoption And manage change

Get Access to Event Discounts

Switch your 7wData account from Subscriber to Event Discount Member by clicking the button below and get access to event discounts. Learn & Grow together with us in a more profitable way!

Get Access to Event Discounts

Create a 7wData account and get access to event discounts. Learn & Grow together with us in a more profitable way!

Don't miss Out!

Stay in touch and receive in depth articles, guides, news & commentary of all things data.

To Drive Analytics Adoption
And manage change