I Wanna Index You Up: Google Exposed

November 6th, 2014 by Search Influence Alumni

Google is a great keeper of secrets. They’re willing to share parts of the demographic data captured from their users with potential advertisers. But they keep their own data locked up pretty tight, or at least tight enough to keep away anyone without a neurosis for data.

About a month ago, I was trying to get an idea of URLs that existed on a client’s website. Due to their unfortunately dated decision to integrate a Flash-based navigation on their front page, I could not use a sitemap generator — software that crawls a site and outputs all URLs — to gather up a URL list.

Because I’m pretty lazy, or some could argue resourceful, I decided to look into what Google indexed for that site. However, you can’t exactly text Google up to request the URLs be sent over.


It’s Up To You, My Dude

Although Google Webmaster Tools provides you with a lot of info about your site, including how many URLs are indexed, it does not tell you which URLs are indexed. So I did a little bit of digging, and I found this “bookmarklet” that easily captures the results that appear on a given search engine results page (SERP) and lists them in an easy-to-import-into-Excel format.


To see a list of URLs (or at least a partial list) Google has indexed for a specific domain name, query Google for “site:sitename.com.”
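If you would rather build that query as a link than type it, here is a minimal Python sketch. The function name site_query_url is my own invention for illustration, and it only constructs the search URL; it does not fetch or scrape anything.

```python
from urllib.parse import urlencode

def site_query_url(domain: str) -> str:
    """Build a Google search URL that uses the site: operator (hypothetical helper)."""
    # urlencode percent-encodes the colon in "site:domain" for use in a query string.
    return "https://www.google.com/search?" + urlencode({"q": f"site:{domain}"})

print(site_query_url("shrimp.com"))
# https://www.google.com/search?q=site%3Ashrimp.com
```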

Here’s a reenactment of what I saw when searching for the indexed URLs:


Once you have that bookmarklet added to your bookmarks bar (just drag and drop, as the instructions say), clicking it gives results like this in a new tab:


By visually filtering out the worthless stuff (JavaScript, a link to YouTube, and blank rows), I’m left with this tidy list of the first page of results. Sadly, we find that Google has indexed only 9 pages of shrimp.com, inclusive of its subdomains:
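The same cleanup can be scripted if eyeballing gets old. This is a rough sketch, not what the bookmarklet does: keep_site_urls is a hypothetical helper, and the sample rows (including the blog subdomain) are made up to mimic the kind of junk described above.

```python
from urllib.parse import urlparse

def keep_site_urls(raw_links, domain):
    """Return unique http(s) links on `domain` or its subdomains, in original order."""
    kept, seen = [], set()
    for link in raw_links:
        link = link.strip()
        if not link:
            continue  # blank row
        parsed = urlparse(link)
        if parsed.scheme not in ("http", "https"):
            continue  # e.g. javascript: links
        host = (parsed.hostname or "").lower()
        if host != domain and not host.endswith("." + domain):
            continue  # off-site links such as YouTube
        if link not in seen:
            seen.add(link)
            kept.append(link)
    return kept

# Made-up sample rows, standing in for what gets pasted out of the new tab.
sample = [
    "javascript:void(0)",
    "",
    "https://www.youtube.com/",
    "http://www.shrimp.com/",
    "http://blog.shrimp.com/recipes",
    "http://www.shrimp.com/",
]
print(keep_site_urls(sample, "shrimp.com"))
# ['http://www.shrimp.com/', 'http://blog.shrimp.com/recipes']
```

Matching on the hostname rather than searching the whole string avoids keeping an off-site link that merely mentions the domain somewhere in its path.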

For the purpose of finding the indexed URLs for my client, I just kept clicking “Next Page” and running this until I couldn’t get any more results.

And with that, I had a pretty good idea of the URLs that were actually indexed by Google. Luckily, this was a reasonable number of pages to parse through, but I could imagine this being a particularly tedious process for larger sites.

Although you may want to take the results of these queries with a grain of salt, given the presence of Google’s filter bubble, the method’s accessibility to non-technical users makes it a helpful tool.

In my experience at Search Influence, we’ve had a few clients with indexing issues related to pages that are actually discoverable by search. We knew how many URLs were indexed, as well as how many we expected to be indexed, but we did not know which URLs were missing from Google’s index. By cross-referencing a sitemap of all discoverable URLs against the results of this manual URL scraping of Google’s index, we can get a clue as to what the heck’s wrong, and start troubleshooting with more focus.
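That cross-reference boils down to a set difference. Here is a minimal sketch, assuming you already have both lists as plain URL strings; normalize and missing_from_index are hypothetical names, the example.com URLs are placeholders, and ignoring a trailing slash is a simplification you may not want on every site.

```python
def normalize(url):
    """Trim whitespace and a trailing slash so near-identical URLs compare equal."""
    return url.strip().rstrip("/")

def missing_from_index(sitemap_urls, indexed_urls):
    """Return sitemap URLs that do not appear in the scraped list of indexed URLs."""
    indexed = {normalize(u) for u in indexed_urls}
    return sorted(u for u in sitemap_urls if normalize(u) not in indexed)

# Placeholder data: the sitemap side versus what the manual scraping turned up.
sitemap = ["http://example.com/a", "http://example.com/b/", "http://example.com/c"]
indexed = ["http://example.com/b", "http://example.com/a/"]
print(missing_from_index(sitemap, indexed))
# ['http://example.com/c']
```

Whatever lands in that output is the short list of pages worth troubleshooting first.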