Scrapebox is an extremely flexible tool for webmasters, and learning it well is highly recommended if you are getting into web scraping or link building. It really is the Swiss Army knife of SEO tools: if you can find a way to search for what you want in Google, you can generally make it happen in Scrapebox (SB) and automate the task. Without the proper footprints and filters, though, the tool can seem like a waste of time.

I know it took me a while to fully understand what was going on with it. You need to learn how to refine the results and find the common patterns in them. I have noticed it’s kind of hard to find a bunch of example footprints in one place, so I decided to collect a few that I’ve been using lately. I am hoping this will be more of a learning lesson for you all than something to copy, paste, and use. So play around with it, and see what unique footprints you can create.

Most of the lists below are meant to be merged with a list of keywords you have loaded into Scrapebox. Copy and paste the footprints into a .txt file, scrape your keyword list, and hit the M button to import the footprint list.

Note: Merging a large list of keywords with a bigger footprint file can result in millions of keywords.  
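To see why the numbers grow so quickly, here is a minimal Python sketch of what a merge amounts to. It assumes the merge is a plain cross product in which every footprint has its %kw% token replaced by every keyword. The function name and sample keywords are made up for illustration and are not part of Scrapebox.

```python
def merge_footprints(keywords, footprints, token="%kw%"):
    """Substitute every keyword into every footprint (a full cross product)."""
    merged = []
    for footprint in footprints:
        for keyword in keywords:
            merged.append(footprint.replace(token, keyword))
    return merged


# Hypothetical sample data
keywords = ["dog training", "puppy food"]
footprints = [
    "%kw% inurl:wordpress.com",
    "%kw% inurl:weebly.com",
    "%kw% inurl:blogspot.com",
]

queries = merge_footprints(keywords, footprints)
print(len(queries))  # 2 keywords x 3 footprints = 6 queries
print(queries[0])    # dog training inurl:wordpress.com
```

Under that assumption the output size is simply keywords times footprints, so 10,000 keywords merged with a 500-line footprint file would already give 5,000,000 queries.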


Searching for keywords within HTML of Indexed Web 2.0 sites

%kw% inurl:wordpress.com
%kw% inurl:21publish.com
%kw% inurl:bloglines.com
%kw% inurl:wikidot.com
%kw% inurl:43things.com
%kw% inurl:weebly.com
%kw% inurl:zoho.com
%kw% inurl:buzzle.com
%kw% inurl:typepad.com
%kw% inurl:angelfire.lycos.com
%kw% inurl:diaryland.com
%kw% inurl:blogger.com
%kw% inurl:webnode.com
%kw% inurl:onsugar.com
%kw% inurl:livejournal.com
%kw% inurl:webstarts.com
%kw% inurl:squidoo.com
%kw% inurl:posterous.com
%kw% inurl:webspawner.com
%kw% inurl:tripod.lycos.com
%kw% inurl:blogspot.com
%kw% inurl:angelfire.com
%kw% inurl:wikispaces.com
%kw% inurl:bravenet.com
%kw% inurl:blog.com
%kw% inurl:xanga.com
%kw% inurl:hubpages.com
%kw% inurl:yola.com
%kw% inurl:blogsome.com
%kw% inurl:jimdo.com
%kw% inurl:webs.com
%kw% inurl:sosblog.com


Searching for keywords within URL of Indexed Web 2.0 sites

inurl:”%kw%” + inurl:wordpress.com
inurl:”%kw%” + inurl:21publish.com
inurl:”%kw%” + inurl:bloglines.com
inurl:”%kw%” + inurl:wikidot.com
inurl:”%kw%” + inurl:43things.com
inurl:”%kw%” + inurl:weebly.com
inurl:”%kw%” + inurl:zoho.com
inurl:”%kw%” + inurl:buzzle.com
inurl:”%kw%” + inurl:typepad.com
inurl:”%kw%” + inurl:angelfire.lycos.com
inurl:”%kw%” + inurl:diaryland.com
inurl:”%kw%” + inurl:blogger.com
inurl:”%kw%” + inurl:webnode.com
inurl:”%kw%” + inurl:onsugar.com
inurl:”%kw%” + inurl:livejournal.com
inurl:”%kw%” + inurl:webstarts.com
inurl:”%kw%” + inurl:squidoo.com
inurl:”%kw%” + inurl:posterous.com
inurl:”%kw%” + inurl:webspawner.com
inurl:”%kw%” + inurl:tripod.lycos.com
inurl:”%kw%” + inurl:blogspot.com
inurl:”%kw%” + inurl:angelfire.com
inurl:”%kw%” + inurl:wikispaces.com
inurl:”%kw%” + inurl:bravenet.com
inurl:”%kw%” + inurl:blog.com
inurl:”%kw%” + inurl:xanga.com
inurl:”%kw%” + inurl:hubpages.com
inurl:”%kw%” + inurl:yola.com
inurl:”%kw%” + inurl:blogsome.com
inurl:”%kw%” + inurl:jimdo.com
inurl:”%kw%” + inurl:webs.com
inurl:”%kw%” + inurl:sosblog.com


Searching for expired Tumblr accounts on high PR domains

Updated: 2/26/17

site:%kw% tumblr q
site:%kw% tumblr w
site:%kw% tumblr e
site:%kw% tumblr r
site:%kw% tumblr t
site:%kw% tumblr y
site:%kw% tumblr u
site:%kw% tumblr i
site:%kw% tumblr o
site:%kw% tumblr p
site:%kw% tumblr a
site:%kw% tumblr s
site:%kw% tumblr d
site:%kw% tumblr f
site:%kw% tumblr g
site:%kw% tumblr h
site:%kw% tumblr j
site:%kw% tumblr k
site:%kw% tumblr l
site:%kw% tumblr z
site:%kw% tumblr x
site:%kw% tumblr c
site:%kw% tumblr v
site:%kw% tumblr b
site:%kw% tumblr n
site:%kw% tumblr m
site:%kw% tumblr 1
site:%kw% tumblr 2
site:%kw% tumblr 3
site:%kw% tumblr 4
site:%kw% tumblr 5
site:%kw% tumblr 6
site:%kw% tumblr 7
site:%kw% tumblr 8
site:%kw% tumblr 9
site:%kw% tumblr 0
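The list above follows a single pattern: one fixed footprint followed by each letter and digit in turn. A short Python sketch like the one below can generate such a file instead of typing it by hand. The helper name is hypothetical, and the output is in alphabetical order rather than the keyboard order used above.

```python
import string


def suffix_footprints(prefix, suffixes=string.ascii_lowercase + string.digits):
    """Return one footprint per suffix character: '<prefix> a', '<prefix> b', ..."""
    return [f"{prefix} {ch}" for ch in suffixes]


tumblr = suffix_footprints("site:%kw% tumblr")
print(len(tumblr))  # 26 letters + 10 digits = 36 lines
print(tumblr[0])    # site:%kw% tumblr a

# Join the lines with newlines if you want to save them as a .txt footprint file
footprint_file_text = "\n".join(tumblr)
```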

Blog Comment footprints scraped from an Auto Approve List

“Please enter your name” +”%kw%”
“Your email is never published nor shared” +”%kw%”
“will not be published required” +”%kw%”
“Please enter uppercase letters” +”%kw%”
“Free case consultation Name Email” +”%kw%”
“Which Service Are You Interested In” +”%kw%”
“I have read and agree term of service” +”%kw%”
“Yes add me to your mailing list” +”%kw%”
“Please include your Forename” +”%kw%”
“Mail will not be published” +”%kw%”
“Your email address will not be published Name” +”%kw%”
“Please leave this field empty” +”%kw%”
“Notify me of followup comments via e mail” +”%kw%”
“Name Email Comment are Required” +”%kw%”
“E posta will not be published” +”%kw%”
“Type Or Paste Password Here” +”%kw%”
“Notify me of follow up comments via e mail” +”%kw%”
“Click here to cancel reply” +”%kw%”
“E mail never displayed” +”%kw%”
“Your email address will not be published Author” +”%kw%”
“Name not required for anonymous comments” +”%kw%”
“Would you like to receive our newsletters” +”%kw%”
“Please correct the following and resubmit thanks!” +”%kw%”
“Required fields are marked CommentNavn” +”%kw%”
“Keep up to date with our latest offers Yes No” +”%kw%”
“Comments You may use HTML tags for style” +”%kw%”
“Error! Please validate your fields” +”%kw%”
“International Search Engine Optimization” +”%kw%”
“How May We Help You” +”%kw%”
“required to prevent spam” +”%kw%”
“Address to receive” +”%kw%”
“Your data will be safe!” +”%kw%”
“you MUST enable javascript to be able to comment” +”%kw%”
“Your Phone number required” +”%kw%”
“Free case evaluationName Email” +”%kw%”
“We welcome any feedback questions or comments” +”%kw%”
“Mail is not sent” +”%kw%”
“Time limit is exhausted Please reload the CAPTCHA” +”%kw%”
“Leave This Field Empty” +”%kw%”
“required will not be published” +”%kw%”
“Votre Email ne sera pas publié” +”%kw%”
“this will not be shared” +”%kw%”
“Thank you for leaving a message” +”%kw%”
“You can use these HTML tags” +”%kw%”
“Additional information and cargo details” +”%kw%”
“Fields marked with an asterisk are required” +”%kw%”
“Select an image for your comment GIF PNG JPG JPEG” +”%kw%”
“Your email is never shared Name” +”%kw%”
“Have Questions or Need Help” +”%kw%”


Search for High PR Forums

Credit: endgeek

“Powered by PunBB”
“Powered By MyBB”
“Powered by WowBB”
“Powered by XMB”
“Powered by FluxBB”
“Powered by SMF”
“Powered by Simple Machines”
“Forum Software: Burning Board”
“Forensoftware: Burning Board”
“Powered by myUPB”
“Powered by Quicksilver Forums”
“Based on MercuryBoard”
“2001..2017 Snitz Communications”
“2001..2017 Web Wiz Ltd”
“Running MegaBBS ASP Forum Software”
“Powered By IP.Board”
“Powered by Invision Power Board”
“Powered by vBulletin”
“Powered by phpBB”
“Powered by IceBB”
“Powered by bbPress”
“Powered by E-Blah Forum Software”
“Powered by FUDforum”
“Powered by YAF”
“Powered by Forum Software miniBB”
“Powered by YaBB”
“Powered By ExpressionEngine”
“Powered by SEO-Board”
“Powered by UBB.threads”
“Powered by UseBB Forum Software”
“Powered by XennoBB”
“Powered by JavaBB”
“Powered by Viscacha”


Expired Domain Search of sites with Indexed High PR inbound links.

(Use the Link Extractor addon, with URL-related keywords as filters, to expand this list with more internal links. Then scrape that list for all of its outbound links, trim them to root, and check whether the domains are registered.)

site:buzzfeed.com inurl:%kw%
site:huffingtonpost.com inurl:%kw%
site:foxnews.com inurl:%kw%
site:nytimes.com inurl:%kw%
site:usmagazine.com inurl:%kw%
site:vice.com inurl:%kw%
site:thoughtcatalog.com inurl:%kw%
site:nbcnews.com inurl:%kw%
site:latimes.com inurl:%kw%
site:washingtonexaminer.com inurl:%kw%
site:medicalnewstoday.com inurl:%kw%
site:theweek.com inurl:%kw%
site:thehollywoodgossip.com inurl:%kw%
site:theguardian.com inurl:%kw%
site:oregonlive.com inurl:%kw%
site:businessinsider.com inurl:%kw%
site:hollywoodreporter.com inurl:%kw%
site:chicagotribune.com inurl:%kw%
site:wikipedia.org inurl:%kw%
site:diply.com inurl:%kw%
site:forbes.com inurl:%kw%
site:westernjournalism.com inurl:%kw%
site:medium.com inurl:%kw%
site:bloomberg.com inurl:%kw%
site:nypost.com inurl:%kw%
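The "trim to root" and de-duplication part of the workflow described at the top of this section can be sketched in Python as follows. This is only an approximation of what Scrapebox does: it keeps the host name (subdomains included) and drops a leading www. The function names and sample URLs are made up, and the registration check is left out because it needs a WHOIS or DNS lookup.

```python
from urllib.parse import urlparse


def trim_to_root(url):
    """Return the host of a URL without a leading 'www.' (empty string if none)."""
    # urlparse only fills in netloc when a scheme or a leading '//' is present
    host = urlparse(url if "://" in url else "//" + url).netloc.lower()
    host = host.split("@")[-1].split(":")[0]  # drop credentials and port
    return host[4:] if host.startswith("www.") else host


def unique_roots(urls):
    """Trim every URL to its root and keep the first occurrence of each host."""
    seen = set()
    roots = []
    for url in urls:
        root = trim_to_root(url)
        if root and root not in seen:
            seen.add(root)
            roots.append(root)
    return roots


# Hypothetical outbound links
outbound = [
    "http://www.example.com/page?id=1",
    "https://example.com/other",
    "http://blog.example.org:8080/post",
]
print(unique_roots(outbound))  # ['example.com', 'blog.example.org']
```

Note that blog.example.org stays as it is. Reducing a host to its registrable domain would need the Public Suffix List, which this sketch does not use.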


Find another domain’s indexed backlinks

The keyword format for a domain name, before merging this file, is domain.com. Do not include http:// or www. in the keyword list; the footprints below add those variations for you.

“%kw%” -site:%kw% q
“%kw%” -site:%kw% w
“%kw%” -site:%kw% e
“%kw%” -site:%kw% r
“%kw%” -site:%kw% t
“%kw%” -site:%kw% y
“%kw%” -site:%kw% u
“%kw%” -site:%kw% i
“%kw%” -site:%kw% o
“%kw%” -site:%kw% p
“%kw%” -site:%kw% a
“%kw%” -site:%kw% s
“%kw%” -site:%kw% d
“%kw%” -site:%kw% f
“%kw%” -site:%kw% g
“%kw%” -site:%kw% h
“%kw%” -site:%kw% j
“%kw%” -site:%kw% k
“%kw%” -site:%kw% l
“%kw%” -site:%kw% z
“%kw%” -site:%kw% x
“%kw%” -site:%kw% c
“%kw%” -site:%kw% v
“%kw%” -site:%kw% b
“%kw%” -site:%kw% n
“%kw%” -site:%kw% m
“%kw%” -site:%kw% 1
“%kw%” -site:%kw% 2
“%kw%” -site:%kw% 3
“%kw%” -site:%kw% 4
“%kw%” -site:%kw% 5
“%kw%” -site:%kw% 6
“%kw%” -site:%kw% 7
“%kw%” -site:%kw% 8
“%kw%” -site:%kw% 9
“%kw%” -site:%kw% 0
“//www.%kw%” -site:%kw% q
“//www.%kw%” -site:%kw% w
“//www.%kw%” -site:%kw% e
“//www.%kw%” -site:%kw% r
“//www.%kw%” -site:%kw% t
“//www.%kw%” -site:%kw% y
“//www.%kw%” -site:%kw% u
“//www.%kw%” -site:%kw% i
“//www.%kw%” -site:%kw% o
“//www.%kw%” -site:%kw% p
“//www.%kw%” -site:%kw% a
“//www.%kw%” -site:%kw% s
“//www.%kw%” -site:%kw% d
“//www.%kw%” -site:%kw% f
“//www.%kw%” -site:%kw% g
“//www.%kw%” -site:%kw% h
“//www.%kw%” -site:%kw% j
“//www.%kw%” -site:%kw% k
“//www.%kw%” -site:%kw% l
“//www.%kw%” -site:%kw% z
“//www.%kw%” -site:%kw% x
“//www.%kw%” -site:%kw% c
“//www.%kw%” -site:%kw% v
“//www.%kw%” -site:%kw% b
“//www.%kw%” -site:%kw% n
“//www.%kw%” -site:%kw% m
“//www.%kw%” -site:%kw% 1
“//www.%kw%” -site:%kw% 2
“//www.%kw%” -site:%kw% 3
“//www.%kw%” -site:%kw% 4
“//www.%kw%” -site:%kw% 5
“//www.%kw%” -site:%kw% 6
“//www.%kw%” -site:%kw% 7
“//www.%kw%” -site:%kw% 8
“//www.%kw%” -site:%kw% 9
“//www.%kw%” -site:%kw% 0
“www.%kw%” -site:%kw% q
“www.%kw%” -site:%kw% w
“www.%kw%” -site:%kw% e
“www.%kw%” -site:%kw% r
“www.%kw%” -site:%kw% t
“www.%kw%” -site:%kw% y
“www.%kw%” -site:%kw% u
“www.%kw%” -site:%kw% i
“www.%kw%” -site:%kw% o
“www.%kw%” -site:%kw% p
“www.%kw%” -site:%kw% a
“www.%kw%” -site:%kw% s
“www.%kw%” -site:%kw% d
“www.%kw%” -site:%kw% f
“www.%kw%” -site:%kw% g
“www.%kw%” -site:%kw% h
“www.%kw%” -site:%kw% j
“www.%kw%” -site:%kw% k
“www.%kw%” -site:%kw% l
“www.%kw%” -site:%kw% z
“www.%kw%” -site:%kw% x
“www.%kw%” -site:%kw% c
“www.%kw%” -site:%kw% v
“www.%kw%” -site:%kw% b
“www.%kw%” -site:%kw% n
“www.%kw%” -site:%kw% m
“www.%kw%” -site:%kw% 1
“www.%kw%” -site:%kw% 2
“www.%kw%” -site:%kw% 3
“www.%kw%” -site:%kw% 4
“www.%kw%” -site:%kw% 5
“www.%kw%” -site:%kw% 6
“www.%kw%” -site:%kw% 7
“www.%kw%” -site:%kw% 8
“www.%kw%” -site:%kw% 9
“www.%kw%” -site:%kw% 0
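As a rough Python sketch, the two steps in this section (cleaning the domain keywords, then building the footprint file) could look like this. The function names are hypothetical. The sketch writes plain straight quotes, and it emits the suffixes in alphabetical order rather than the keyboard order used above.

```python
import re
import string


def clean_domain(keyword):
    """Strip the scheme, a leading www. and any path so only domain.com is left."""
    keyword = keyword.strip().lower()
    keyword = re.sub(r"^[a-z][a-z0-9+.-]*://", "", keyword)  # http://, https://, ...
    keyword = re.sub(r"^www\.", "", keyword)
    return keyword.split("/")[0]


def backlink_footprints(token="%kw%"):
    """Build the three prefix variations, each with every letter and digit suffix."""
    prefixes = ["", "//www.", "www."]
    suffixes = string.ascii_lowercase + string.digits
    return [f'"{p}{token}" -site:{token} {s}' for p in prefixes for s in suffixes]


print(clean_domain("HTTP://www.Example.com/page"))  # example.com

footprints = backlink_footprints()
print(len(footprints))  # 3 prefixes x 36 suffixes = 108 lines
print(footprints[0])    # "%kw%" -site:%kw% a
```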


Searching for Guest Posting Opportunities 

“%kw%” + guest blogger wanted
“%kw%” + guest writer
“%kw%” + guest blog post writer
“%kw%” + “write for us” OR “write for me”
“%kw%” + “Submit a blog post”
“%kw%” + “Become a contributor”
“%kw%” + “guest blogger”
“%kw%” + “Add blog post”
“%kw%” + “guest post”
“%kw%” + “Write for us”
“%kw%” + submit blog post
“%kw%” + “guest column”
“%kw%” + “contributing author”
“%kw%” + “Submit post”
“%kw%” + “submit one guest post”
“%kw%” + “write for us”
“%kw%” + “Suggest a guest post”
“%kw%” + “Send a guest post”
“%kw%” + “contributing writer”
“%kw%” + “Submit blog post”
“%kw%” + inurl:contributors
“%kw%” + “guest article” OR “guest post”
“%kw%” + add blog post
“%kw%” + “submit a guest post”
“%kw%” + “Become an author”
“%kw%” + submit post
“%kw%” + “submit your own guest post”
“%kw%” + “Contribute to our site”
“%kw%” + “Submit an article”
“%kw%” + “Add a blog post”
“%kw%” + “Submit a guest post”
“%kw%” + “Guest bloggers wanted”
“%kw%” + “submit your guest post”
“%kw%” + “guest article”
“%kw%” + inurl:guest*posts
“%kw%” + Become guest writer
“%kw%” + inurl:guest*blogger
“%kw%” + “become a contributor” OR “contribute to this site”

More Info on searching for guest writing leads.


Expand a list of Root URLs with more related websites

related:”%kw%”


Search using keywords that are synonyms or similar to the main keyword

~%kw%


The Wild Card Operator

(This is not a list you can merge with keywords like the others.)

“Keyword * Keyword”
Example: “Post * Comment”

Will return results like:

To comment on a
post a comment.
post or comment
comment & analysis on news


Scraping Bing for keywords in Title/Meta Only

meta:”%kw%”
intitle:”%kw%”


Creating Better Footprint Lists

Search Operators for Each Search Engine:
Bing
Yahoo
Google

Keyword Zip file Downloads:
Common Numbers
Common Words

Uber Cool Extra ninja Bonus:

1.) Footprint list that's been passed around a billion and a half times. Credit: unknown.
2.) Scrapebox Generic Blog Comments (Pre-Spun)
3.) Scrapebox's Footprint Forum



Filtering unwanted urls

If you need a very targeted set of results, create a duplicate harvester engine and use the Must be in link function with your keyword. Only URLs containing that keyword will then be harvested and saved.

Note: Max URLs/Sec tends to slow down considerably, because the filter discards so many potential results.
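The effect of such a filter can be illustrated with a few lines of Python. This sketch treats "Must be in link" as a case-insensitive substring test, which is an assumption about how Scrapebox matches. The function name and URLs are made up.

```python
def must_contain(urls, keyword):
    """Keep only the URLs that contain the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [url for url in urls if keyword in url.lower()]


# Hypothetical harvested URLs
harvested = [
    "http://example.com/dog-training-tips",
    "http://example.org/cat-food",
    "http://example.net/Dog-Training/puppies",
]
print(must_contain(harvested, "dog-training"))
# ['http://example.com/dog-training-tips', 'http://example.net/Dog-Training/puppies']
```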


Remove links from common sites when expanding a list for expired domains in Link Extractor

Common Keywords/Domains:
amazon
reddit
tumblr
google
pinterest
facebook
stumbleupon
digg
twitter
del.icio.us
youtube
freelancer
linkedin
wiki
huffingtonpost
?
wp-content
.jpg
.jpeg
.png
archive
/tag/
/category/
.bmp
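The same kind of removal can be sketched in Python, assuming each entry in the list above is treated as a case-insensitive substring and any URL containing one of them is dropped. The function name and sample URLs are invented for illustration.

```python
def remove_unwanted(urls, blocklist):
    """Drop every URL that contains any blocklist entry (case-insensitive)."""
    entries = [entry.lower() for entry in blocklist if entry.strip()]
    return [url for url in urls if not any(e in url.lower() for e in entries)]


# A few entries taken from the list above
blocklist = ["amazon", "wiki", "?", ".jpg", "/tag/"]

# Hypothetical extracted links
extracted = [
    "http://example.com/about",
    "https://www.amazon.com/dp/1",
    "http://example.org/tag/seo/",
    "http://example.net/page?id=2",
    "http://example.net/photo.jpg",
]
print(remove_unwanted(extracted, blocklist))  # ['http://example.com/about']
```

Short entries match broadly under this approach: "wiki" also removes wikidot or wikispaces URLs, and "archive" removes any URL with that word anywhere in it.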

List of Top 500 Websites
Download


Built-in Scrapebox Filters

Common Ones:

/tag/
?
/archive/
tumblr.com
bat.bing
youtube
.gov
jobs
.pdf