Link checking as cache warmup or integration testing

Sometimes I use wget's awesome recursive spidering (crawling) feature (an alternative to linkchecker) not only for broken-link checking, but also for cache warmup or for spotting PHP errors after a commit.

On production, for some projects, I use PHP error logging into a separate log directory, with one file per day of the month:

if (!is_dir(DOC_ROOT.'/log')) {
    // Create the log directory on first use.
    mkdir(DOC_ROOT.'/log', 0755, true);
}

ini_set('display_errors', 0);
ini_set('display_startup_errors', 0);
ini_set('log_errors', 1);
ini_set('error_log', DOC_ROOT.'/log/php-error-'.date('d').'.log');

So after crawling I check whether a log file has appeared for that day.
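From the shell, that check can look like this (a sketch: the /var/www/example document root is an assumption, and the filename pattern matches the PHP snippet above):

```shell
#!/bin/sh
# DOC_ROOT is a placeholder for your document root (assumption).
DOC_ROOT=/var/www/example

# The PHP snippet above logs into php-error-<day-of-month>.log.
logfile="$DOC_ROOT/log/php-error-$(date +%d).log"

if [ -s "$logfile" ]; then
    echo "PHP errors were logged during the crawl:"
    tail "$logfile"
else
    echo "No PHP error log for today."
fi
```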
And the wget shell script that crawls the site (http://example.com/ here stands for your site) is:

timestamp=$(date +"%Y%m%d%H%M%S")

cd /tmp
time wget -4 --spider -r --delete-after \
     --no-cache --no-http-keep-alive --no-dns-cache \
     -U "Wget" \
     -X /blog \
     -o "/tmp/wget-example-com-$timestamp.log" \
     http://example.com/

During the crawl your site is hit and the cache is generated, if you have one implemented, for example phpFastCache.
I crawl only via IPv4 and without HTTP keep-alive, purely for better performance. Setting a unique user agent is also handy for parsing access_log later.
-X stands for excluding a path, which also helps performance on large sites.
-o writes wget's status report, where you can search for HTTP status codes such as 404, 403, 500, etc.
Remember the excluded path; you can run another wget for it in parallel if you need to. Unfortunately, wget can't run parallel threads as of this writing.
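Scanning the report for those status codes can be scripted, too. A minimal sketch (the log filename is hypothetical; wget prints response lines such as "HTTP request sent, awaiting response... 404 Not Found"):

```shell
#!/bin/sh
# Hypothetical log file; use the timestamped file written by -o above.
log=/tmp/wget-example-com-20150101120000.log

# Match any 4xx or 5xx response line in the wget report.
if grep -E 'awaiting response\.\.\. [45][0-9]{2}' "$log"; then
    echo "Broken links or server errors found."
else
    echo "Crawl looks clean."
fi
```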


About Michal Zuber

Biker and rollerblader.

This entry was posted in bestpractice, Uncategorized.
