Tinkering with GNU parallel and wget for broken link checking

Finally found a parallel spidering solution. Online services didn’t really fit: I don’t want to overload the production site, and they can’t reach http://localhost. The parallel + wget snippet from https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Breadth-first-parallel-web-crawler-mirrorer looks promising.


 #!/bin/bash

 # Start URL is the first argument, e.g. ./crawl.sh http://localhost/
 URL=$1

 # Stay inside the start dir
 BASEURL=$(echo $URL | perl -pe 's:#.*::; s:(//.*/)[^/]*:$1:')
 URLLIST=$(mktemp urllist.XXXX)
 URLLIST2=$(mktemp urllist.XXXX)
 SEEN=$(mktemp seen.XXXX)

 # Spider to get the URLs
 echo $URL >$URLLIST
 cp $URLLIST $SEEN

 while [ -s $URLLIST ] ; do
   cat $URLLIST |
     parallel lynx -listonly -image_links -dump {} \; \
       wget -qm -l1 -Q1 {} \; echo Spidered: {} \>\&2 |
       perl -ne 's/#.*//; s/\s+\d+.\s(\S+)$/$1/ and
         do { $seen{$1}++ or print }' |
     grep -F $BASEURL |
     grep -v -x -F -f $SEEN | tee -a $SEEN > $URLLIST2
   mv $URLLIST2 $URLLIST
 done

 rm -f $URLLIST $URLLIST2 $SEEN


Great exercise for the CPUs:

[Screenshot: htop while GNU parallel keeps all cores busy]

When the command finishes, the next step is parsing the access_log for 404 responses:

grep ' 404 ' /var/log/httpd/access_log | cut -d ' ' -f 7 | sed -r 's|^/|http://localhost/|'
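To prioritize fixes, the same pipeline can rank the broken paths by how often they were hit. A minimal sketch: the sample log file here is hypothetical so it runs stand-alone, in practice you would read your real access_log instead.

```shell
# Tiny sample access_log (combined log format) so the sketch runs anywhere;
# substitute /var/log/httpd/access_log in real use.
cat > /tmp/access_log.sample <<'EOF'
127.0.0.1 - - [01/Jan/2024:00:00:00 +0000] "GET /old-page HTTP/1.1" 404 196 "-" "Wget"
127.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET /old-page HTTP/1.1" 404 196 "-" "Wget"
127.0.0.1 - - [01/Jan/2024:00:00:02 +0000] "GET /gone.css HTTP/1.1" 404 196 "-" "Wget"
127.0.0.1 - - [01/Jan/2024:00:00:03 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Wget"
EOF

# Field 7 of the combined log format is the request path;
# count each 404 path and list the most frequent first.
grep ' 404 ' /tmp/access_log.sample \
  | cut -d ' ' -f 7 \
  | sort | uniq -c | sort -rn
```

With the sample above, /old-page surfaces first with a count of 2, so the most-hit broken links get fixed first.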

Link checking as cache warmup or integration testing

Sometimes I use wget’s awesome recursive spidering (crawling) feature (an alternative to linkchecker) not only for broken link checking but also for cache warmup or for spotting PHP errors after a commit.

On production, for some projects, I log PHP errors into a separate file per day inside a log directory:

// Create the log directory on first use
if (!is_dir(DOC_ROOT.'/log')) {
    mkdir(DOC_ROOT.'/log', 0775);
}

ini_set('display_errors', 0);
ini_set('display_startup_errors', 0);
ini_set('log_errors', 1);
ini_set('error_log', DOC_ROOT.'/log/php-error-'.date('d').'.log');

So after crawling I check whether a log file for today appeared.
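That check can be scripted too. A minimal sketch, assuming the DOC_ROOT layout and the php-error-<day>.log naming from the PHP snippet above (the /tmp path is just a stand-in so the sketch runs anywhere):

```shell
# Stand-in for the real document root (an assumption for this sketch)
DOC_ROOT=/tmp/example-docroot
mkdir -p "$DOC_ROOT/log"    # only so the sketch is runnable stand-alone

# The PHP snippet names today's file php-error-<day>.log
LOGFILE="$DOC_ROOT/log/php-error-$(date +%d).log"

if [ -s "$LOGFILE" ]; then
    echo "PHP errors were logged during the crawl:"
    tail -n 20 "$LOGFILE"
else
    echo "No PHP errors logged today."
fi
```

Run it right after the crawl finishes; a non-empty log file means the spider tripped over a PHP error somewhere.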
And the wget shell script that crawls the site is:


timestamp=$(date +"%Y%m%d%H%M%S")

cd /tmp
time wget -4 --spider -r --delete-after \
     --no-cache --no-http-keep-alive --no-dns-cache \
     -U "Wget" \
     -X/blog \
     http://www.example.com -o"/tmp/wget-example-com-$timestamp.log"

During the crawl your site gets hit and the cache is generated if you have one implemented, for example phpFastCache.
I crawl via IPv4 only and disable caching, DNS caching and HTTP keep-alive. Setting a unique user agent also helps when parsing access_log later.
-X excludes a directory tree, which is also handy for improving performance on large sites.
-o writes the wget status report to a file, where you can search for HTTP status codes such as 404, 403, 500, etc.
Remember the excluded path: you can run another wget instance on it in parallel if you need to. Unfortunately wget can’t run parallel threads as of this writing.
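Since wget itself is single-threaded, the excluded path is exactly where a manual split helps: one instance skips /blog with -X while a second covers only /blog with -I. A sketch under the same placeholder hostname as the script above:

```shell
timestamp=$(date +"%Y%m%d%H%M%S")

# First instance: the whole site except the /blog tree
wget -4 --spider -r --delete-after -U "Wget" \
     -X/blog http://www.example.com \
     -o"/tmp/wget-main-$timestamp.log" &

# Second instance: only the /blog tree (-I includes just that path)
wget -4 --spider -r --delete-after -U "Wget" \
     -I/blog http://www.example.com \
     -o"/tmp/wget-blog-$timestamp.log" &

# Wait for both crawls; don't abort on a non-zero exit
# (wget --spider exits non-zero when it finds broken links)
wait || true
```

Afterwards grep both log files for the status codes, just as with the single-instance run.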