Tinkering with GNU parallel and wget for broken link checking

Finally found a parallel spidering solution. Online link checkers didn't really fit: I don't want to overload the production site, and they can't reach http://localhost. The parallel + wget snippet from https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Breadth-first-parallel-web-crawler-mirrorer looks promising.

#!/bin/bash

URL=$1
# Stay inside the start dir
BASEURL=$(echo $URL | perl -pe 's:#.*::; s:(//.*/)[^/]*:$1:')
URLLIST=$(mktemp urllist.XXXX)
URLLIST2=$(mktemp urllist.XXXX)
SEEN=$(mktemp seen.XXXX)

# Spider to get the URLs
echo $URL >$URLLIST
cp $URLLIST $SEEN

while [ -s $URLLIST ] ; do
  cat $URLLIST |
    # For each URL: lynx lists its links (including images), wget mirrors it
    # one level deep with a tiny quota, and the URL is logged to stderr
    parallel lynx -listonly -image_links -dump {} \; \
      wget -qm -l1 -Q1 {} \; echo Spidered: {} \>\&2 |
    # Strip #fragments and pull the URL out of lynx's numbered listing,
    # printing each URL only once
    perl -ne 's/#.*//; s/\s+\d+.\s(\S+)$/$1/ and
      do { $seen{$1}++ or print }' |
    # Keep only URLs under the start dir that haven't been seen yet,
    # and remember them for the next round
    grep -F $BASEURL |
    grep -v -x -F -f $SEEN | tee -a $SEEN > $URLLIST2
  mv $URLLIST2 $URLLIST
done

rm -f $URLLIST $URLLIST2 $SEEN
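
To run it, save the snippet as e.g. spider.sh (the filename and the localhost start URL below are just examples, not part of the original snippet) and pass the start URL as the only argument:

chmod +x spider.sh
./spider.sh http://localhost/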

Great exercise for the CPUs.

[Screenshot: htop while GNU parallel is running]
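
If the crawl hammers the machine (or the site being spidered) too hard, GNU parallel can be throttled; the job count and load threshold below are arbitrary examples, not values from the man page snippet, and the line is a drop-in replacement for the parallel command inside the while loop:

# At most 4 jobs at a time, and back off while the load average is above 80%
parallel -j4 --load 80% lynx -listonly -image_links -dump {} \; \
    wget -qm -l1 -Q1 {} \; echo Spidered: {} \>\&2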

When the command finishes, the next step is parsing the access_log for 404s and turning the logged request paths back into localhost URLs:

grep ' 404 ' /var/log/httpd/access_log | cut -d ' ' -f 7 | sed 's|^/|http://localhost/|'
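
One possible follow-up (a sketch, not from the original snippet; the log path and localhost base simply follow the command above, and the job count is arbitrary) is to feed the unique 404 paths back through parallel and curl to see which of them are still broken on the local copy:

grep ' 404 ' /var/log/httpd/access_log |
  cut -d ' ' -f 7 | sort -u |
  sed 's|^/|http://localhost/|' |
  parallel -j8 'curl -s -o /dev/null -w "%{http_code} %{url_effective}\n" {}'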