Tinkering with GNU parallel and wget for broken link checking

Finally found a parallel spidering solution. Online solutions didn’t really fit, because I don’t want to overload the production site and they can’t reach http://localhost. Trying out parallel + wget snippet from https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Breadth-first-parallel-web-crawler-mirrorer looks promising.

#!/bin/bash

URL=$1
 # Stay inside the start dir
 BASEURL=$(echo $URL | perl -pe 's:#.*::; s:(//.*/)[^/]*:$1:')
 URLLIST=$(mktemp urllist.XXXX)
 URLLIST2=$(mktemp urllist.XXXX)
 SEEN=$(mktemp seen.XXXX)

# Spider to get the URLs
 echo $URL >$URLLIST
 cp $URLLIST $SEEN

while [ -s $URLLIST ] ; do
 cat $URLLIST |
 parallel lynx -listonly -image_links -dump {} \; \
 wget -qm -l1 -Q1 {} \; echo Spidered: {} \>\&2 |
 perl -ne 's/#.*//; s/\s+\d+.\s(\S+)$/$1/ and
 do { $seen{$1}++ or print }' |
 grep -F $BASEURL |
 grep -v -x -F -f $SEEN | tee -a $SEEN > $URLLIST2
 mv $URLLIST2 $URLLIST
 done

rm -f $URLLIST $URLLIST2 $SEEN

Great exercise for the CPUs
htop gnu parallel

When the command finishes then the next step is parsing access_log

grep -r ' 404 ' /var/log/httpd/access_log | cut -d ' ' -f 7 | sed -r 's/^\//http\:\/\/localhost\//g'
Advertisements

Passwordless ssh not working

I was getting the following with ssh -v user@remote_host

debug1: Authentications that can continue: publickey,password
debug1: Next authentication method: publickey
debug1: Offering RSA public key: /home/mike/.ssh/id_rsa
debug2: we sent a publickey packet, wait for reply
debug1: Authentications that can continue: publickey,password
debug2: we did not send a packet, disable method
debug1: Next authentication method: password

The solution idea came from http://askubuntu.com/a/90465/168459 to fix .ssh dir permissions and .ssh/authorized_keys

Later during investigation after login with password and debug turned on SSH complained with:

debug1: Remote: Ignored authorized keys: bad ownership or modes for file /home/REMOTE_HOST_USER/.ssh/authorized_keys

Switched from Console.app to multitail

With Console.app I had the problem that when I switched to output /var/log/apache2/error_log I didn’t see
/var/log/system.log and yesterday I read http://kkovacs.eu/cool-but-obscure-unix-tools and found multitail so I played with it.

I’m currently running the following setup on OSX Mountain Lion


mike@mikembp:~$ cat bin/multitail-log.sh
#!/bin/bash

multitail -s 2 /tmp/lsof-net.log \
/var/log/apache2/error_log \
/var/log/system.log -I /var/log/wifi.log -I /var/log/mail.log \
/var/log/mysql.log

I had to make a crontab to get /tmp/lsof-net.log file, because multitail -R 2 -l “lsof lsof -RPi4 +c15” was crashing with “Operation not permitted”. I think the problem is that lsofon Mac is in /usr/sbin. Crons minimal execution is every minute so I had to call the desired command 29 times with 2 second sleep.


mike@mikembp:~$ cat bin/cron-netlog.sh
#!/bin/bash

# crontab -e
# * * * * * /Users/mike/bin/cron-netlog.sh

LOGFILE=/tmp/lsof-net.log

for (( i=1; i <= 29; i++ ))
do
/usr/sbin/lsof -RPi4 +c15 | grep -v -e rtorrent -e Mail -e Last | awk '{print $1,$2,$3,$4,$9,$10}' | column -t >> $LOGFILE
sleep 2
done

mike@mikembp:~$ cat .crontab
# ~/.crontab
#
# Run:
# crontab ~/.crontab

MAILTO=user@example.com

* * * * * ~/bin/cron-netlog.sh

mike@mikembp:~$ crontab .crontab

Network and bandwidth monitoring with darkstat

I was searching for a networking monitor solution and found http://hints.macworld.com/article.php?story=20020521011343792

Darkstat’s source code is available at http://unix4lyfe.org/darkstat/After starting it runs as a daemon in the background.


mike@mikembp:~$ sudo darkstat -i en0

It binds itself to the TCP port 667 which can be changed and also other things:


mike@mikembp:~$ darkstat --help
darkstat 3.0.718 (using libpcap version 1.1.1)

usage: darkstat [ -i interface ]
[ -f filter ]
[ -r capfile ]
[ -p port ]
[ -b bindaddr ]
[ -l network/netmask ]
[ --base path ]
[ --local-only ]
[ --snaplen bytes ]
[ --pppoe ]
[ --syslog ]
[ --verbose ]
[ --no-daemon ]
[ --no-promisc ]
[ --no-dns ]
[ --no-macs ]
[ --no-lastseen ]
[ --chroot dir ]
[ --user username ]
[ --daylog filename ]
[ --import filename ]
[ --export filename ]
[ --pidfile filename ]
[ --hosts-max count ]
[ --hosts-keep count ]
[ --ports-max count ]
[ --ports-keep count ]
[ --highest-port port ]
[ --wait secs ]
[ --hexdump ]
[ --version ]
[ --help ]

Additional resources, info:
https://thejimmahknows.com/network-monitoring-ntop-vs-darkstat/http://slackblogs.blogspot.com/2011/06/monitor-traffic-usage-using-darkstat.html