Tinkering with GNU parallel and wget for broken link checking

Finally found a parallel spidering solution. Online link checkers didn’t really fit, because I don’t want to overload the production site and they can’t reach http://localhost. The parallel + wget snippet from https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Breadth-first-parallel-web-crawler-mirrorer looks promising.

#!/bin/bash

URL=$1
# Stay inside the start dir
BASEURL=$(echo "$URL" | perl -pe 's:#.*::; s:(//.*/)[^/]*:$1:')
URLLIST=$(mktemp urllist.XXXX)
URLLIST2=$(mktemp urllist.XXXX)
SEEN=$(mktemp seen.XXXX)

# Spider to get the URLs
echo "$URL" >"$URLLIST"
cp "$URLLIST" "$SEEN"

# Each pass crawls the current list; repeat until no new URLs turn up
while [ -s "$URLLIST" ] ; do
  cat "$URLLIST" |
    parallel lynx -listonly -image_links -dump {} \; \
      wget -qm -l1 -Q1 {} \; echo Spidered: {} \>\&2 |
    perl -ne 's/#.*//; s/\s+\d+.\s(\S+)$/$1/ and
      do { $seen{$1}++ or print }' |
    grep -F "$BASEURL" |
    grep -v -x -F -f "$SEEN" | tee -a "$SEEN" > "$URLLIST2"
  mv "$URLLIST2" "$URLLIST"
done

rm -f "$URLLIST" "$URLLIST2" "$SEEN"
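
The least obvious part of the loop is the last pipeline stage, which keeps only URLs that have not been crawled yet. A minimal sketch of that step on its own follows; next_frontier is a name made up for illustration, and it assumes the input has already been deduplicated, as the perl step does in the script.

```shell
#!/bin/bash
# Sketch of the "only keep URLs we have not seen yet" step from the crawler loop.
# next_frontier is an illustrative name, not part of the original script.

# Usage: next_frontier SEEN_FILE < candidate_urls
# Prints candidates that are not already listed in SEEN_FILE (literal,
# whole-line match) and appends them to SEEN_FILE so a later pass skips them.
next_frontier() {
    local seen=$1
    grep -v -x -F -f "$seen" | tee -a "$seen"
}
```

grep reads the whole pattern file before it prints anything, so appending to the same file with tee in the same pipeline is safe here.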

Great exercise for the CPUs:
[Screenshot: htop while GNU parallel is running]

When the crawl finishes, the next step is to pull the 404s out of access_log:

grep ' 404 ' /var/log/httpd/access_log | cut -d ' ' -f 7 | sed -r 's|^/|http://localhost/|'
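
Since the crawler can hit the same broken link many times, it may help to list each one once with a hit count. A rough sketch, assuming the common/combined log format (field 7 is the request path, field 9 the status code); broken_links is a hypothetical helper name.

```shell
#!/bin/bash
# Illustrative helper: print each 404 path once with its hit count,
# most frequent first. Assumes common/combined log format, where
# field 7 is the request path and field 9 is the HTTP status.
broken_links() {
    awk '$9 == 404 { print $7 }' "$@" |
        sort | uniq -c | sort -rn |
        awk '{ print $1, "http://localhost" $2 }'
}

# Example:
#   broken_links /var/log/httpd/access_log
```

Matching on the status field rather than grepping for ' 404 ' also avoids false hits on responses whose size happens to be 404 bytes.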

First experience with RAID0

My new desktop PC has Intel Rapid Storage technology, so I gave it a try. I bought a second Kingston V300 SSD to put the two in RAID0.

Intel Rapid Storage technology

I was curious about the performance of the SSDs with and without RAID0, so I ran hdparm -tT on them.
The results:

Kingston SSD performance

The old SSD slowed down the RAID array, so I sold it and installed the OS on the new SSD alone, because it was 3x faster and I didn’t need the extra space of two SSDs in RAID.
Unfortunately I don’t remember the speed of the old SSD when I bought it. This time I saved the results for the new one so I can compare after some years.
It’s mind-blowing how much faster my PC is with the new SSD.
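
To make saved results easier to compare later, the throughput figures can be pulled out of the hdparm output. This is only a sketch: hdparm_rates is a made-up name, and it assumes result lines of the form "Timing <label>: ... = <number> MB/sec".

```shell
#!/bin/bash
# Sketch: extract the MB/sec figures from `hdparm -tT` output on stdin.
# hdparm_rates is an illustrative name. Lines that do not look like
# "Timing <label>: ... = <number> MB/sec" are ignored.
hdparm_rates() {
    awk -F'[:=]' '/Timing/ && /MB\/sec/ {
        label = $1; rate = $NF
        sub(/^ *Timing /, "", label)
        sub(/^ */, "", rate)
        sub(/ *MB\/sec.*/, "", rate)
        print label ": " rate
    }'
}

# Example (the device name and output file are assumptions):
#   sudo hdparm -tT /dev/sda | hdparm_rates >> ~/ssd-bench-$(date +%F).txt
```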

Check out this awesome article about swapping https://rudd-o.com/linux-and-free-software/tales-from-responsivenessland-why-linux-feels-slow-and-how-to-fix-that

For disk benchmarking on Linux, check out https://wiki.archlinux.org/index.php/Benchmarking/Data_storage_devices

Browser memory usage

I was curious why Chromium “eats” so much of my system memory, so I checked chrome://chrome-urls/ to see what’s there for debugging memory usage.

I saw chrome://memory-internals/ and checked it out. It shows how much memory each browser tab and extension uses.

Chrome memory usage data from chrome://memory-internals

Tried closing and reopening some tabs, but before that I typed free -h to check memory usage.

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           3.9G        1.9G        1.3G         53M        682M        1.8G
Swap:          2.8G        276M        2.5G
$ free -h
              total        used        free      shared  buff/cache   available
Mem:           3.9G        1.9G        1.3G         53M        684M        1.9G
Swap:          2.8G        275M        2.5G
$ free -h
              total        used        free      shared  buff/cache   available
Mem:           3.9G        2.0G        1.2G         58M        694M        1.7G
Swap:          2.8G        273M        2.5G

The numbers were not lying. So if you are working with a lot of tabs and extensions, more RAM will benefit your system.
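
For a rough total from the command line, the resident set sizes of all browser processes can be added up. A minimal sketch; sum_rss_mib is a made-up name, and the process name in the example is an assumption that depends on how the browser is packaged.

```shell
#!/bin/bash
# Sketch: add up resident set sizes (KiB, one per line on stdin) and
# print the total in MiB. sum_rss_mib is an illustrative name.
# RSS counts shared pages once per process, so the total overstates
# what a multi-process browser really uses.
sum_rss_mib() {
    awk '{ total += $1 } END { printf "%.1f MiB\n", total / 1024 }'
}

# Example (Linux procps; the process name may be "chromium",
# "chromium-browser" or "chrome" depending on the package):
#   ps -C chromium -o rss= | sum_rss_mib
```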

Similar goodies are also available in Firefox. At http://kb.mozillazine.org/About_protocol_links you can see a list of Firefox’s internal about:* pages.

For memory usage, visit about:memory and click Measure.

Firefox memory usage

Enable MySQL file logging

I wanted to know which queries were being executed, so I had to enable query logging in MariaDB.

SET GLOBAL general_log = 'ON';
SET GLOBAL general_log_file = '/var/log/mysql.log';
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL slow_query_log_file = '/var/log/mysql-slow.log';

To see the changes (the % wildcard also shows the matching *_file variable):

SHOW VARIABLES LIKE 'general_log%';
SHOW VARIABLES LIKE 'slow_query_log%';
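
Settings changed with SET GLOBAL are lost when the server restarts. To keep them, the same options can go into the server's option file. This is a sketch only: the file's location varies by distribution, and the paths are simply the ones used above.

```ini
# Assumed to live in the [mysqld] section of my.cnf (location varies)
[mysqld]
general_log = 1
general_log_file = /var/log/mysql.log
slow_query_log = 1
slow_query_log_file = /var/log/mysql-slow.log
```

The general log records every statement and grows quickly, so it is usually worth turning off again once the debugging is done.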

logrotate error: stat of /var/log/xferlog failed

In Arch Linux the logrotate service was failing:

user@host# systemctl start logrotate.service 
Job for logrotate.service failed. See "systemctl status logrotate.service" and "journalctl -xe" for details.

So I ran it by hand to debug:

user@host# logrotate /etc/logrotate.conf

And the following error appeared:

logrotate error: stat of /var/log/xferlog failed

I fired up a grep for xferlog in the /etc directory:

user@host:/etc# grep -r xferlog *
logrotate.d/proftpd:/var/log/xferlog

The solution was to comment out the xferlog rule in logrotate.d/proftpd, since I don’t need the transfer log.
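
An alternative that keeps the rule is logrotate's missingok directive, which skips a log file that does not exist instead of raising an error. The stanza below is a guess at the shape of the packaged file, not a copy of it.

```
# logrotate.d/proftpd (sketch; the real stanza may contain more directives)
/var/log/xferlog {
    missingok
    # ...keep the other directives from the packaged file unchanged
}
```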

Switched from Console.app to multitail

With Console.app I had the problem that when I switched to the output of /var/log/apache2/error_log I could no longer see /var/log/system.log. Yesterday I read http://kkovacs.eu/cool-but-obscure-unix-tools, found multitail there and played with it.

I’m currently running the following setup on OS X Mountain Lion:


mike@mikembp:~$ cat bin/multitail-log.sh
#!/bin/bash

multitail -s 2 /tmp/lsof-net.log \
/var/log/apache2/error_log \
/var/log/system.log -I /var/log/wifi.log -I /var/log/mail.log \
/var/log/mysql.log

I had to set up a cron job to produce the /tmp/lsof-net.log file, because multitail -R 2 -l "lsof -RPi4 +c15" was crashing with “Operation not permitted”. I think the problem is that lsof on the Mac lives in /usr/sbin. Cron’s shortest interval is one minute, so the script calls the command 29 times with a 2-second sleep after each run.


mike@mikembp:~$ cat bin/cron-netlog.sh
#!/bin/bash

# crontab -e
# * * * * * /Users/mike/bin/cron-netlog.sh

LOGFILE=/tmp/lsof-net.log

for (( i=1; i <= 29; i++ ))
do
    /usr/sbin/lsof -RPi4 +c15 | grep -v -e rtorrent -e Mail -e Last | awk '{print $1,$2,$3,$4,$9,$10}' | column -t >> "$LOGFILE"
    sleep 2
done

mike@mikembp:~$ cat .crontab
# ~/.crontab
#
# Run:
# crontab ~/.crontab

MAILTO=user@example.com

* * * * * ~/bin/cron-netlog.sh

mike@mikembp:~$ crontab .crontab
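
The same trick works for any command that needs to run more often than once a minute. A small sketch of a generic version; repeat_every is a name made up here, not something cron or multitail provides.

```shell
#!/bin/bash
# Sketch: run a command COUNT times with DELAY seconds between runs, so a
# once-a-minute cron entry can sample more often. repeat_every is an
# illustrative name.
repeat_every() {
    local count=$1 delay=$2 i
    shift 2
    for (( i = 1; i <= count; i++ )); do
        "$@"
        # Skip the sleep after the last run so the script ends promptly.
        if (( i < count )); then
            sleep "$delay"
        fi
    done
}

# Example: 29 samples, 2 seconds apart, appended to a log file
#   repeat_every 29 2 /usr/sbin/lsof -RPi4 +c15 >> /tmp/lsof-net.log
```

Count times delay should stay under 60 seconds, otherwise consecutive cron runs overlap.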