Wordlist creator script #2

It’s time for another wordlist creator script that scrapes websites and produces unique, sorted, UTF-8 wordlist files. This time I added support for merging with an existing wordlist file. The rules are listed below, but you can of course modify them.

Rules list:

  • Words must be longer than 8 characters (the script keeps words of 9–25 characters)
  • Only alphabetic characters are accepted
  • The entire wordlist is lowercase
  • As stated above, the wordlist is sorted and deduplicated
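
To make the rules concrete, here is a rough sketch of the filtering stage in isolation, run on a made-up sample sentence (the full pipeline lives in the script below):

#split on spaces, lowercase, drop anything non-alphabetic, keep 9-25 character words, deduplicate
echo "The Treasury announced unprecedented quantitative easing in 2020" \
	| tr " " "\n" \
	| tr '[:upper:]' '[:lower:]' \
	| sed -e '/[^a-zA-Z]/d' -e '/^.\{9,25\}$/!d' \
	| sort -u
#prints: announced, quantitative, unprecedented (one per line)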

#!/bin/bash --
#
# wordlist creator v1.0b
# usage: wordlist.sh [url] [output]
# if output file exists, merge files
#

if [ "$1" -a "$2" ]
then
	echo ">> $(date +'%T') - Starting to downlad $1. This can take a long time..."

	#download websites recursively to ./temp/ directory, skip non-text files
	wget -r -l 2 --random-wait --user-agent='Mozilla/5.0' --quiet -R .jpg,.jpeg,.png,.gif,.bmp,.flv,.js,.avi,.wmv,.mp3,.zip,.css,.pdf,.iso,.tar.gz,.rar,.swf,.PNG,.GIF,.JPG,.JPEG,.BMP -P "./temp/" "$1"

	echo ">> $(date +'%T') - Finished downloading, creating wordlist..."

	#recursively search all downloaded files for words that match our criteria: alphabetic only, 9-25 characters
	page=`grep '' -R "./temp/" | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | tr " " "\n" | tr '[:upper:]' '[:lower:]' | sed -e '/[^a-zA-Z]/d' -e '/^.\{9,25\}$/!d' | sort -u`;

	echo "`date +"%T"` - Wordlist created!"
	echo ">> Fetched lines: $(echo "$page" | wc -l)"

	if [ -f "$2" ]; then
		echo ">> File $2 already exists, merging files!"
		echo "$page" >> "$2";
		cat "$2" | sort -u -o "$2";
		echo ">> Wordlist merged with $2 and now has $(cat "$2" | wc -l) lines!";
	else
		echo "$page" > "$2"
		echo ">> Wordlist saved to $2!"
	fi

	#remove temporary website directory
	rm -rf "./temp/"
else
	echo ">> Error: Parameter URL required!"
	echo ">> Example: $0 https://www.iana.org/domains/example/ ./wordlist.txt"
fi
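
Example usage (the URLs are only placeholders): the first run creates the wordlist, and running the script again with the same output file merges the newly scraped words into it:

./wordlist.sh https://www.iana.org/domains/example/ ./wordlist.txt
./wordlist.sh https://www.iana.org/numbers/ ./wordlist.txt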

5 thoughts on “Wordlist creator script #2”

  1. Wow man, I was looking for a script like this for a good while now; seriously, I was so glad to find it here.

    It took about 6 hours to process 200 MB of downloaded HTML on a crappy virtual machine, which became a 500 kB sorted wordlist.

    I noticed I had a lot of duplicate files. Is there a way to get rid of them before processing?
    Is there any way to log in to the selected site, so the script can download the otherwise restricted content?

    THANKS!

    • The temporary files are a wget issue, so I can’t really help you on that one, but you can easily log in to the site with wget if it uses standard HTTP authentication: just add the --user and --password parameters:
      http://www.cyberciti.biz/faq/wget-command-with-username-password/
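
      For example (the credentials and URL are placeholders):

      wget -r -l 2 --user=myuser --password=mypassword http://members.example.com/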

      If the website uses an HTML form to log in, send the login data with GET/POST and accept the cookies; then you will be logged in, but make sure to include the cookies in the second request. Unfortunately this is the harder way.
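
      Roughly like this (the form field names, login URL and paths are placeholders; every site names them differently):

      #1) submit the login form and keep the session cookies
      wget --save-cookies cookies.txt --keep-session-cookies --post-data 'username=myuser&password=mypassword' -O /dev/null http://www.example.com/login.php
      #2) reuse those cookies for the recursive download
      wget -r -l 2 --load-cookies cookies.txt -P "./temp/" http://www.example.com/members/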

      • Thanks!

        An interesting addition would be a log file with statistics, especially for the wiki scraper.
        To see the:
        - total time it took to build the wordlist
        - total processed pages
        - total processed lines
        - number of processed pages vs. number of newly added lines vs. number of lines that were already present in the wordlist
        etc.

        ATM it takes 8 minutes to download 500 wiki pages; the script fetched 34642 lines in 5 seconds, but only added 15229 new unique lines to a wordlist of 103121 lines.
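
        A minimal sketch of how such statistics could be logged (the variable names and the log file path are made up, not part of the original script; this slots in around the merge step, after $page has been built):

        #put the next line near the top of the script so the whole run is timed
        start=$SECONDS
        if [ -f "$2" ]; then before=$(wc -l < "$2"); else before=0; fi
        fetched=$(echo "$page" | wc -l)
        echo "$page" >> "$2"
        sort -u -o "$2" "$2"
        after=$(wc -l < "$2")
        #append one line of statistics per run: pages requested, lines fetched, new lines, total lines, elapsed time
        echo "$(date +'%F %T') pages=$1 fetched=$fetched new=$((after - before)) total=$after elapsed=$((SECONDS - start))s" >> ./wordlist-stats.log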

        I butchered your code; this is where I am now. (I had never seen bash before, so…)

        #!/bin/bash --
        #
        #scrapewiki.sh v0.1 by http://www.360percents.com
        #UPDATE: OCT 23 2010
        #
        if [ "$1" ]
        then
        rm -rf "./wikitemp/"
        mkdir "./wikitemp/"
        echo "$(date +'%T') - Downloading started..."
        for i in `seq 1 $1`
        do
        wget -r -l 2 --random-wait --user-agent='Mozilla/5.0' -q -R .jpg,.jpeg,.png,.gif,.bmp,.flv,.js,.avi,.wmv,.mp3,.zip,.css,.pdf,.iso,.tar.gz,.rar,.swf,.PNG,.GIF,.JPG,.JPEG,.BMP,.txt,.TXT -O "./wikitemp/$i" http://hu.wikipedia.org/wiki/Special:Random
        echo -n "...$i\r"
        done
        echo ""
        echo "$(date +'%T') - Finished downloading, creating wordlist..."

        page=`grep '' -R "./wikitemp/" | sed -e :a -e 's/]*>//g;/> Fetched lines: $(echo "$page" | wc -l)"

        if [ -f "$2" ]; then
        echo ">> File $2 already exists, merging files!"
        echo "$page" >> "$2";
        cat "$2" | sort -u -o "$2";
        echo ">> Wordlist merged with $2 and now has $(cat "$2" | wc -l) lines!";
        else
        echo "$page" > "$2"
        echo ">> Wordlist saved to $2!"
        fi
        else
        echo "Usage: $0 number_of_pages"
        fi

  2. Hi.

    I used this script in an attempt to create a wordlist from the website below:

    http://zerohora.clicrbs.com.br/rs/

    It is a website of a Brazilian newspaper. After almost an hour, the script gave me the following result:


    FINISHED --2012-08-20 16:22:10--
    Total wall clock time: 51m 45s
    Downloaded: 1938 files, 244M in 27m 51s (149 KB/s)
    >> 16:22:11 - Finished downloading, creating wordlist...
    sed: RE error: illegal byte sequence
    16:22:11 - Wordlist created!
    >> Fetched lines: 1
    >> Wordlist saved to wordlist.txt!

    The wordlist.txt file created in the process is empty. I would like to know if somebody could help me with this error. According to the output above, it is related to sed (illegal byte sequence).

    Thanks in advance,

    toter.
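
    That “illegal byte sequence” error usually means sed received bytes that are not valid in the current locale; the Brazilian site is probably ISO-8859-1 rather than UTF-8. Two possible workarounds, neither part of the original script (the ISO-8859-1 source encoding is an assumption):

    #option 1: run the text pipeline in the C locale so sed treats the input as raw bytes
    page=$(LC_ALL=C grep '' -R "./temp/" | LC_ALL=C sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | tr " " "\n" | tr '[:upper:]' '[:lower:]' | LC_ALL=C sed -e '/[^a-zA-Z]/d' -e '/^.\{9,25\}$/!d' | sort -u)

    #option 2: convert the downloaded pages to UTF-8 before building the wordlist
    find ./temp/ -type f -exec sh -c 'iconv -f ISO-8859-1 -t UTF-8 "$1" > "$1.utf8" && mv "$1.utf8" "$1"' _ {} \;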
