It’s time for another wordlist creator script that scrapes websites and makes unique, sorted, utf-8 files… This time I added support for merging existing wordlist files. Rules are listed below, but you can modify them of course.
Rules list:
- Words must be longer than 8 characters (the script also drops anything over 25)
- Only alpha characters are accepted
- Entire wordlist is lowercase
- As I’ve already stated above, the wordlist is sorted and contains no duplicates
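To try the rules on their own, here is a minimal sketch of the filter as a standalone function. The name filter_words is made up for this example, and the 9 to 25 length bounds are taken from the sed pattern in the script below.

```shell
# filter_words: hypothetical helper, reads text on stdin and prints the
# words that satisfy the rules above.
filter_words() {
    tr -s '[:space:]' '\n' |             # one word per line
    tr '[:upper:]' '[:lower:]' |         # lowercase everything
    LC_ALL=C grep -E '^[a-z]{9,25}$' |   # alpha only, 9 to 25 characters
    LC_ALL=C sort -u                     # sorted, no duplicates
}

# Only "dictionary" and "passwords" pass the filter:
printf 'Dictionary attacks need PASSWORDS\npasswords p4ssw0rdlist dictionary\n' | filter_words
```

Changing the numbers inside the braces is enough to loosen or tighten the length rule.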
#!/bin/bash --
#
# wordlist creator v1.0b
# usage wordlist.sh [url] [output]
# if output file exists, merge files
#
if [ -n "$1" ] && [ -n "$2" ]
then
echo ">> $(date +'%T') - Starting to download $1. This can take a long time..."
#download the website recursively to the ./temp/ directory, skip non-text files
wget -r -l 2 --random-wait --user-agent='Mozilla/5.0' --quiet -R .jpg,.jpeg,.png,.gif,.bmp,.flv,.js,.avi,.wmv,.mp3,.zip,.css,.pdf,.iso,.tar.gz,.rar,.swf,.PNG,.GIF,.JPG,.JPEG,.BMP -P "./temp/" "$1"
echo ">> $(date +'%T') - Finished downloading, creating wordlist..."
#recursively search all files for words that match our criteria: 9-25 chars, alpha only
#-h stops grep from prefixing each line with its file name, which would spoil the first word
page=$(grep -Rh '' "./temp/" | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | tr " " "\n" | tr '[:upper:]' '[:lower:]' | sed -e '/[^a-zA-Z]/d' -e '/^.\{9,25\}$/!d' | sort -u)
echo ">> $(date +'%T') - Wordlist created!"
echo ">> Fetched lines: $(echo "$page" | wc -l)"
if [ -f "$2" ]; then
echo ">> File $2 already exists, merging files!"
echo "$page" >> "$2";
sort -u -o "$2" "$2"
echo ">> Wordlist merged with $2 and now has $(wc -l < "$2") lines!"
else
echo "$page" > "$2"
echo ">> Wordlist saved to $2!"
fi
#remove temporary website directory
rm -rf "./temp/"
else
echo ">> Error: URL and output file parameters required!"
echo ">> Example: $0 https://www.iana.org/domains/example/ ./wordlist.txt"
fi
Wow man, I was looking for code like this for a good while now. Seriously, I was so glad to find it here.
It took about 6 hours to process 200 MB of downloaded HTML on a crappy virtual machine, which became a 500 KB sorted wordlist.
I noticed I had a lot of duplicate files. Is there a way to get rid of them before processing?
Is there any way to log in to the selected site, so the script can download the otherwise restricted content?
THANKS!
Temporary files are a wget issue, so I can’t really help you with that one. However, you can easily log in to the site with wget if it uses standard HTTP authentication: just add the --user and --password parameters:
http://www.cyberciti.biz/faq/wget-command-with-username-password/
If the website uses an HTML form to log in, send the form data with GET/POST and save the cookies it returns, then you will be logged in. Just make sure to include those cookies in the second request. Unfortunately, this is the harder way.
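The form login could look roughly like the sketch below. The function name, the cookies.txt file and the form field names "username" and "password" are all assumptions for illustration; the real field names have to be read from the site's login form, and values with special characters would need URL encoding first.

```shell
# login_and_mirror: hypothetical helper, not part of the original script.
# Arguments: login form URL, user name, password, URL to download afterwards.
login_and_mirror() {
    local login_url=$1 user=$2 pass=$3 target_url=$4
    local cookies=./cookies.txt

    # First request: submit the form via POST and store the session cookies.
    wget --quiet --save-cookies "$cookies" --keep-session-cookies \
         --post-data "username=${user}&password=${pass}" \
         -O /dev/null "$login_url" || return 1

    # Second request: send the stored cookies along with the recursive download.
    wget --quiet --load-cookies "$cookies" -r -l 2 -P ./temp/ "$target_url"
}
```

A call would then be something like login_and_mirror 'https://example.com/login' myuser mypass 'https://example.com/', with both URLs being placeholders.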
Thanks!
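On the duplicate files question above: one possible approach, which is not part of the original script, is to delete byte-identical files by checksum before the wordlist is built. This is only a sketch; it assumes GNU md5sum is available and that file names contain no newlines or backslashes, and the function name is made up.

```shell
# remove_duplicate_files: hypothetical helper, not part of the original script.
# Keeps the first file of every group with identical content and deletes the rest.
remove_duplicate_files() {
    local dir=$1
    find "$dir" -type f -exec md5sum {} + |
    sort |                                    # group identical checksums together
    awk 'seen[$1]++ { sub(/^[^ ]+ [ *]/, ""); print }' |   # print 2nd+ file names
    while IFS= read -r dup; do
        rm -f -- "$dup"
    done
}
```

In the script it would be called as remove_duplicate_files "./temp/" right after the wget line. It only catches exact copies, not pages that differ by a timestamp or a session id.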
An interesting addition would be a log file with statistics, especially for the wiki scraper.
To see the:
- total time it took to build the wordlist
- total processed pages
- total processed lines
- number of processed pages vs. number of newly added lines vs. number of lines that were already present in the wordlist
etc.
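A rough sketch of how such a log line could be produced is below. Nothing here comes from the original scripts: the function name, its arguments and the log format are assumptions. It expects the freshly fetched words in a file of their own, so they can be compared with the existing wordlist before merging.

```shell
# log_stats: hypothetical helper, not part of the original scripts.
# Arguments: start time (epoch seconds), download directory, file with the
# fetched words, existing wordlist, log file to append to.
log_stats() {
    local start=$1 dir=$2 fetched=$3 wordlist=$4 log=$5
    local pages lines new known
    [ -f "$wordlist" ] || wordlist=/dev/null   # first run: nothing to compare with
    pages=$(find "$dir" -type f | wc -l)
    lines=$(LC_ALL=C sort -u "$fetched" | wc -l)
    # comm -13 prints the lines that appear only in the second (fetched) input
    new=$(LC_ALL=C comm -13 <(LC_ALL=C sort -u "$wordlist") \
                            <(LC_ALL=C sort -u "$fetched") | wc -l)
    known=$((lines - new))
    printf '%s elapsed=%ss pages=%d fetched=%d new=%d already_present=%d\n' \
        "$(date +'%F %T')" "$(( $(date +%s) - start ))" \
        "$pages" "$lines" "$new" "$known" >> "$log"
}
```

The script would store start=$(date +%s) before downloading, write "$page" to a temporary file, and call log_stats before the merge step.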
ATM it takes 8 minutes to download 500 wiki pages and it fetched 34642 lines in 5 seconds, and only added 15229 unique lines to a wordlist with 103121 lines.
I butchered your code, this is where I am now. (never seen bash before so…)
#!/bin/bash --
#
#scrapewiki.sh v0.1 by http://www.360percents.com
#UPDATE: OCT 23 2010
#
if [ -n "$1" ] && [ -n "$2" ]
then
rm -rf "./wikitemp/"
mkdir "./wikitemp/"
echo "$(date +'%T') - Downloading started..."
for i in `seq 1 $1`
do
wget -r -l 2 --random-wait --user-agent='Mozilla/5.0' -q -R .jpg,.jpeg,.png,.gif,.bmp,.flv,.js,.avi,.wmv,.mp3,.zip,.css,.pdf,.iso,.tar.gz,.rar,.swf,.PNG,.GIF,.JPG,.JPEG,.BMP,.txt,.TXT -O "./wikitemp/$i" http://hu.wikipedia.org/wiki/Special:Random
printf '...%s\r' "$i"
done
echo ""
echo "$(date +'%T') - Finished downloading, creating wordlist..."
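#-h stops grep from prefixing each line with its file name
page=$(grep -Rh '' "./wikitemp/" | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | tr " " "\n" | tr '[:upper:]' '[:lower:]' | sed -e '/[^a-zA-Z]/d' -e '/^.\{9,25\}$/!d' | sort -u)
echo ">> Fetched lines: $(echo "$page" | wc -l)"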
if [ -f "$2" ]; then
echo ">> File $2 already exists, merging files!"
echo "$page" >> "$2";
sort -u -o "$2" "$2"
echo ">> Wordlist merged with $2 and now has $(wc -l < "$2") lines!"
else
echo "$page" > "$2"
echo ">> Wordlist saved to $2!"
fi
else
echo "Usage: $0 number_of_pages output_file"
fi
Hi.
I used this script in an attempt to create a wordlist from the website below:
http://zerohora.clicrbs.com.br/rs/
It is the website of a Brazilian newspaper. After almost an hour, the script gave me the following result:
FINISHED --2012-08-20 16:22:10--
Total wall clock time: 51m 45s
Downloaded: 1938 files, 244M in 27m 51s (149 KB/s)
>> 16:22:11 - Finished downloading, creating wordlist...
sed: RE error: illegal byte sequence
16:22:11 - Wordlist created!
>> Fetched lines: 1
>> Wordlist saved to wordlist.txt!
The file wordlist.txt created in the process is an empty file. I would like to know if somebody could help me with this error. According to the result above, it is related to sed (illegal byte sequence).
Thanks in advance,
toter.
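"RE error: illegal byte sequence" is typically what BSD/macOS sed prints when the input contains bytes that are not valid in the current locale's encoding, for example ISO-8859-1 pages read under a UTF-8 locale. Whether that is the cause for this particular site is an assumption. A common workaround is to run sed in the C locale, sketched below on the tag-stripping step only, with a made-up function name.

```shell
# strip_tags: hypothetical helper showing the workaround on one pipeline step.
# LC_ALL=C makes sed treat its input as raw bytes instead of validating it
# against the locale's encoding.
strip_tags() {
    LC_ALL=C sed -e 's/<[^>]*>//g'
}

# \351 is "é" in ISO-8859-1, which is not a valid byte sequence in UTF-8.
printf '<p>caf\351 <b>palavras</b></p>\n' | strip_tags
```

In the script itself, putting export LC_ALL=C near the top should have the same effect for every command in the pipeline. Note that in the C locale accented letters do not count as alpha, so words containing them are dropped by the alpha-only rule; converting the pages with iconv first would be the alternative.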