I recently needed to scrape information from a website that requires login credentials. The authentication and session information was held in several cookies, which Wget can use if they are stored in a plain text file. I used Firefox to log in and set the cookies, but Firefox saves its cookies in an SQLite database file, which must be exported before Wget can use them. A quick Google search turned up a few possible methods using sqlite3, which I’ve adapted here to use with Wget. I’ve also added some example code to extract hrefs and print them out, along with the webpage URL. The script is called with the target URL as its only command-line argument.
#!/bin/bash
# Example script to get content using Firefox cookies.
# by Jean-Sebastien Morisset (https://surniaulula.com/)
# Usage: pass the target URL as the only argument.

# Firefox cookie database (Mac OS X path) and the user agent sent to the site.
cookie_file="$(echo "$HOME"/Library/Application\ Support/Firefox/Profiles/*.default/cookies.sqlite)"
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:15.0) Gecko/20100101 Firefox/15.0.1"

# Export the cookies as tab-separated lines, feed them to wget on stdin,
# extract href values from the page, and print "url<TAB>href" for each one.
echo ".mode tabs
select host, case when host glob '.*' then 'TRUE' else 'FALSE' end,
path, case when isSecure then 'TRUE' else 'FALSE' end,
expiry, name, value from moz_cookies;" | \
sqlite3 "$cookie_file" | \
wget --load-cookies=/dev/stdin --user-agent="$user_agent" --output-document=/dev/stdout "$1" 2>/dev/null | \
sed -n -e "/>/G" -e "s/^.*href=['\"]\([^'\"]*\)['\"].*$/\1/p" | \
while IFS= read -r line; do printf '%s\t%s\n' "$1" "$line"; done
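
Each row the query produces is one line of the tab-separated cookie file layout that Wget’s --load-cookies option reads (the format of Netscape’s cookies.txt): domain, a flag saying whether subdomains match, path, a secure flag, expiry time, name, and value. As a minimal sketch of that layout, the hypothetical helper cookie_line below builds one such line by hand, mirroring the host glob '.*' test from the query. The helper name and the sample values are made up for illustration.

```shell
# Sketch: format one cookie as a Netscape-style cookies.txt line.
# cookie_line is a hypothetical helper; its arguments are
# host, path, secure (0 or 1), expiry (Unix time), name, value.
cookie_line() {
    local host="$1" path="$2" secure="$3" expiry="$4" name="$5" value="$6"
    local subdomains=FALSE secure_flag=FALSE
    # A leading dot means the cookie also applies to subdomains.
    case "$host" in .*) subdomains=TRUE ;; esac
    # Map the numeric secure column to the TRUE/FALSE text the file uses.
    [ "$secure" = 1 ] && secure_flag=TRUE
    printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
        "$host" "$subdomains" "$path" "$secure_flag" "$expiry" "$name" "$value"
}

# Made-up sample cookie, printed as one tab-separated line.
cookie_line .example.com / 1 1893456000 session abc123
```

Piping the sqlite3 output through cat before Wget is a quick way to compare the real cookie lines against this layout.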
Unless you use Mac OS X, you’ll probably want to update the $cookie_file path and the $user_agent value. ;-)
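
One way to avoid editing the path by hand is to pick the profile directory based on the operating system. The sketch below is only an illustration: find_cookie_file is a hypothetical helper, the Linux location ($HOME/.mozilla/firefox) is an assumption about a default Firefox install, and it simply returns the first cookies.sqlite it finds in any profile.

```shell
# Sketch: locate a Firefox cookies.sqlite file for the current user.
# find_cookie_file is a hypothetical helper; the non-Mac path is an
# assumption about where a default Linux install keeps its profiles.
find_cookie_file() {
    local profiles f
    case "$(uname -s)" in
        Darwin) profiles="$HOME/Library/Application Support/Firefox/Profiles" ;;
        *)      profiles="$HOME/.mozilla/firefox" ;;
    esac
    # Print the first cookies.sqlite found in any profile directory.
    for f in "$profiles"/*/cookies.sqlite; do
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    # No profile with a cookie database was found.
    return 1
}
```

The script could then set cookie_file="$(find_cookie_file)" and stop with an error message when the function fails.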
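The sed expression works as follows: the -n option suppresses the default output; the />/G command appends a newline to every line that contains a greater-than character; and the substitution replaces any line that contains an href with the href’s value, then prints the resulting line. Because the leading .* in the substitution is greedy, a line with several hrefs yields only the last one.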
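
To try the href extraction on its own, without Wget or cookies, the sketch below applies the same kind of substitution to a sample line of HTML. It uses tr to put each tag on its own line before sed runs, which is a swapped-in step assumed here for illustration rather than taken from the script, and extract_hrefs is a hypothetical name.

```shell
# Sketch: print every href value found in HTML read from standard input.
# extract_hrefs is a hypothetical helper, not part of the original script.
extract_hrefs() {
    # Break the input after each tag, then keep only the quoted href value.
    tr '>' '\n' | sed -n -e "s/^.*href=['\"]\([^'\"]*\)['\"].*$/\1/p"
}

# Sample input with two links on one line; prints /one and /two on separate lines.
printf '%s\n' '<a href="/one">1</a><a href="/two">2</a>' | extract_hrefs
```

Splitting on the greater-than character first means each line holds at most one tag, so the substitution cannot skip an earlier href in favor of a later one on the same line.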