Make sure we re-parse documents in case there are parsing errors
Our download script re-downloads a document if fetching it failed on an earlier run, so download errors do not cause us to "lose" data in our DB:
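download_url="https://collector.torproject.org/recent/$p/$u"
log_file="$PARSER_HOME/logs/downloads.log"
# Skip URLs already recorded as successfully downloaded.
if ! grep -qFx -- "$download_url" "$log_file"; then
    # wget indents response headers, so allow leading spaces; keep the last status seen.
    status=$(wget --server-response "$download_url" 2>&1 | awk '/^ *HTTP\//{s=$2} END{print s}')
    # Record the URL only after an HTTP 200, so failed fetches are retried.
    if [ "$status" = "200" ]; then
        echo "$download_url" >> "$log_file"
    fi
fi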
However, we do not yet have a good solution for data that goes missing when the download succeeds but the parser then fails. In that case the URL is already recorded in the download log, so the next run skips it and the older documents are never picked up for re-parsing.
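One possible direction, sketched here under assumptions rather than as the existing implementation: keep a second log of URLs whose documents were parsed successfully, and consult it the same way the download log is consulted. The function names `needs_parse` and `parse_and_record`, the separate parsed log, and the idea that the parser exits non-zero on failure are all hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: track parsed documents separately from downloaded ones.
# needs_parse, parse_and_record and the parsed log are assumptions,
# not part of the existing download script.

# Succeed (exit 0) if the URL is not yet listed in the parsed log,
# i.e. the document still needs to be parsed.
needs_parse() {
    url=$1
    parsed_log=$2
    ! grep -qFx -- "$url" "$parsed_log" 2>/dev/null
}

# Run the given parser command and record the URL only if it exits with 0.
# Usage: parse_and_record URL PARSED_LOG COMMAND [ARGS...]
parse_and_record() {
    url=$1
    parsed_log=$2
    shift 2
    if "$@"; then
        echo "$url" >> "$parsed_log"
    else
        # Leave the URL unrecorded so a later run retries it.
        return 1
    fi
}
```

With something like this, a run could call `needs_parse` for every URL in the download log and retry the ones that were fetched but never parsed. This only helps while the downloaded file is still available locally or can be fetched again.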