
Make sure we re-parse documents in case there are parsing errors

Our download script is smart enough to re-download documents in case there were errors when fetching the latest ones, so we don't "lose" data in our DB:

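    download_url=https://collector.torproject.org/recent/$p/$u
    log_file=$PARSER_HOME/logs/downloads.log
    # Only fetch URLs that are not yet recorded as successfully downloaded.
    if ! grep -qxF "$download_url" "$log_file"; then
      # Keep the last HTTP status code, in case the request was redirected.
      status=$(wget --server-response "$download_url" 2>&1 | awk '/^  HTTP/{code=$2} END{print code}')
      if [ "$status" = "200" ]; then
        echo "$download_url" >> "$log_file"
      fi
    fi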

However, we don't have a good solution yet for data that goes missing when the download succeeds but the parser fails. In that case the URL is already recorded in the download log, so the next run won't fetch the older document again and it never gets re-parsed.
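
One possible direction, sketched here under assumptions rather than as the actual fix: track parse success in a second log, separate from the download log, and retry any document whose URL is missing from it. The names `parsed_log`, `needs_parsing`, `mark_parsed`, and `parse_once` are hypothetical, and the parser command is passed in by the caller because the real parser invocation is not shown in this issue.

```shell
#!/bin/sh
# Sketch only: parsed_log and the helper functions below are hypothetical.
# parsed_log does for parsing what downloads.log does for downloads.
parsed_log=${PARSER_HOME:-.}/logs/parsed.log

# Succeeds if the URL has not been recorded as parsed yet.
needs_parsing() {
  ! grep -qxF "$1" "$parsed_log" 2>/dev/null
}

# Record a URL only after its document was parsed without errors.
mark_parsed() {
  mkdir -p "$(dirname "$parsed_log")"
  echo "$1" >> "$parsed_log"
}

# Usage: parse_once URL COMMAND [ARGS...]
# Runs COMMAND (a stand-in for the real parser) unless URL is already
# recorded, and records URL only if COMMAND exits with status 0.
parse_once() {
  url=$1
  shift
  if needs_parsing "$url"; then
    if "$@"; then
      mark_parsed "$url"
    fi
  fi
}
```

With this in place, a run could call `parse_once "$download_url" some_parser "$local_file"` for every URL in the download log, where `some_parser` and `local_file` are placeholders. Documents whose parsing failed earlier would be retried on the next run, as long as the downloaded file is still available locally.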