Scraping bill text and metadata
The steps below will create the baseline bill text and metadata directories that are used by this project. The initial scraping may take a very long time (1 or 2 days). Updates are done through the Celery task runner.
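For reference, a scheduled update can be wired into Celery roughly as follows. This is a minimal sketch, and the task name is hypothetical; see the project's Celery configuration for the actual tasks and schedule.

# Hypothetical Celery task that re-runs the bill update;
# not the project's actual task definition.
from celery import shared_task
from django.core.management import call_command

@shared_task
def update_bills():
    # Re-use the same Django management command described below.
    call_command("update_bill")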
These instructions use the scrapers maintained in the unitedstates/congress repository (https://github.com/unitedstates/congress).
1. Set up the database
Before downloading bill text and metadata with Django commands, ensure that the PostgreSQL database is set up and ready.
Please see django_database.adoc for details.
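As a rough illustration only (the actual configuration belongs in django_database.adoc and your environment variables), a PostgreSQL setup in Django looks like this; all names and credentials below are placeholders.

# Illustrative Django DATABASES setting for PostgreSQL; the values are
# placeholders, not this project's actual configuration.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "flatgov",
        "USER": "flatgov",
        "PASSWORD": "changeme",
        "HOST": "localhost",
        "PORT": "5432",
    }
}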
2. Activate the virtual environment and check that the required modules can be imported.
If you get a "module not found" error, make sure that:

* you are in the Python virtual environment
* the environment variables are set
* the dependencies in requirements.txt have been installed in the virtual environment
$ cd /path/to/FlatGov/server_py
$ pip install -r requirements.txt
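One quick way to verify the imports is a short check run from inside the virtual environment; the module names below are illustrative, so adjust them to match requirements.txt.

# Sanity check that key modules resolve inside the virtualenv.
import importlib

for mod in ("django", "celery", "elasticsearch"):
    try:
        importlib.import_module(mod)
        print(f"{mod}: OK")
    except ImportError as err:
        print(f"{mod}: MISSING ({err})")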
3. Download bill text and metadata
Run the Django command:

$ python manage.py update_bill

It will download bills for the 113th–116th Congresses using the unitedstates/congress open-source scraper and store the data locally.
This process will take several hours.
A logging system is in place, so the download status can be tracked via the admin console.
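As a sketch of how such tracking can work, a status record might look like the model below; the model and field names are assumptions, not the project's actual schema.

# Hypothetical model backing admin-console download tracking;
# the real schema may differ.
from django.db import models

class DownloadStatus(models.Model):
    command = models.CharField(max_length=100)
    started_at = models.DateTimeField(auto_now_add=True)
    finished_at = models.DateTimeField(null=True, blank=True)
    status = models.CharField(max_length=20, default="running")

    def __str__(self):
        return f"{self.command}: {self.status}"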
Update bill data in the database and index into Elasticsearch
Create the billsMeta.json file via the Django command:
$ python manage.py bill_data
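To confirm the file was created, you can inspect it from a Python shell. The path is assumed to be relative to the Django base directory, and the entry structure is not documented here, so this only counts the entries.

# Quick inspection of the generated metadata file (path assumed).
import json

with open("billsMeta.json") as f:
    bills_meta = json.load(f)
print(f"{len(bills_meta)} bill entries loaded")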
Process bill metadata:
$ python manage.py process_bill_meta
Create related bills
$ python manage.py related_bills
Load the new bills into Elasticsearch:
$ python manage.py elastic_load --uscongress
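To verify the load, you can count the indexed documents. The host and index name below are assumptions; substitute the index this project actually uses.

# Check that documents landed in Elasticsearch (index name assumed).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
print(es.count(index="bills"))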
Calculate bill similarity and store it in the database:
$ python manage.py bill_similarity --uscongress
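For intuition only, document similarity is commonly scored with TF-IDF vectors and cosine similarity, as in the sketch below; this is not necessarily the method that bill_similarity implements.

# Illustrative TF-IDF + cosine similarity between two bill texts;
# shown for intuition, not as this project's actual algorithm.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "A bill to amend title 18, United States Code ...",
    "A bill to amend title 26, United States Code ...",
]
tfidf = TfidfVectorizer().fit_transform(texts)
print(cosine_similarity(tfidf[0:1], tfidf[1:2]))  # value in [0, 1]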
Alternative: Manual scraping
This is only necessary if you want the text files (e.g. .txt, .html, .pdf) in the congress directory. It is more complicated, and it adds ~10 GB of data beyond the downloads from ProPublica, which are automated in the steps above.
The text of bills can be scraped with the Python project at https://github.com/unitedstates/congress. First, clone this repository as a child of the FlatGovDir directory above. In this way, running the scraper will fill out the text data within the congress directory.
$ cd /path/to/FlatGovDir
$ git clone https://github.com/unitedstates/congress.git

No credentials are needed; it is an open repository.
Install scraper dependencies
In the congress Python virtual environment, install the requirements (pip install -r requirements.txt). The scrapers were built with Python 2.7 and have not been upgraded; updates may be needed for a production environment, but the @unitedstates/congress scraper is sufficient to gather the baseline data to test the utilities in this repository.
NOTE: Scraping the initial data can be very time-consuming (most of a day, depending on your internet download speeds). To get started, it is worth finding a source for bulk downloads of the text, if possible.
On macOS (Catalina), installing the congress requirements involved a few adjustments:
Install OpenSSL 1.0.2 with Homebrew. The latest OpenSSL (>1.1) causes problems with certain requirements; unfortunately, version 1.0.0 also failed. A GitHub user set up a script to install version 1.0.2:
$ brew uninstall openssl --ignore-dependencies
$ brew uninstall openssl --ignore-dependencies
$ brew uninstall libressl --ignore-dependencies
$ brew install https://raw.githubusercontent.com/Homebrew/homebrew-core/8b9d6d688f483a0f33fcfc93d433de501b9c3513/Formula/openssl.rb
Link the OpenSSL libraries:

$ export LDFLAGS="-L/usr/local/opt/openssl/lib"
$ export CPPFLAGS="-I/usr/local/opt/openssl/include"
Install a few of the requirements individually:

$ pip install pytz
$ pip install pep517
$ pip install cryptography
Then, in the congress repository directory, install the rest:

$ pip install -r requirements.txt
Run the scrapers:

$ ./run govinfo --bulkdata=BILLSTATUS
$ ./run bills
When running initially, I got an error because the bulk directories had not been created. To unzip the files manually in all directories:
find . -name "*.zip" | xargs -P 5 -I fileName sh -c 'unzip -o -d "$(dirname "fileName")/$(basename -s .zip "fileName")" "fileName"'
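An equivalent pure-Python alternative, if the find/xargs pipeline is inconvenient on your system:

# Unzip every .zip file into a sibling directory named after the archive,
# mirroring the shell one-liner above.
import pathlib
import zipfile

for zpath in pathlib.Path(".").rglob("*.zip"):
    dest = zpath.parent / zpath.stem
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zpath) as zf:
        zf.extractall(dest)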
Statements of Administration Policy
Instructions for loading the database fixture for the Statements of Administration Policy are in the
DATA BACKGROUND document, here: DATA BACKGROUND: Statement of Administration Policy.
The scraper for CRS Reports, and its instructions, are described in CRS_REPORTS_SCRAPER.
Relevant Committee Documents
To load the Relevant Committee Documents data, use the following instructions:

1. After installing the requirements in the scrapers directory, run crec_scrape_urls.py from that directory.
2. Go to the crec_scrapy folder and run the "scrapy crawl crec" command. It will take about an hour to scrape all the data into the crec_scrapy/data/crec_data.json file.
3. Copy the scraped data from crec_scrapy/data/crec_data.json to the Django base directory. Delete the old data under the Django base directory first, or replace it.
4. Run the Django command "./manage.py load_crec" to populate the database (a rough sketch of what such a command looks like follows below).
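For orientation only, a JSON-loading management command generally has the following shape; this is a hypothetical sketch, not the actual implementation of load_crec, and the file handling is an assumption.

# Hypothetical sketch of a management command that loads crec_data.json.
import json

from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = "Load scraped committee-document data into the database"

    def handle(self, *args, **options):
        with open("crec_data.json") as f:
            records = json.load(f)
        # Real code would create or update model instances here.
        self.stdout.write(f"Read {len(records)} records from crec_data.json")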