if (!require(devtools)) install.packages("devtools")

devtools::install_github("snfagora/MapAgora")
library(MapAgora)

About Pages from Websites

To help track information about nonprofits that is not captured in their tax filings, we developed code to extract the text that organizations use to describe themselves on their websites.

Find About Pages

The most common place to find informative text about an organization is on its website, particularly its About Page. The About Page often contains information about the group’s mission, notable activities, and history.

There is no single standard way to find an About Page, but a few common patterns predominate. Our strategy is to begin by extracting all of the links from an organization’s home page. Then, we look for links containing the string “about” or “who.” If neither pattern appears, we fall back to other, less direct patterns.
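
As an illustration, here is a minimal sketch of that link-scanning heuristic using the rvest package. This is not MapAgora’s internal implementation; the site URL and the “about|who” pattern are only examples, and relative links would still need to be resolved against the home page URL.

# sketch: collect every link on the home page and keep those whose URL or
# visible text mentions "about" or "who"
library(rvest)

home <- read_html("http://www.ibew567.com")
links <- html_elements(home, "a")
hrefs <- html_attr(links, "href")
labels <- html_text2(links)

is_about <- grepl("about|who", hrefs, ignore.case = TRUE) |
  grepl("about|who", labels, ignore.case = TRUE)

unique(hrefs[is_about & !is.na(hrefs)])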

Websites can have multiple About Pages, often nested under a common heading. We first need to identify all About Pages for a selected website.

# find this site's about pages
about_pages <- extract_about_links("http://www.ibew567.com")

Extract About Text

Websites typically contain a great deal of code that is never directly displayed to the user. In addition, the text that is displayed often includes menus, headers, footers, and other small snippets. Natural language processing (NLP) tools can help clean up this text, but the process is tedious and often produces low-quality output.

To extract the salient text, our approach is to filter out any text that does not form sentences of at least a minimum length. In addition, when extracting text from multiple pages, we identify and remove shared header and footer language so that it is not duplicated in the final corpus.
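
As an illustration, the sentence filter might look roughly like the following sketch, which uses rvest and base R. The five-word threshold and the restriction to paragraph and list-item elements are assumptions made for the example, not MapAgora’s actual settings.

# sketch: keep only chunks of visible text that read like sentences
library(rvest)

page <- read_html("http://www.ibew567.com")
chunks <- html_text2(html_elements(page, "p, li"))

# a chunk counts as salient if it ends in sentence punctuation and
# contains at least five words (illustrative threshold)
word_counts <- lengths(strsplit(chunks, "\\s+"))
salient <- chunks[grepl("[.!?]\\s*$", chunks) & word_counts >= 5]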

# find the combined text of about pages and home page
org_web_text <- get_all_texts("http://www.ibew567.com")