
Fix SEC IAPD scraping after website rebuild to Angular SPA #17

Open
abresler wants to merge 2 commits into master from fix/sec-iapd-website-rebuild

Conversation

@abresler (Owner) commented:

Summary

  • Add new REST API integration for SEC IAPD (api.adviserinfo.sec.gov) to retrieve basic manager data
  • Update HTML scraping functions to use files.adviserinfo.sec.gov domain where legacy ASP.NET pages still work
  • Fix .parse_sec_manager_pdf_url() to correctly extract brochure URLs from the new domain
  • Fix multiple bugs in Form ADV section scrapers (deprecated tidyr syntax, NULL handling, checkbox matching)
  • All 18 Form ADV section scraping functions now pass tests
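
The new REST API path described above can be sketched roughly as follows. The endpoint path, query shape, and helper name here are assumptions for illustration; only the api.adviserinfo.sec.gov domain comes from this PR:

```r
library(httr)
library(jsonlite)
library(tibble)

# Hypothetical sketch of fetching basic manager data for a CRD number.
# The URL template below is an assumption, not the PR's actual endpoint.
get_iapd_basic_info <- function(crd) {
  url <- sprintf("https://api.adviserinfo.sec.gov/search/individual/%s", crd)
  resp <- GET(url, user_agent("fundManageR"))
  if (http_error(resp)) {
    return(tibble())  # fail soft, matching the package's empty-tibble convention
  }
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))
}
```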

Test plan

  • Verified all 18 Form ADV section scrapers work with test CRD 156663
  • Verified .parse_sec_manager_pdf_url() extracts brochure URL correctly
  • Verified flat tibble output (97-111 columns per CRD)

🤖 Generated with Claude Code

The SEC rebuilt their IAPD website from ASP.NET to an Angular SPA, breaking
all HTML scraping. This update:

- Add new REST API integration (api.adviserinfo.sec.gov) for basic data
- Update HTML scraping to use files.adviserinfo.sec.gov (legacy pages still work)
- Fix .parse_sec_manager_pdf_url() to use correct domain
- Fix deprecated unnest() calls (tidyr syntax update)
- Fix Section 8 checkbox matching (left_join instead of bind_cols)
- Fix Schedule D NULL handling in dataTable entries
- Fix Signature Page undefined variable error
- Add conditional checks for optional columns (idLEI, pctClientsNonUS)
- Return flat tibbles instead of nested dataTable columns
- All 18 Form ADV section scrapers now working
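
The Section 8 checkbox fix in the list above swaps positional binding for a keyed join. A minimal illustration of the difference, using made-up checkbox data rather than the PR's actual tables:

```r
library(dplyr)

questions <- tibble(idQuestion  = c("8A1", "8B2", "8C3"),
                    description = c("q1", "q2", "q3"))

# Scraped checkboxes can come back reordered or incomplete
checked <- tibble(idQuestion = c("8C3", "8A1"),
                  isChecked  = c(TRUE, TRUE))

# bind_cols() pairs rows purely by position (and errors on mismatched
# lengths); left_join() matches on the question id instead
answers <- questions %>%
  left_join(checked, by = "idQuestion") %>%
  mutate(isChecked = coalesce(isChecked, FALSE))
```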

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings January 21, 2026 21:47
```diff
 table_name <-
   list('manager',
-       all_data$nameADVPage %>% str_replace_all('\\ ', '')) %>%
+       all_data$nameTable %>% str_replace_all('\\ ', '')) %>%
```
abresler (Owner, Author) commented:

Bug: Column Name Mismatch

This line references all_data$nameTable, but when the HTML scraping path is used (which is the primary path), the data structure has a column named nameADVPage, not nameTable.

Condition when bug occurs:

  1. section_names has exactly 1 element, AND
  2. assign_to_environment = TRUE, AND
  3. HTML scraping succeeds (not API fallback)

Result: all_data$nameTable returns NULL, causing str_replace_all() to fail.

Suggested fix:

Suggested change:

```diff
-all_data$nameTable %>% str_replace_all('\\ ', '')) %>%
+all_data$nameADVPage %>% str_replace_all('\\ ', '')) %>%
```

Or unify the column naming between HTML scraping and API paths.



Copilot AI left a comment


Pull request overview

This PR fixes SEC IAPD scraping functionality after the SEC completely rebuilt their Investment Adviser Public Disclosure (IAPD) website from ASP.NET to an Angular Single Page Application. The changes add new REST API integration while maintaining backward compatibility with legacy HTML scraping where available.

Changes:

  • Added new REST API integration for api.adviserinfo.sec.gov to retrieve basic manager data
  • Updated HTML scraping to use files.adviserinfo.sec.gov domain where legacy pages still function
  • Fixed multiple bugs in Form ADV section scrapers including deprecated tidyr syntax, NULL handling, and checkbox matching
  • Replaced tabulizer package dependency with tabulapdf across the codebase
  • Added CLAUDE.md documentation file for AI assistant guidance

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

Summary per file:

  • R/adv_functions.R: Major refactor; added 350+ lines of new API integration functions, updated all URLs to files.adviserinfo.sec.gov, fixed NULL handling in the idLEI check, fixed deprecated unnest() calls, improved checkbox matching logic, and added an API fallback for when HTML scraping fails
  • R/sec_functions.R: Package import update; replaced tabulizer with tabulapdf in 4 locations
  • R/nareit.R: Package import update; replaced tabulizer with tabulapdf in 4 locations
  • DESCRIPTION: Updated RoxygenNote version and replaced the tabulizer dependency with tabulapdf
  • NAMESPACE: Updated import statement from tabulizer to tabulapdf
  • CLAUDE.md: New documentation file providing package overview, architecture, and development guidance for AI assistants



```r
names(data)

data
```

Copilot AI commented on Jan 21, 2026:

This function .parse_crd_json appears to be incomplete or unused. It parses JSON data and extracts the data object but doesn't return it explicitly (line 809 just evaluates data without returning it). Additionally, the function is not called anywhere in the visible changes. Consider either removing this function if it's not needed, or ensuring it properly returns the data with an explicit return(data) statement.

Suggested change:

```diff
-data
+return(data)
```

```r
  str_replace_all('&', 'AND') %>%
  unique()
}
signature_df <- tibble(nameEntityManagerSignatory = NA_character_)
```

Copilot AI commented on Jan 21, 2026:
The initialization signature_df <- tibble(nameEntityManagerSignatory = NA_character_) at line 5968 serves as a fallback, but this means when nodes has length 0 or 1, the function will return a tibble with only NA values without any indication that parsing failed. This could silently mask data extraction issues. Consider adding explicit handling for these edge cases, or at least logging a warning when the nodes length is unexpected (0, 1, or other values not handled by the subsequent if statements).
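
One way to surface the unexpected-node-count case instead of returning NAs silently, as the comment suggests. This is a sketch, not the PR's code, and the node counts actually handled by the subsequent if statements are assumed here:

```r
library(tibble)

n_nodes <- length(nodes)
if (!n_nodes %in% c(2, 3)) {  # assumed: the counts the later branches handle
  warning(sprintf(
    "Signature page parsing: unexpected node count (%d); returning NA placeholder",
    n_nodes
  ))
}
signature_df <- tibble(nameEntityManagerSignatory = NA_character_)
```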

Comment on lines +3508 to +3513:

```r
# Convert pctClientsNonUS to decimal if column exists
if ('pctClientsNonUS' %in% names(employee_count_df)) {
  employee_count_df <- employee_count_df %>%
    mutate(pctClientsNonUS = pctClientsNonUS / 100)
}
```
Copilot AI commented on Jan 21, 2026:

While this PR fixes one instance of deprecated mutate_at() syntax by replacing it with a conditional check, there are still numerous uses of deprecated dplyr functions like mutate_at() with .funs parameter throughout this file. The mutate_at() function was superseded by across() in dplyr 1.0.0 (2020). For consistency and to avoid future deprecation warnings when these functions are eventually removed, consider creating a follow-up issue to modernize all instances to use mutate(across(...)) patterns.
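
The superseded pattern and its across() replacement look like this (a generic example, not a specific line from the file):

```r
library(dplyr)

df <- tibble(pctA = c(10, 25), pctB = c(50, 75))

# Superseded (pre-dplyr-1.0 style), still works but warns eventually:
df %>% mutate_at(.vars = vars(starts_with("pct")), .funs = ~ . / 100)

# Modern equivalent:
df %>% mutate(across(starts_with("pct"), ~ .x / 100))
```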


### Data Access Patterns

Web scraping uses `rvest` with CSS selectors. API calls use `httr`/`curl` with `jsonlite` for JSON. Parallel operations use `furrr::future_map_dfr()` - enable with `future::plan(multiprocess)`.
Copilot AI commented on Jan 21, 2026:
The reference to future::plan(multiprocess) is outdated. The multiprocess strategy was deprecated in the future package version 1.20.0 (November 2020) and should be replaced with multisession for parallel processing on all operating systems. Update line 75 to recommend future::plan(multisession) instead.

Suggested change:

```diff
-Web scraping uses `rvest` with CSS selectors. API calls use `httr`/`curl` with `jsonlite` for JSON. Parallel operations use `furrr::future_map_dfr()` - enable with `future::plan(multiprocess)`.
+Web scraping uses `rvest` with CSS selectors. API calls use `httr`/`curl` with `jsonlite` for JSON. Parallel operations use `furrr::future_map_dfr()` - enable with `future::plan(multisession)`.
```

```r
if (is.null(api_data)) {
  return(tibble())
}
```

Copilot AI commented on Jan 21, 2026:
The function .get_manager_info_from_api doesn't validate that api_data$iacontent_parsed exists before accessing api_data$iacontent_parsed$basicInformation on line 114. If iacontent_parsed is NULL or missing, this will cause an error. Consider adding a check similar to what's done in .parse_api_sections_data (line 134) to verify the structure before accessing nested fields.

Suggested change:

```r
# Safely handle cases where iacontent_parsed or basicInformation are missing
if (is.null(api_data$iacontent_parsed) ||
    !is.list(api_data$iacontent_parsed) ||
    is.null(api_data$iacontent_parsed$basicInformation)) {
  return(tibble())
}
```

```r
typeIDSEC = basic_info$iaSECNumberType %||% NA_character_,
statusIA = basic_info$iaScope %||% NA_character_,
dateADVFiling = basic_info$advFilingDate %||% NA_character_,
hasPDF = basic_info$hasPdf == "Y"
```
Copilot AI commented on Jan 21, 2026:
Line 123 has a potential issue: when basic_info$hasPdf is NULL, the expression basic_info$hasPdf == "Y" will return logical(0) rather than FALSE, which could cause downstream issues. Consider using (basic_info$hasPdf %||% "N") == "Y" similar to line 153 to ensure it always returns a logical value.

Suggested change:

```diff
-hasPDF = basic_info$hasPdf == "Y"
+hasPDF = (basic_info$hasPdf %||% "N") == "Y"
```
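
The logical(0) pitfall the comment describes is easy to reproduce in a console:

```r
# rlang-style null-default operator (also built into base R >= 4.4)
`%||%` <- function(x, y) if (is.null(x)) y else x

basic_info <- list()                  # hasPdf absent, so $hasPdf is NULL
basic_info$hasPdf == "Y"              # logical(0): breaks if (...) downstream
(basic_info$hasPdf %||% "N") == "Y"   # FALSE: a safe scalar
```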

```diff
 .get_manager_sec_page <-
-  function(url = 'http://www.adviserinfo.sec.gov/IAPD/IAPDFirmSummary.aspx?ORG_PK=156663') {
+  function(url = 'https://files.adviserinfo.sec.gov/IAPD/IAPDFirmSummary.aspx?ORG_PK=156663') {
     httr::set_config(httr::config(ssl_verifypeer = 0L))
```
Copilot AI commented on Jan 21, 2026:
The call to httr::set_config(httr::config(ssl_verifypeer = 0L)) globally disables TLS certificate verification for all httr requests in the R session, allowing man-in-the-middle attackers with self-signed or invalid certificates to intercept and tamper with HTTPS traffic (including SEC IAPD data). Because this is a global setting, any other httr::GET/POST calls in the same process will also skip certificate checks, undermining the integrity and confidentiality guarantees of HTTPS. Replace this with the default secure verification behavior (or a narrowly scoped per-request configuration) so certificates are validated normally.

Suggested change:

```diff
-httr::set_config(httr::config(ssl_verifypeer = 0L))
```
…lution

Major changes:
- Add Ferrara-inspired console messaging system (cli-based)
- Fix adv_managers_filings() furrr/possibly namespace bug
- Fix pcaob_auditors() for changed PCAOB data structure
- Fix cb_unicorns() for new CB Insights city column
- Replace rio::import with curl-based .import_url_curl() helper
- Remove all scipen global option pollution (23 occurrences)
- Update CLAUDE.md with messaging documentation
- Modernize deprecated tidyverse patterns across 21 files

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>