Fix SEC IAPD scraping after website rebuild to Angular SPA#17
Conversation
The SEC rebuilt their IAPD website from ASP.NET to an Angular SPA, breaking all HTML scraping. This update:
- Add new REST API integration (api.adviserinfo.sec.gov) for basic data
- Update HTML scraping to use files.adviserinfo.sec.gov (legacy pages still work)
- Fix `.parse_sec_manager_pdf_url()` to use correct domain
- Fix deprecated `unnest()` calls (tidyr syntax update)
- Fix Section 8 checkbox matching (`left_join` instead of `bind_cols`)
- Fix Schedule D NULL handling in dataTable entries
- Fix Signature Page undefined variable error
- Add conditional checks for optional columns (idLEI, pctClientsNonUS)
- Return flat tibbles instead of nested dataTable columns
- All 18 Form ADV section scrapers now working

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
R/adv_functions.R
Outdated
```diff
 table_name <-
   list('manager',
-       all_data$nameADVPage %>% str_replace_all('\\ ', '')) %>%
+       all_data$nameTable %>% str_replace_all('\\ ', '')) %>%
```
Bug: Column Name Mismatch
This line references `all_data$nameTable`, but when the HTML scraping path is used (which is the primary path), the data structure has a column named `nameADVPage`, not `nameTable`.
Condition when bug occurs:
- `section_names` has exactly 1 element, AND
- `assign_to_environment = TRUE`, AND
- HTML scraping succeeds (not the API fallback)

Result: `all_data$nameTable` returns `NULL`, causing `str_replace_all()` to fail.
Suggested fix:
```diff
-       all_data$nameTable %>% str_replace_all('\\ ', '')) %>%
+       all_data$nameADVPage %>% str_replace_all('\\ ', '')) %>%
```
Or unify the column naming between HTML scraping and API paths.
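A sketch of the second option — normalizing the API path's column name so both paths expose `nameADVPage`. The helper and the column layout are assumptions based on this comment, not the package's actual code:

```r
library(dplyr)

# Hypothetical helper: rename the API path's `nameTable` column so both
# code paths agree on `nameADVPage`.
unify_manager_names <- function(all_data) {
  if ('nameTable' %in% names(all_data) && !'nameADVPage' %in% names(all_data)) {
    all_data <- rename(all_data, nameADVPage = nameTable)
  }
  all_data
}

api_style  <- tibble(nameTable = 'Example Advisers LLC')
html_style <- tibble(nameADVPage = 'Example Advisers LLC')

names(unify_manager_names(api_style))   # "nameADVPage"
names(unify_manager_names(html_style))  # "nameADVPage"
```

With the names unified, the downstream `table_name` construction works regardless of which path produced the data.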
Pull request overview
This PR fixes SEC IAPD scraping functionality after the SEC completely rebuilt their Investment Adviser Public Disclosure (IAPD) website from ASP.NET to an Angular Single Page Application. The changes add new REST API integration while maintaining backward compatibility with legacy HTML scraping where available.
Changes:
- Added new REST API integration for api.adviserinfo.sec.gov to retrieve basic manager data
- Updated HTML scraping to use files.adviserinfo.sec.gov domain where legacy pages still function
- Fixed multiple bugs in Form ADV section scrapers including deprecated tidyr syntax, NULL handling, and checkbox matching
- Replaced the `tabulizer` package dependency with `tabulapdf` across the codebase
- Added CLAUDE.md documentation file for AI assistant guidance
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| R/adv_functions.R | Major refactor: added 350+ lines of new API integration functions, updated all URLs to files.adviserinfo.sec.gov, fixed NULL handling in idLEI check, fixed deprecated unnest() calls, improved checkbox matching logic, added API fallback for when HTML scraping fails |
| R/sec_functions.R | Package import update: replaced tabulizer with tabulapdf in 4 locations |
| R/nareit.R | Package import update: replaced tabulizer with tabulapdf in 4 locations |
| DESCRIPTION | Updated RoxygenNote version and replaced tabulizer with tabulapdf dependency |
| NAMESPACE | Updated import statement from tabulizer to tabulapdf |
| CLAUDE.md | New documentation file providing package overview, architecture, and development guidance for AI assistants |
```r
    names(data)
    # …
    data
```
This function .parse_crd_json appears to be incomplete or unused. It parses JSON data and extracts the data object but doesn't return it explicitly (line 809 just evaluates data without returning it). Additionally, the function is not called anywhere in the visible changes. Consider either removing this function if it's not needed, or ensuring it properly returns the data with an explicit return(data) statement.
```diff
-    data
+    return(data)
```
```r
      str_replace_all('&', 'AND') %>%
      unique()
  }
  signature_df <- tibble(nameEntityManagerSignatory = NA_character_)
```
The initialization signature_df <- tibble(nameEntityManagerSignatory = NA_character_) at line 5968 serves as a fallback, but this means when nodes has length 0 or 1, the function will return a tibble with only NA values without any indication that parsing failed. This could silently mask data extraction issues. Consider adding explicit handling for these edge cases, or at least logging a warning when the nodes length is unexpected (0, 1, or other values not handled by the subsequent if statements).
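One way to make the fallback noisy rather than silent — a hedged sketch; `parse_signature_page` and its node handling are hypothetical stand-ins for the real scraper:

```r
library(tibble)

parse_signature_page <- function(nodes) {
  # Fallback row, as in the snippet above
  signature_df <- tibble(nameEntityManagerSignatory = NA_character_)
  if (length(nodes) <= 1) {
    # Warn so callers can tell an empty result from a parse failure
    warning(sprintf(
      'Signature Page: expected 2+ parsed nodes, got %d; returning NA row',
      length(nodes)
    ))
    return(signature_df)
  }
  # ...normal parsing of `nodes` would continue here...
  signature_df
}

suppressWarnings(parse_signature_page(character(0)))  # 1-row tibble of NA
```

The NA row still flows cleanly through downstream `bind_rows()` calls, but the warning surfaces in logs when the page structure changes again.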
```r
  # Convert pctClientsNonUS to decimal if column exists
  if ('pctClientsNonUS' %in% names(employee_count_df)) {
    employee_count_df <- employee_count_df %>%
      mutate(pctClientsNonUS = pctClientsNonUS / 100)
  }
```
While this PR fixes one instance of deprecated mutate_at() syntax by replacing it with a conditional check, there are still numerous uses of deprecated dplyr functions like mutate_at() with .funs parameter throughout this file. The mutate_at() function was superseded by across() in dplyr 1.0.0 (2020). For consistency and to avoid future deprecation warnings when these functions are eventually removed, consider creating a follow-up issue to modernize all instances to use mutate(across(...)) patterns.
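As an illustration of the modernization the comment suggests — a minimal sketch with invented column names, not code from the package:

```r
library(dplyr)

# Invented example columns; only the pct* prefix mirrors the file's naming.
df <- tibble(pctClientsNonUS = 50, pctOwnership = 25, nClients = 10)

# Superseded pattern (dplyr < 1.0.0 style), still present in this file:
old <- df %>% mutate_at(.vars = vars(starts_with('pct')), .funs = ~ .x / 100)

# Modern equivalent using across():
new <- df %>% mutate(across(starts_with('pct'), ~ .x / 100))

identical(old, new)  # TRUE
```

The `across()` form composes with `where()`, `everything()`, and other tidyselect helpers, so the migration is usually mechanical.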
### Data Access Patterns

Web scraping uses `rvest` with CSS selectors. API calls use `httr`/`curl` with `jsonlite` for JSON. Parallel operations use `furrr::future_map_dfr()` - enable with `future::plan(multiprocess)`.
The reference to future::plan(multiprocess) is outdated. The multiprocess strategy was deprecated in the future package version 1.20.0 (November 2020) and should be replaced with multisession for parallel processing on all operating systems. Update line 75 to recommend future::plan(multisession) instead.
```diff
-Web scraping uses `rvest` with CSS selectors. API calls use `httr`/`curl` with `jsonlite` for JSON. Parallel operations use `furrr::future_map_dfr()` - enable with `future::plan(multiprocess)`.
+Web scraping uses `rvest` with CSS selectors. API calls use `httr`/`curl` with `jsonlite` for JSON. Parallel operations use `furrr::future_map_dfr()` - enable with `future::plan(multisession)`.
```
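The replacement strategy in action — a minimal, hedged sketch (the mapped function is a stand-in, not one of the package's scrapers):

```r
library(future)
library(furrr)

# multisession superseded the deprecated multiprocess strategy (future
# 1.20.0) and works on Windows, macOS, and Linux alike.
plan(multisession, workers = 2)

result <- future_map_dfr(1:3, function(i) data.frame(id = i, sq = i^2))

plan(sequential)  # restore the default single-process strategy

nrow(result)  # 3
```

Restoring `plan(sequential)` afterwards avoids leaving background R sessions running for the rest of the user's session.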
```r
    if (is.null(api_data)) {
      return(tibble())
    }
```
The function .get_manager_info_from_api doesn't validate that api_data$iacontent_parsed exists before accessing api_data$iacontent_parsed$basicInformation on line 114. If iacontent_parsed is NULL or missing, this will cause an error. Consider adding a check similar to what's done in .parse_api_sections_data (line 134) to verify the structure before accessing nested fields.
```r
    # Safely handle cases where iacontent_parsed or basicInformation are missing
    if (is.null(api_data$iacontent_parsed) ||
        !is.list(api_data$iacontent_parsed) ||
        is.null(api_data$iacontent_parsed$basicInformation)) {
      return(tibble())
    }
```
```r
      typeIDSEC = basic_info$iaSECNumberType %||% NA_character_,
      statusIA = basic_info$iaScope %||% NA_character_,
      dateADVFiling = basic_info$advFilingDate %||% NA_character_,
      hasPDF = basic_info$hasPdf == "Y"
```
Line 123 has a potential issue: when basic_info$hasPdf is NULL, the expression basic_info$hasPdf == "Y" will return logical(0) rather than FALSE, which could cause downstream issues. Consider using (basic_info$hasPdf %||% "N") == "Y" similar to line 153 to ensure it always returns a logical value.
```diff
-      hasPDF = basic_info$hasPdf == "Y"
+      hasPDF = (basic_info$hasPdf %||% "N") == "Y"
```
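A quick demonstration of the zero-length comparison described above. `%||%` is defined inline so the snippet is self-contained; the package presumably gets it from rlang (base R >= 4.4 also provides it):

```r
# Null-default operator: return y when x is NULL, else x.
`%||%` <- function(x, y) if (is.null(x)) y else x

basic_info <- list(iaScope = 'ACTIVE')  # hasPdf element absent -> NULL

basic_info$hasPdf == "Y"             # logical(0): zero-length, not FALSE
(basic_info$hasPdf %||% "N") == "Y"  # FALSE: always a length-1 logical
```

A `logical(0)` assigned into a tibble column errors or silently drops the row, which is why the guarded form matters here.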
```diff
 .get_manager_sec_page <-
-  function(url = 'http://www.adviserinfo.sec.gov/IAPD/IAPDFirmSummary.aspx?ORG_PK=156663') {
+  function(url = 'https://files.adviserinfo.sec.gov/IAPD/IAPDFirmSummary.aspx?ORG_PK=156663') {
     httr::set_config(httr::config(ssl_verifypeer = 0L))
```
The call to httr::set_config(httr::config(ssl_verifypeer = 0L)) globally disables TLS certificate verification for all httr requests in the R session, allowing man-in-the-middle attackers with self-signed or invalid certificates to intercept and tamper with HTTPS traffic (including SEC IAPD data). Because this is a global setting, any other httr::GET/POST calls in the same process will also skip certificate checks, undermining the integrity and confidentiality guarantees of HTTPS. Replace this with the default secure verification behavior (or a narrowly scoped per-request configuration) so certificates are validated normally.
```diff
-    httr::set_config(httr::config(ssl_verifypeer = 0L))
```
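If some legacy endpoint ever genuinely needs a TLS workaround, a per-request config keeps the blast radius small — a sketch only; by default no override should be passed at all:

```r
library(httr)

# Per-request configuration object; passing it to a single GET() call
# scopes the option to that request instead of mutating global state.
# (Built here with verification explicitly ON; no request is executed.)
per_request_cfg <- config(ssl_verifypeer = 1L)

# resp <- GET(url, per_request_cfg)  # hypothetical usage

class(per_request_cfg)  # "request"
```

Unlike `set_config()`, this leaves certificate verification intact for every other `httr` call in the session.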
…lution

Major changes:
- Add Ferrara-inspired console messaging system (cli-based)
- Fix adv_managers_filings() furrr/possibly namespace bug
- Fix pcaob_auditors() for changed PCAOB data structure
- Fix cb_unicorns() for new CB Insights city column
- Replace rio::import with curl-based .import_url_curl() helper
- Remove all scipen global option pollution (23 occurrences)
- Update CLAUDE.md with messaging documentation
- Modernize deprecated tidyverse patterns across 21 files

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
Test plan
🤖 Generated with Claude Code