Fix SEC IAPD scraping after website rebuild to Angular SPA#17
Conversation
The SEC rebuilt their IAPD website from ASP.NET to an Angular SPA, breaking all HTML scraping. This update:
- Add new REST API integration (api.adviserinfo.sec.gov) for basic data
- Update HTML scraping to use files.adviserinfo.sec.gov (legacy pages still work)
- Fix `.parse_sec_manager_pdf_url()` to use correct domain
- Fix deprecated `unnest()` calls (tidyr syntax update)
- Fix Section 8 checkbox matching (`left_join` instead of `bind_cols`)
- Fix Schedule D NULL handling in dataTable entries
- Fix Signature Page undefined variable error
- Add conditional checks for optional columns (idLEI, pctClientsNonUS)
- Return flat tibbles instead of nested dataTable columns
- All 18 Form ADV section scrapers now working

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
R/adv_functions.R
Outdated
```diff
 table_name <-
   list('manager',
-       all_data$nameADVPage %>% str_replace_all('\\ ', '')) %>%
+       all_data$nameTable %>% str_replace_all('\\ ', '')) %>%
```
Bug: Column Name Mismatch
This line references `all_data$nameTable`, but when the HTML scraping path is used (which is the primary path), the data structure has a column named `nameADVPage`, not `nameTable`.
Condition when bug occurs:
- `section_names` has exactly 1 element, AND
- `assign_to_environment = TRUE`, AND
- HTML scraping succeeds (not the API fallback)

Result: `all_data$nameTable` returns `NULL`, causing `str_replace_all()` to fail.
Suggested fix:
```diff
-       all_data$nameTable %>% str_replace_all('\\ ', '')) %>%
+       all_data$nameADVPage %>% str_replace_all('\\ ', '')) %>%
```
Or unify the column naming between HTML scraping and API paths.
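A sketch of the second option — normalizing the API path's column name so both paths expose `nameADVPage`. The helper and the column layout are assumptions based on this comment, not the package's actual code:

```r
library(dplyr)

# Hypothetical helper: rename the API path's `nameTable` column so both
# code paths agree on `nameADVPage`.
unify_manager_names <- function(all_data) {
  if ('nameTable' %in% names(all_data) && !'nameADVPage' %in% names(all_data)) {
    all_data <- rename(all_data, nameADVPage = nameTable)
  }
  all_data
}

api_style  <- tibble(nameTable = 'Example Advisers LLC')
html_style <- tibble(nameADVPage = 'Example Advisers LLC')

names(unify_manager_names(api_style))   # "nameADVPage"
names(unify_manager_names(html_style))  # "nameADVPage"
```

With the names unified, the downstream `table_name` construction works regardless of which path produced the data.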
Pull request overview
This PR fixes SEC IAPD scraping functionality after the SEC completely rebuilt their Investment Adviser Public Disclosure (IAPD) website from ASP.NET to an Angular Single Page Application. The changes add new REST API integration while maintaining backward compatibility with legacy HTML scraping where available.
Changes:
- Added new REST API integration for api.adviserinfo.sec.gov to retrieve basic manager data
- Updated HTML scraping to use files.adviserinfo.sec.gov domain where legacy pages still function
- Fixed multiple bugs in Form ADV section scrapers including deprecated tidyr syntax, NULL handling, and checkbox matching
- Replaced the `tabulizer` package dependency with `tabulapdf` across the codebase
- Added CLAUDE.md documentation file for AI assistant guidance
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| R/adv_functions.R | Major refactor: added 350+ lines of new API integration functions, updated all URLs to files.adviserinfo.sec.gov, fixed NULL handling in idLEI check, fixed deprecated unnest() calls, improved checkbox matching logic, added API fallback for when HTML scraping fails |
| R/sec_functions.R | Package import update: replaced tabulizer with tabulapdf in 4 locations |
| R/nareit.R | Package import update: replaced tabulizer with tabulapdf in 4 locations |
| DESCRIPTION | Updated RoxygenNote version and replaced tabulizer with tabulapdf dependency |
| NAMESPACE | Updated import statement from tabulizer to tabulapdf |
| CLAUDE.md | New documentation file providing package overview, architecture, and development guidance for AI assistants |
```r
    names(data)
    # …
    data
```
This function .parse_crd_json appears to be incomplete or unused. It parses JSON data and extracts the data object but doesn't return it explicitly (line 809 just evaluates data without returning it). Additionally, the function is not called anywhere in the visible changes. Consider either removing this function if it's not needed, or ensuring it properly returns the data with an explicit return(data) statement.
```diff
-    data
+    return(data)
```
```r
      str_replace_all('&', 'AND') %>%
      unique()
  }
  signature_df <- tibble(nameEntityManagerSignatory = NA_character_)
```
The initialization signature_df <- tibble(nameEntityManagerSignatory = NA_character_) at line 5968 serves as a fallback, but this means when nodes has length 0 or 1, the function will return a tibble with only NA values without any indication that parsing failed. This could silently mask data extraction issues. Consider adding explicit handling for these edge cases, or at least logging a warning when the nodes length is unexpected (0, 1, or other values not handled by the subsequent if statements).
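One way to make the fallback noisy rather than silent — a hedged sketch; `parse_signature_page` and its node handling are hypothetical stand-ins for the real scraper:

```r
library(tibble)

parse_signature_page <- function(nodes) {
  # Fallback row, as in the snippet above
  signature_df <- tibble(nameEntityManagerSignatory = NA_character_)
  if (length(nodes) <= 1) {
    # Warn so callers can tell an empty result from a parse failure
    warning(sprintf(
      'Signature Page: expected 2+ parsed nodes, got %d; returning NA row',
      length(nodes)
    ))
    return(signature_df)
  }
  # ...normal parsing of `nodes` would continue here...
  signature_df
}

suppressWarnings(parse_signature_page(character(0)))  # 1-row tibble of NA
```

The NA row still flows cleanly through downstream `bind_rows()` calls, but the warning surfaces in logs when the page structure changes again.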
```r
  # Convert pctClientsNonUS to decimal if column exists
  if ('pctClientsNonUS' %in% names(employee_count_df)) {
    employee_count_df <- employee_count_df %>%
      mutate(pctClientsNonUS = pctClientsNonUS / 100)
  }
```
While this PR fixes one instance of deprecated mutate_at() syntax by replacing it with a conditional check, there are still numerous uses of deprecated dplyr functions like mutate_at() with .funs parameter throughout this file. The mutate_at() function was superseded by across() in dplyr 1.0.0 (2020). For consistency and to avoid future deprecation warnings when these functions are eventually removed, consider creating a follow-up issue to modernize all instances to use mutate(across(...)) patterns.
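As an illustration of the modernization the comment suggests — a minimal sketch with invented column names, not code from the package:

```r
library(dplyr)

# Invented example columns; only the pct* prefix mirrors the file's naming.
df <- tibble(pctClientsNonUS = 50, pctOwnership = 25, nClients = 10)

# Superseded pattern (dplyr < 1.0.0 style), still present in this file:
old <- df %>% mutate_at(.vars = vars(starts_with('pct')), .funs = ~ .x / 100)

# Modern equivalent using across():
new <- df %>% mutate(across(starts_with('pct'), ~ .x / 100))

identical(old, new)  # TRUE
```

The `across()` form composes with `where()`, `everything()`, and other tidyselect helpers, so the migration is usually mechanical.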
### Data Access Patterns

Web scraping uses `rvest` with CSS selectors. API calls use `httr`/`curl` with `jsonlite` for JSON. Parallel operations use `furrr::future_map_dfr()` - enable with `future::plan(multiprocess)`.
The reference to future::plan(multiprocess) is outdated. The multiprocess strategy was deprecated in the future package version 1.20.0 (November 2020) and should be replaced with multisession for parallel processing on all operating systems. Update line 75 to recommend future::plan(multisession) instead.
```diff
-Web scraping uses `rvest` with CSS selectors. API calls use `httr`/`curl` with `jsonlite` for JSON. Parallel operations use `furrr::future_map_dfr()` - enable with `future::plan(multiprocess)`.
+Web scraping uses `rvest` with CSS selectors. API calls use `httr`/`curl` with `jsonlite` for JSON. Parallel operations use `furrr::future_map_dfr()` - enable with `future::plan(multisession)`.
```
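The replacement strategy in action — a minimal, hedged sketch (the mapped function is a stand-in, not one of the package's scrapers):

```r
library(future)
library(furrr)

# multisession superseded the deprecated multiprocess strategy (future
# 1.20.0) and works on Windows, macOS, and Linux alike.
plan(multisession, workers = 2)

result <- future_map_dfr(1:3, function(i) data.frame(id = i, sq = i^2))

plan(sequential)  # restore the default single-process strategy

nrow(result)  # 3
```

Restoring `plan(sequential)` afterwards avoids leaving background R sessions running for the rest of the user's session.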
```r
    if (is.null(api_data)) {
      return(tibble())
    }
```
The function .get_manager_info_from_api doesn't validate that api_data$iacontent_parsed exists before accessing api_data$iacontent_parsed$basicInformation on line 114. If iacontent_parsed is NULL or missing, this will cause an error. Consider adding a check similar to what's done in .parse_api_sections_data (line 134) to verify the structure before accessing nested fields.
```r
    # Safely handle cases where iacontent_parsed or basicInformation are missing
    if (is.null(api_data$iacontent_parsed) ||
        !is.list(api_data$iacontent_parsed) ||
        is.null(api_data$iacontent_parsed$basicInformation)) {
      return(tibble())
    }
```
```r
      typeIDSEC = basic_info$iaSECNumberType %||% NA_character_,
      statusIA = basic_info$iaScope %||% NA_character_,
      dateADVFiling = basic_info$advFilingDate %||% NA_character_,
      hasPDF = basic_info$hasPdf == "Y"
```
Line 123 has a potential issue: when basic_info$hasPdf is NULL, the expression basic_info$hasPdf == "Y" will return logical(0) rather than FALSE, which could cause downstream issues. Consider using (basic_info$hasPdf %||% "N") == "Y" similar to line 153 to ensure it always returns a logical value.
```diff
-      hasPDF = basic_info$hasPdf == "Y"
+      hasPDF = (basic_info$hasPdf %||% "N") == "Y"
```
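A quick demonstration of the zero-length comparison described above. `%||%` is defined inline so the snippet is self-contained; the package presumably gets it from rlang (base R >= 4.4 also provides it):

```r
# Null-default operator: return y when x is NULL, else x.
`%||%` <- function(x, y) if (is.null(x)) y else x

basic_info <- list(iaScope = 'ACTIVE')  # hasPdf element absent -> NULL

basic_info$hasPdf == "Y"             # logical(0): zero-length, not FALSE
(basic_info$hasPdf %||% "N") == "Y"  # FALSE: always a length-1 logical
```

A `logical(0)` assigned into a tibble column errors or silently drops the row, which is why the guarded form matters here.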
```diff
 .get_manager_sec_page <-
-  function(url = 'http://www.adviserinfo.sec.gov/IAPD/IAPDFirmSummary.aspx?ORG_PK=156663') {
+  function(url = 'https://files.adviserinfo.sec.gov/IAPD/IAPDFirmSummary.aspx?ORG_PK=156663') {
     httr::set_config(httr::config(ssl_verifypeer = 0L))
```
The call to httr::set_config(httr::config(ssl_verifypeer = 0L)) globally disables TLS certificate verification for all httr requests in the R session, allowing man-in-the-middle attackers with self-signed or invalid certificates to intercept and tamper with HTTPS traffic (including SEC IAPD data). Because this is a global setting, any other httr::GET/POST calls in the same process will also skip certificate checks, undermining the integrity and confidentiality guarantees of HTTPS. Replace this with the default secure verification behavior (or a narrowly scoped per-request configuration) so certificates are validated normally.
```diff
-    httr::set_config(httr::config(ssl_verifypeer = 0L))
```
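If some legacy endpoint ever genuinely needs a TLS workaround, a per-request config keeps the blast radius small — a sketch only; by default no override should be passed at all:

```r
library(httr)

# Per-request configuration object; passing it to a single GET() call
# scopes the option to that request instead of mutating global state.
# (Built here with verification explicitly ON; no request is executed.)
per_request_cfg <- config(ssl_verifypeer = 1L)

# resp <- GET(url, per_request_cfg)  # hypothetical usage

class(per_request_cfg)  # "request"
```

Unlike `set_config()`, this leaves certificate verification intact for every other `httr` call in the session.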
…lution

Major changes:
- Add Ferrara-inspired console messaging system (cli-based)
- Fix adv_managers_filings() furrr/possibly namespace bug
- Fix pcaob_auditors() for changed PCAOB data structure
- Fix cb_unicorns() for new CB Insights city column
- Replace rio::import with curl-based .import_url_curl() helper
- Remove all scipen global option pollution (23 occurrences)
- Update CLAUDE.md with messaging documentation
- Modernize deprecated tidyverse patterns across 21 files

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
Test plan
🤖 Generated with Claude Code