Data Preparation and Engineering

Architecture of the SEC Data Pipeline: From Unstructured Filings to Longitudinal Insight

To assess compliance rates over time, a specialized pipeline was developed to convert unstructured SEC EDGAR filings into a clean, longitudinal dataset. This process involved several critical steps to ensure data integrity over the 22-year study period.

Local Inventory and Regulated Acquisition: The researcher engineered the pipeline to follow a “Mirror Rule,” which keeps the retrieval process organized and efficient while maintaining full compliance with the SEC’s technical infrastructure requirements:

  • Local Inventory: Prior to initiating any external network request, the pipeline performs a recursive scan of its local storage environment. If a file associated with a unique Accession Number is already present under its standardized name, the system bypasses the download phase. Because the SEC enforces strict rate limits, this elimination of redundant data transfer preserves both local bandwidth and federal server resources.

  • Temporal Symmetry (Rate Throttling): The SEC infrastructure mandates a maximum request frequency to prevent server congestion. The Mirror Rule implements a fixed delay between requests to synchronize with the SEC’s operational tempo. This symmetry ensures a stable connection and prevents the abrupt disconnects that can compromise large-batch processing.

  • Identity Headers: To maintain transparent communication, every request includes a mandatory “User-Agent” string that identifies the research entity to the SEC servers. Example: SEC_HEADERS = {'User-Agent': 'University Research Project/AI Compliance Study'}. This explicitly identifies the research project and provides contact information to the regulator, establishing a “mirror” of accountability and trust between the researcher and the SEC.

  • Targeted URL Construction: The system merges a base SEC URL with specific paths found in the master metadata to point directly to the raw filings. Example: It combines the SEC archive link with a path like edgar/data/12345/00011-22-33.txt to locate a specific disclosure.
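The four rules above can be sketched as a single download routine. This is an illustrative sketch, not the study’s actual code: the names fetch_filing, SEC_ARCHIVE_BASE, and REQUEST_DELAY_SECONDS, and the 0.2-second delay, are assumptions; only the User-Agent header value comes from the text.

```python
import os
import time
import urllib.request

# Real EDGAR archive root; header value taken from the text above.
SEC_ARCHIVE_BASE = "https://www.sec.gov/Archives/"
SEC_HEADERS = {'User-Agent': 'University Research Project/AI Compliance Study'}
REQUEST_DELAY_SECONDS = 0.2  # assumed fixed delay between requests

def fetch_filing(relative_path, local_dir):
    """Download one filing unless a local copy already exists (the Mirror Rule)."""
    # Local Inventory: derive the standardized local name and check the disk first.
    local_path = os.path.join(local_dir, os.path.basename(relative_path))
    if os.path.exists(local_path):
        return local_path  # bypass the download entirely

    # Targeted URL Construction: base SEC URL + path from the master metadata.
    url = SEC_ARCHIVE_BASE + relative_path

    # Temporal Symmetry: throttle before every outbound request.
    time.sleep(REQUEST_DELAY_SECONDS)

    # Identity Headers: every request carries the User-Agent string.
    request = urllib.request.Request(url, headers=SEC_HEADERS)
    with urllib.request.urlopen(request) as response, open(local_path, "wb") as fh:
        fh.write(response.read())
    return local_path
```

When the file is already cached locally, the function returns before any network activity occurs, which is what makes large resumable batches cheap on both ends.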

Two-Stage Narrative Extraction: The researcher utilizes a dual-layered cleaning process to transform raw filings into structured data. The system first isolates the core document by identifying structural markers such as <TEXT> tags. It then strips away internal formatting markers and excessive whitespace, leaving behind only the human-readable narrative. Example: The system extracts a complex, tag-heavy SGML block and refines it into a clean statement of transaction.
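A minimal sketch of the two-stage cleaning, assuming the standard EDGAR SGML layout in which the narrative sits between <TEXT> markers; the function name is hypothetical.

```python
import re

def extract_narrative(raw_filing):
    """Two-stage cleaning: isolate the <TEXT> block, then strip residual markup."""
    # Stage 1: isolate the core document between the SGML <TEXT> markers.
    match = re.search(r"<TEXT>(.*?)</TEXT>", raw_filing,
                      flags=re.DOTALL | re.IGNORECASE)
    body = match.group(1) if match else raw_filing

    # Stage 2: strip internal tags and collapse excessive whitespace,
    # leaving only the human-readable narrative.
    body = re.sub(r"<[^>]+>", " ", body)
    return re.sub(r"\s+", " ", body).strip()
```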

Standardized Archival and Indexing: To ensure the dataset is optimized for downstream tasks, the system renames and categorizes every report using a strict naming convention. Every file is indexed by its CIK (Central Index Key), allowing the researcher to pause and resume the process without data loss. Example: A report for a Board Director is archived as f4_20260208_1234567_ID, allowing for immediate retrieval within the “Form 4” subdirectory.
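The naming convention in the example can be reproduced with a small helper. The function name and the meaning of the trailing suffix are assumptions inferred from the sample filename f4_20260208_1234567_ID.

```python
def archive_name(form_type, filing_date, cik, suffix="ID"):
    """Build the standardized archive name, e.g. f4_20260208_1234567_ID.

    form_type: the SEC form number (e.g. "4"), prefixed with "f".
    filing_date: YYYYMMDD string.
    cik: the filer's Central Index Key, the index used for pause/resume.
    suffix: trailing tag from the sample filename (meaning assumed).
    """
    return f"f{form_type}_{filing_date}_{cik}_{suffix}"
```

Because the CIK is embedded in every name, a restarted run can list the archive directory and skip every key it has already processed.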

Verifying the Truth: Accession Numbers and Identity

A primary challenge in processing historical filings is “field displacement,” where names and identifiers are often misaligned in the original text.

Accession Number Anchor: The system utilizes the Accession Number as a “truth anchor”. Because the first ten digits of the Accession Number encode the CIK automatically assigned to the filer, the pipeline can verify the identity of the Reporting Owner regardless of internal document disorganization.
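Under the standard accession-number format (e.g. 0001234567-22-000123), the anchor check reduces to reading the first ten digits; the helper below is a hypothetical sketch.

```python
def filer_cik_from_accession(accession_number):
    """Return the filer CIK encoded in the first ten digits of an Accession Number."""
    digits = accession_number.replace("-", "")  # normalize 0001234567-22-000123
    return int(digits[:10])  # leading segment is the filer's assigned CIK
```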

Master Verification: Every extracted CIK is cross-checked against the SEC Master Directory. Any identifier that does not match an official entity is excluded to maintain the highest level of precision.
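The master-directory cross-check amounts to set membership; the sketch below uses hypothetical names and assumes the official CIK list has already been loaded.

```python
def verify_ciks(extracted_ciks, master_ciks):
    """Keep only CIKs present in the SEC Master Directory; exclude the rest."""
    master = set(master_ciks)  # one-time hash build for O(1) lookups
    return [cik for cik in extracted_ciks if cik in master]
```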

Sorting Giants from Leaders

The pipeline distinguishes between different market participants by identifying corporate markers such as “LLC” or “Corp” within the entity names. This allows the researcher to compare the reporting behaviors of 52,743 Unique Owners against 86,668 Unique Institutions. To ensure computational efficiency, the final data is stored in Parquet format, reducing storage requirements by 70%.
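The owner-versus-institution split can be sketched as a token check against corporate markers. The source names only “LLC” and “Corp” explicitly; the remaining suffixes in the list, and the function name, are illustrative assumptions.

```python
# "LLC" and "CORP" come from the text; the other suffixes are assumed additions.
CORPORATE_MARKERS = ("LLC", "CORP", "INC", "LP", "TRUST", "FUND")

def classify_entity(name):
    """Tag a filer as an Institution if its name carries a corporate marker."""
    tokens = name.upper().replace(",", " ").replace(".", " ").split()
    return "Institution" if any(t in CORPORATE_MARKERS for t in tokens) else "Owner"
```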

The Conclusion: The Compliance Ceiling

The data pipeline reveals a persistent “Compliance Ceiling” that has remained stagnant for 22 years. This evidence suggests that humans alone cannot overcome the structural frictions of the regulatory environment.

Detailed Data Description