
Researchers at UC Merced, led by Dr. Christian Fons-Rosen, needed to extract structured data from over 10,000 declassified ARPA documents, many lengthy and inconsistently formatted, to study the early internet’s impact on science and innovation. Manual extraction was infeasible, and traditional scripting struggled with the data’s variability.
To automate the process, the team built an AI-powered pipeline using Amazon Bedrock and other AWS services. The solution uses:
- Amazon S3 for storing documents
- AWS Lambda and Amazon SQS for processing and managing tasks
- Claude (via Amazon Bedrock) to extract data fields such as contract numbers and institutions
- Amazon DynamoDB for storing structured results
By leveraging Bedrock’s large language models (LLMs), UC Merced automated complex document parsing without custom model training, saving thousands of hours and enabling faster research insights.
To learn more about the research, visit AWS's article.
