Project Title: Scraping Company Information from D&B
I am looking to extract roughly 140 million company records from this website: dnb.com/business-directory-sitemapindex.xml
There are 2,875 sitemaps in total, and each sitemap contains 50,000 company records.
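As a rough illustration of how the sitemap index could be walked, here is a minimal Python sketch. It assumes the index and the individual sitemaps follow the standard sitemaps.org XML schema; the function name `extract_locs` and the sample XML with example.com URLs are hypothetical and only there for demonstration. It also checks the record count implied by the figures above.

```python
import xml.etree.ElementTree as ET

# Assumption: the files use the standard sitemaps.org namespace.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_locs(xml_text: str) -> list[str]:
    """Return every <loc> value from a sitemap index or a URL sitemap."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]


# Hypothetical sample standing in for the real sitemap index.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-2.xml</loc></sitemap>
</sitemapindex>"""

sitemap_urls = extract_locs(SAMPLE_INDEX)

# Upper bound implied by the numbers in this brief.
SITEMAP_COUNT = 2875
URLS_PER_SITEMAP = 50_000
MAX_RECORDS = SITEMAP_COUNT * URLS_PER_SITEMAP  # 143,750,000

print(sitemap_urls)
print(f"{MAX_RECORDS:,}")
```

The same function works for each child sitemap, since both file types store their targets in `<loc>` elements.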
Below is a sample of the data to extract, taken from the following URL: dnb.com/business-directory/company-profiles.pfizer_inc.140f48fa0b37556f925afcaec7b5c566.html
1. Company Name: Pfizer Inc.
2. Company Website: pfizer.com
3. Company Description: Pfizer Inc. is one of the world’s largest research-based pharmaceuticals firm, producing medicines for cardiovascular health, metabolism, oncology, inflammation and immunology, and other areas, with about 10 products that fetch approximately $1 billion or more in annual revenue. Its top prescription products include cholesterol-lowering Lipitor, pain management drugs Celebrex and Lyrica, pneumonia vaccine Prevnar, and erectile dysfunction treatment Viagra, as well as arthritis drug Enbrel, antibiotic Zyvox, and blood-thinner Eliquis. The company also makes and sells generic drugs and consumer health products. Pfizer operates around the world and gets about 55% of its revenue from international customers.
4. DNB URL: dnb.com/business-directory/company-profiles.pfizer_inc.140f48fa0b37556f925afcaec7b5c566.html
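The four fields above could be modelled as one record type so that every scraped page yields the same columns. This is only a sketch: the class name `CompanyRecord` and the snake_case field names are assumptions, not names required by this brief, and the description is replaced by a placeholder string to keep the example short.

```python
from dataclasses import astuple, dataclass, fields


@dataclass(frozen=True)
class CompanyRecord:
    """One row of output; field names are illustrative assumptions."""
    company_name: str
    company_website: str
    company_description: str
    dnb_url: str


# Column order for the CSV header, derived from the dataclass itself.
CSV_HEADER = [f.name for f in fields(CompanyRecord)]

example = CompanyRecord(
    company_name="Pfizer Inc.",
    company_website="pfizer.com",
    company_description="(description text from the profile page)",  # placeholder
    dnb_url=(
        "dnb.com/business-directory/"
        "company-profiles.pfizer_inc.140f48fa0b37556f925afcaec7b5c566.html"
    ),
)

row = list(astuple(example))  # values in the same order as CSV_HEADER
```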
Alternatively, you may use the Business Directory (dnb.com/business-directory.html) instead of the sitemap. The Business Directory page is organized by industry, sub-industry, and country. How you extract the data is up to you.
You may require IP rotation, multiple scrapers, crawlers, concurrent extractions, etc. to extract all the records.
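One possible shape for concurrent extraction with rotating proxies is sketched below. It is not a complete scraper: `crawl` is a hypothetical helper, the `fetch` callable (which would perform the actual HTTP request through a given proxy) is supplied by the caller, and the proxy names are placeholders. Proxies are assigned round-robin before the work is submitted, so the rotation itself never runs inside the worker threads.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle
from typing import Callable, Iterable


def crawl(
    urls: Iterable[str],
    proxies: Iterable[str],
    fetch: Callable[[str, str], str],
    max_workers: int = 8,
) -> dict[str, str]:
    """Fetch every URL concurrently, rotating through the proxies round-robin."""
    proxy_pool = cycle(proxies)
    # Pair each URL with a proxy up front, in the calling thread.
    jobs = [(url, next(proxy_pool)) for url in urls]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() keeps results in input order; an exception in fetch propagates here.
        results = pool.map(lambda job: fetch(*job), jobs)
        return dict(zip((url for url, _ in jobs), results))


# Stand-in for a real HTTP call, so the sketch runs without network access.
def fake_fetch(url: str, proxy: str) -> str:
    return f"{url} via {proxy}"


pages = crawl(["a", "b", "c"], ["p1", "p2"], fake_fetch)
```

At this scale it would make sense to call such a helper once per sitemap (50,000 URLs) rather than on the full URL list, and to add retries and rate limiting around `fetch`.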
The data structure is the same across all the URLs.
Deliverables: Multiple CSV files
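A minimal sketch of how the output could be split across multiple CSV files follows. The column names, the `companies_00001.csv` naming pattern, and the default of 50,000 rows per file are assumptions chosen for illustration; the brief itself only asks for multiple CSV files.

```python
import csv
from itertools import islice
from pathlib import Path
from typing import Iterable

# Assumed column names for the four requested fields.
FIELDS = ["company_name", "company_website", "company_description", "dnb_url"]


def write_csv_chunks(
    rows: Iterable[dict], out_dir: str, rows_per_file: int = 50_000
) -> list[Path]:
    """Write rows to numbered CSV files with at most rows_per_file rows each."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    row_iter = iter(rows)
    written: list[Path] = []
    index = 1
    while True:
        chunk = list(islice(row_iter, rows_per_file))
        if not chunk:
            break
        path = out_path / f"companies_{index:05d}.csv"
        # newline="" lets the csv module control line endings itself.
        with path.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=FIELDS)
            writer.writeheader()
            writer.writerows(chunk)
        written.append(path)
        index += 1
    return written
```

Because the rows are consumed lazily through an iterator, the full data set never has to sit in memory at once.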
For similar work requirements, feel free to email us at firstname.lastname@example.org.