How do you harvest data from websites?
DCL starts with a human analysis of the target websites and content by our expert engineering team. We then use our in-house data harvesting software and custom scripts to scrape, harvest, restructure, and validate the collected data. Special care is taken to ensure harvesting does not overload the target services or accidentally cause a denial-of-service (DoS) condition.
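As an illustration of how a harvester can avoid overloading a target service, here is a minimal sketch of per-host request throttling. It is not DCL's in-house software; the class name `PoliteThrottle`, the `min_delay_seconds` parameter, and the default one-second delay are assumptions chosen for the example.

```python
import time
from urllib.parse import urlparse


class PoliteThrottle:
    """Enforce a minimum delay between successive requests to the same host.

    Hypothetical helper for illustration only; not DCL's actual tooling.
    """

    def __init__(self, min_delay_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay_seconds = min_delay_seconds
        self._clock = clock          # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_request = {}      # host -> time of the most recent request

    def wait(self, url):
        """Block until it is polite to request `url`; return the seconds waited."""
        host = urlparse(url).netloc
        last = self._last_request.get(host)
        waited = 0.0
        if last is not None:
            remaining = self.min_delay_seconds - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        # Record the moment this request is allowed to proceed.
        self._last_request[host] = self._clock()
        return waited
```

A crawler would call `throttle.wait(url)` immediately before each download. Tracking the delay per host means a slow-paced crawl of one site does not hold up requests to another.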
What source and file formats can data be harvested from?
Data harvesting can extract data from HTML, RTF, DOCX, TXT, XML, RSS, XLSX, CSV, and practically any other file format.
What types of data can be collected during data harvesting?
Data harvesting can gather text, metadata, images, videos, and other files from online sources.
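To make the distinction between text, metadata, and image references concrete, the sketch below pulls all three out of an HTML page using Python's standard `html.parser` module. The class name `PageHarvester` and its attribute names are assumptions for this example, and it is far simpler than a production harvester.

```python
from html.parser import HTMLParser


class PageHarvester(HTMLParser):
    """Collect visible text, <meta> name/content pairs, and image URLs.

    Hypothetical example class; not DCL's actual tooling.
    """

    def __init__(self):
        super().__init__()
        self.metadata = {}     # e.g. {"author": "..."} from <meta name=... content=...>
        self.images = []       # values of <img src=...>
        self.text_parts = []   # non-empty text fragments, in document order
        self._skip_depth = 0   # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.metadata[attrs["name"]] = attrs["content"]
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script/style bodies and whitespace-only fragments.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())
```

After `harvester = PageHarvester()` and `harvester.feed(html_text)`, the three attributes hold the collected text, metadata, and image URLs. Downloading the images or videos themselves would be a separate step.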
Can data harvesting produce structured data in a particular format?
DCL data harvesting can output the data in whatever format is desired. The most common formats are XML, DITA, HTML, and S1000D.
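As a simple illustration of turning harvested records into structured XML, the sketch below uses Python's standard `xml.etree.ElementTree` module. The function name `records_to_xml`, the `records` and `record` element names, and the sample field names are assumptions for this example; real deliverables in DITA or S1000D follow those standards' own schemas, which this sketch does not attempt to model.

```python
import xml.etree.ElementTree as ET


def records_to_xml(records, root_tag="records", item_tag="record"):
    """Serialize a list of flat dicts as XML, one child element per field.

    Hypothetical helper; assumes every dict key is a valid XML element name.
    """
    root = ET.Element(root_tag)
    for record in records:
        item = ET.SubElement(root, item_tag)
        for field, value in record.items():
            child = ET.SubElement(item, field)
            child.text = str(value)  # special characters are escaped on output
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    harvested = [{"title": "Pricing & Plans", "url": "https://example.com/pricing"}]
    print(records_to_xml(harvested))
```

Building the tree with a library rather than by string concatenation means characters such as `&` and `<` in harvested text are escaped automatically.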
What's the difference between data mining and data harvesting?
Data mining typically refers to analyzing large datasets, often with AI or machine learning, to uncover hidden trends or statistics that traditional analysis methods may miss. Data harvesting is closely related, but focuses on collecting data from online sources so it can be analyzed or reused. Data harvesting and data mining often go hand in hand, with harvesting gathering the data to be mined.
How can data harvesting be used for data analytics?
Analytics are only as good as the data analyzed. DCL’s data harvesting services streamline the collection, validation, and structuring process so analytics are faster and more reliable.
Is web scraping the same as data harvesting?
Web scraping is a common term for crawling websites and downloading their contents. At DCL, we differentiate our data harvesting from simple web scraping by also incorporating machine learning and natural language processing to ensure the final output is well structured and ready for reuse. In casual conversation, terms like web scraping, web mining, data scraping, and data extraction are often used interchangeably.
Can DCL harvest data in languages other than English?
Yes, we can harvest data in European and Asian languages.
How is harvested data cleaned and checked for errors?
At every stage of the data harvesting process, DCL uses a combination of human and machine validation to verify the quality of the collected data. Our system flags errors so they can be quickly corrected. High-quality, standardized data is DCL’s specialty.
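The machine side of such a check can be as simple as a set of rules that flag suspect records for human review. The sketch below is one possible shape for that step, not DCL's actual validation system; the required fields (`title`, `url`, `body`) and the function names are hypothetical.

```python
from urllib.parse import urlparse

REQUIRED_FIELDS = ("title", "url", "body")  # hypothetical schema for this example


def validate_record(record):
    """Return a list of human-readable problems found in one harvested record."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    url = record.get("url")
    if isinstance(url, str) and url.strip():
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append(f"malformed url: {url}")
    return problems


def flag_errors(records):
    """Map record index -> list of problems, for records that need human review."""
    flagged = {}
    for index, record in enumerate(records):
        problems = validate_record(record)
        if problems:
            flagged[index] = problems
    return flagged
```

Records that pass every rule are left out of the result, so a reviewer sees only the entries that need correction.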