
Data Acquisition

In data science, the most difficult and most important part is undoubtedly acquiring and preprocessing data. Unfortunately, few classes teach what data exists in the world and how to obtain it.

Mark  Classification
😀  Good Accessibility
😡  Poor Accessibility
🔰  Beginner Recommended
👨‍🎓  Expert Recommended
👍  Highly Recommended
  • Handing out 👍 indiscriminately would make it meaningless, so it is given to only one in every 10 posts to keep it rare.

Judging by the marks alone, it might seem that you should simply follow the 👍, but the world of data is not that straightforward. Data that perfectly matches what you want to do usually does not exist, and when it falls short you often have no choice but to swallow your disappointment and use what is available. Having several similar datasets as alternatives is better than having one, and even bad data is better than none.

Structured Data

Time Series

Korean Local Governments

  • D-Data Hub 😀: Provides public data for the Daegu region, with over 4,000 datasets and over 13,000 services. (2021.06.08)
  • Changwon City Big Data Portal: Provides 12 categories and 172 datasets from the Changwon region, as well as services like big data studios and business analysis. (2021.07.30)

Unstructured Data

  • AI Hub 👨‍🎓: Provides data for AI training. Covers various formats such as images, videos, text, audio, 3D, and sensor data in fields including speech/natural language, vision, healthcare, autonomous driving, safety, agriculture/fisheries, national land and environment, and education. (2021.07.14)
  • kaggle 😀🔰: The world’s most famous open data hub, offering countless diverse datasets and hosting many smaller competitions. (2021.07.15)
  • KDX Korea Data Exchange 😡👨‍🎓: Unlike typical data hubs, it is a company that sells data for a fee. Because it is paid, it maintains the highest level of quality and quantity of data suited to Korea’s circumstances, and many datasets are also listed free of charge. (2021.08.06)

Network

  • SEES:lab 👨‍🎓: Network data for airports, email, etc. is cleanly organized. (2021.12.31)
  • Stanford Network Analysis Project 👨‍🎓: A network analysis and mining library maintained by Stanford University that also provides a collection of large-scale network datasets. (2022.01.04)
  • OpenFlights: Provides data on world airports and aviation networks. It requires some preprocessing, but network data at this scale is surprisingly hard to find. (2022.01.10)
  • Mark Newman Network Data 😡: You can access the legendary Mark Newman’s network datasets. 23 types of networks related to published research are available. (2022.01.10)
  • World Pop: Provides data on global aviation networks, international migration statistics, urbanization, age and gender structure, and more. (2022.01.04)
  • Web of Life: Provides ecosystem network data for parasitism, mutualism, predation relationships, and more. (2024.07.30)
  • Network Data Repository: Provides thousands of diverse networks across over 30 topics. (2024.08.01)
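
The OpenFlights entry above mentions that some preprocessing is needed before the data can be used as a network. A minimal sketch of that step follows, assuming the routes file is header-less CSV with nine columns (airline, airline ID, source airport, source airport ID, destination airport, destination airport ID, codeshare, stops, equipment) and that missing values are written as \N. The column names, the routes_to_edges helper, and the sample rows are illustrative assumptions, so check them against the actual download before relying on them.

```python
import csv
import io
from collections import Counter

# Assumed column order of an OpenFlights-style routes file (no header row).
ROUTE_COLUMNS = [
    "airline", "airline_id",
    "source", "source_id",
    "dest", "dest_id",
    "codeshare", "stops", "equipment",
]
MISSING = r"\N"  # assumed marker for missing values


def routes_to_edges(text):
    """Turn raw route rows into a weighted, directed edge list.

    Returns a Counter mapping (source, dest) airport codes to the
    number of route rows found for that pair.
    """
    edges = Counter()
    reader = csv.DictReader(io.StringIO(text), fieldnames=ROUTE_COLUMNS)
    for row in reader:
        src, dst = row["source"], row["dest"]
        # Skip rows with missing endpoints (short rows yield None) and self-loops.
        if not src or not dst or MISSING in (src, dst) or src == dst:
            continue
        edges[(src, dst)] += 1
    return edges


# Illustrative sample rows in the assumed format (hypothetical, not real data).
sample = "\n".join([
    r"2B,410,AER,2965,KZN,2990,,0,CR2",
    r"2B,410,KZN,2990,AER,2965,,0,CR2",
    r"ZZ,\N,AER,2965,KZN,2990,,0,320",
    r"ZZ,\N,\N,\N,KZN,2990,,0,320",
])
edges = routes_to_edges(sample)
for (src, dst), weight in sorted(edges.items()):
    print(src, dst, weight)
```

Counting repeated airport pairs as an edge weight is just one possible choice; depending on the analysis, an unweighted or undirected network may be more appropriate.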

Geographic Information

  • ITS National Traffic Information Center 😀👨‍🎓: Provides Korean traffic flow, construction and accident information, CCTV, traffic forecasting, vehicle sensors, VMS, traffic safety assistance, variable speed signs, vulnerable section information, and nationwide standard node links. (2021.08.03)
  • 👍 GIS DEVELOPER 👨‍🎓: A blog run by Kim Hyung-jun, a GIS expert and developer. It’s no exaggeration to say that for any project using Korean data, it’s impossible to accomplish anything without this person’s help. (2023.01.10)
  • Administrative Standard Code Management System: Not geographic information in itself, but it provides the all-important list of 'legal district codes' that correspond to geographic information. (2024.08.23)
  • ISO3 by Country and Latitude-Longitude Data: Provides ISO codes and latitude-longitude coordinates for countries around the world. (2025.05.16)
