Building on the business case described in part 1, this post presents a selection of relevant public datasets that can enrich data repositories covering the lifecycle phases of connected cars. Specifically, we look at datasets that cover both Automotive Manufacturing and Connected Cars and that address two perspectives: multi-stakeholder and cybersecurity & safety (see Figure below).
The multi-stakeholder perspective brings together stakeholders with different roles and interests that make use of a shared ecosystem, e.g. car designers and manufacturers, road traffic and weather condition services, road services (including highways, crossings, parking areas), buildings (and bus stops), public vehicles (buses and trams), cyclists, pedestrians, etc. Furthermore, to keep such an ecosystem continuously secure and safe, the data must be collected and processed in a controlled and responsible way that applies data protection (e.g. data minimisation, data anonymisation/pseudonymisation, GDPR) and data security methods (e.g. data encryption, data integrity).
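To make the data protection and data integrity methods mentioned above more concrete, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not a compliance recipe: the field names (`vehicle_id`, `speed_kmh`, `timestamp`, `driver_name`), the two keys, and the helper functions are all hypothetical and do not come from any of the datasets listed in this post.

```python
import hashlib
import hmac
import json

# Hypothetical keys for illustration only; real keys would come from a key
# management system and would never be hard-coded.
PSEUDONYM_KEY = b"hypothetical-pseudonymisation-key"
INTEGRITY_KEY = b"hypothetical-integrity-key"

# Hypothetical list of fields the analytics actually needs (data minimisation).
ALLOWED_FIELDS = {"vehicle_id", "speed_kmh", "timestamp"}


def pseudonymise(identifier: str, key: bytes = PSEUDONYM_KEY) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256).

    The same input always maps to the same pseudonym, so records can still
    be linked, but the original value is not stored.
    """
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def minimise(record: dict) -> dict:
    """Drop fields that are not needed and pseudonymise the vehicle ID."""
    kept = {k: v for k, v in record.items() if k in ALLOWED_FIELDS}
    if "vehicle_id" in kept:
        kept["vehicle_id"] = pseudonymise(str(kept["vehicle_id"]))
    return kept


def integrity_tag(record: dict, key: bytes = INTEGRITY_KEY) -> str:
    """Compute an HMAC over a canonical JSON form of the record.

    Recomputing the tag later and comparing it reveals any modification.
    """
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(record: dict, tag: str, key: bytes = INTEGRITY_KEY) -> bool:
    """Check a record against a previously computed integrity tag."""
    return hmac.compare_digest(integrity_tag(record, key), tag)
```

A record would first pass through `minimise`, then be stored together with its `integrity_tag`; `verify` returns `False` if the stored record is changed afterwards. Encryption is deliberately left out of this sketch, since it requires a vetted cryptographic library rather than hand-written code.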

The selected cybersecurity- and multi-stakeholder-centered data model for Digital Twins, shown in the figure above, includes the following datasets, which together address the lifecycle phases of a connected car.
Datasets for the initiation phase.
- The milling datasets were acquired from experiments on a milling machine run at different speeds, feeds, and depths of cut. Available from: https://ti.arc.nasa.gov/c/4/
- The Mercedes-Benz Greener Manufacturing dataset includes an anonymized set of variables, each representing a custom feature in a Mercedes car. Available from: https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/data
- The NIST manufacturing robotics testbed includes robotics data for advanced manufacturing and material handling. Available from: https://catalog.data.gov/dataset/manufacturing-robotics-testbed-c184f
- The pollution dataset contains measured values of various pollutants: lead, carbon, nitrogen dioxide, ozone, etc. Available from: https://www.kaggle.com/ruvenguna/all-data-to-use#total-output-in-manufacturing-by-industry-annual.csv
Datasets for the operational phase.
- The Berkeley DeepDrive BDD100k is one of the largest and most diverse datasets for self-driving cars, containing over 100K videos of driving experiences enhanced by geographic, environmental, and weather diversity. Available from: https://bdd-data.berkeley.edu/
- ApolloScapes is a large dataset collected from 26 types of stakeholders, e.g. cars, bicycles, pedestrians, buildings, etc. Available from: https://github.com/ApolloScapeAuto/dataset-api
- The KITTI dataset includes online benchmarks for visual odometry, image tracking, and semantic segmentation. Available from: https://www.kaggle.com/kerneler/starter-kitti-vehicles-96b4f3f8-8
- The Traffic, Driving Style and Road Surface Condition datasets include attributes for predicting road surface, traffic and driving style. Available from: https://www.kaggle.com/gloseto/traffic-driving-style-road-surface-condition
- The Automotive Sensor Data were collected during 35 trips made by one driver in one vehicle. Available from: https://bit.ly/2WCZknd
Datasets for the maintenance phase.
- Production Plan Data for Condition Monitoring includes features of several components for which the predictions need to be performed. Available from: https://bit.ly/2WCZknd
- The pollution dataset, available from: https://www.kaggle.com/ruvenguna/all-data-to-use#total-output-in-manufacturing-by-industry-annual.csv
- Industrial datasets for Software Fault Prediction (SFP), e.g. NASA MDP (http://mpd.ivv.nasa.gov) and PROMISE (http://promise.site.uottawa.ca/SERepository/).
End-of-life phase.
Unfortunately, public datasets related to end-of-life and recycling are missing at this stage.
Datasets to support cybersecurity and safety validations.
- The ADFA Intrusion Detection Datasets provide data for evaluating Host-based Intrusion Detection Systems (HIDS). Available from: https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-IDS-Datasets/
- The Cyber Research Center Datasets (ITOC CDX) provide a comprehensive set of log data collected during attacks. Available from: https://drive.google.com/uc?id=0B0u9Tg7udaAXaUFHRFpQWjR0dW8&export=download
- The NSL-KDD Benchmark Dataset helps researchers to compare different intrusion detection methods. Available from: https://www.unb.ca/cic/datasets/nsl.html
- The DARPA Intrusion Detection Datasets include an offline evaluation and a real-time evaluation, based on a larger sample of training data about network-based attacks. Available from: https://archive.ll.mit.edu/ideval/data/1998data.html
The next step is to aggregate the selected datasets and to prepare the data for analytics.
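As a starting point for that aggregation, the dataset list above can be captured in a small catalogue keyed by lifecycle phase. The Python sketch below is only an illustration: the `DatasetEntry` class, the phase labels, and the helper functions are hypothetical names introduced here, and only a few of the datasets from this post are included to keep it short.

```python
from dataclasses import dataclass

# Phase labels follow the groupings used in this post.
PHASES = [
    "initiation",
    "operational",
    "maintenance",
    "end-of-life",
    "cybersecurity and safety",
]


@dataclass(frozen=True)
class DatasetEntry:
    """One public dataset, tagged with the lifecycle phase it supports."""
    name: str
    phase: str
    url: str


# A subset of the datasets listed above; names and URLs are taken from the post.
CATALOG = [
    DatasetEntry("Milling datasets", "initiation",
                 "https://ti.arc.nasa.gov/c/4/"),
    DatasetEntry("Berkeley DeepDrive BDD100k", "operational",
                 "https://bdd-data.berkeley.edu/"),
    DatasetEntry("Production Plan Data for Condition Monitoring", "maintenance",
                 "https://bit.ly/2WCZknd"),
    DatasetEntry("NSL-KDD Benchmark Dataset", "cybersecurity and safety",
                 "https://www.unb.ca/cic/datasets/nsl.html"),
]


def group_by_phase(entries):
    """Return a dict mapping every known phase to its list of datasets."""
    grouped = {phase: [] for phase in PHASES}
    for entry in entries:
        if entry.phase not in grouped:
            raise ValueError(f"unknown phase: {entry.phase!r}")
        grouped[entry.phase].append(entry)
    return grouped


def phases_without_data(grouped):
    """List the phases for which no dataset has been registered."""
    return [phase for phase, items in grouped.items() if not items]


if __name__ == "__main__":
    grouped = group_by_phase(CATALOG)
    for phase, items in grouped.items():
        print(phase, "->", [item.name for item in items])
    print("missing:", phases_without_data(grouped))
```

Keeping every phase as a key, even when its list is empty, makes coverage gaps such as the missing end-of-life datasets explicit before any data is downloaded or merged.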