Ocean Data and AI for species conservation

OCEAN DATA AND AI FOR SPECIES CONSERVATION | OCTOBER 2022

Abstract

The problem, and at the same time our motivation, is the loss of species diversity in the ocean, which is often also referred to as "invisible dying". With our approach, we pursue the goal of making this dying visible and thus preventable. To make events in the ocean visible, we need to identify patterns depending on the ocean depth and then recognize deviations from these patterns. Together with the Norwegian Lofoten-Vesteralen institute, we are analyzing ocean data to help detect anomalies. This should enable a better understanding of the ocean ecosystem and help to identify the consequences of human intervention in nature, such as dwindling fish stocks.

Introduction

Accurate observation of ecosystems enables detailed oceanographic research, allowing anomalies to be identified in enormous amounts of data with the help of artificial intelligence (AI). The Lofoten-Vesteralen (LoVe) Ocean Observatory is located west of Hovden, Vesteralen, in the northern part of Norway, in an ecological, geological, oceanographic and economic "hotspot". A network of submarine cables and seven sensor nodes covers a cross-section from the mainland to the deep sea. It includes a land-based station and seven sensor platforms, covering a gradient from sea level to a depth of 200 m. The system has been active since 2013 and continuously provides valuable online data on the marine environment in northern Norway. The system
is both a national research infrastructure for basic and applied research and a test infrastructure where industry partners can test new underwater sensors and technologies. The Lofoten-Vesteralen Ocean Observatory has collected over 100 terabytes of sensor data (temperatures, currents, echograms) over the years.

The team

Thomas Ramm is a Software Engineer at Capgemini. He created the initial infrastructure and the GitHub integration.

Nils Olav Handegaard is a researcher at the LoVe Ocean Observatory. His research focuses on the application of new methodologies and data processing techniques to the fields of marine ecology and fisheries oceanography.

Majed Alaitwni is a software developer. The focus of his bachelor thesis was interactive visualization for anomaly detection in ocean measurement data. He created the visualization.

Daniel Friedmann is a software developer at Capgemini. He is an expert in containerization and Docker and provides content support for the project.

Sophie Bader is a molecular biologist specializing in oceanic ecosystems, as well as a software developer. She assisted with the infrastructure.

Geir Pedersen is a researcher at the LoVe Ocean Observatory and supports the project on the Norwegian side.

Mustapha Mustapha is a software developer at Capgemini. He has done planning work and helped design the original AI model.

Eldar Sultanow is Enterprise Architect Director at Capgemini. His main focus is on modern software architectures, digitalization and enterprise architecture management. He developed the code with Thomas Ramm in the initial phase and is now supervisor of the project and oversees the research work.

Tom Hatton is a Data Scientist; his master's thesis explored the use of unsupervised AI models for anomaly detection in high-dimensional ocean measurement data. He continues to develop the AI model.

OceAIn was created as a team name to participate in Capgemini's Global Data Science Challenge (GDSC) 2021. The goal of OceAIn was to develop an AI model that gains new insight into seasonally correlating patterns in ecosystems using the time series data collected by ocean sensors. This should help to build better models and understand the climate of our planet. The AI model processes data of the cross-section from the mainland to the deep sea. The data are collected by four different sensors: (1) a scientific echo sounder that measures directional pulsating sounds in specific areas, (2) a so-called hydrophone, i.e. an underwater microphone that records sounds in the environment, (3) an Acoustic Doppler Current Profiler (ADCP) that detects the speed and direction of ocean currents using the Doppler effect, and finally (4) point sensors that provide real-time physical, biological and chemical observations.

The identification of repeating seasonal patterns and anomalies allows scientists to better monitor the marine environment. This involves widespread exploration of the anomalies and their influencing factors, and drawing conclusions about the bigger picture, such as differences in fish populations, varying current patterns, or the influence of climate change. While large volumes of raw data are difficult to process manually and the results of such manual processing are highly error-prone, AI models allow filtering of this
data for relevant events. In addition, AI enables continuous analysis of incoming data, resulting in a steady stream of data to the researchers.

Architecture of the system

Even though the AI model is the core of OceAIn, there are other components that make up the platform. Our next goal is to make them work together according to the cloud-native architecture concept, which will create a future-proof and flexible data pipeline. The raw data provided by the institute will be collected in one step and cleaned and transformed in the next. These steps will take place in Docker containers, with each type of data (hydrophone, biomass detection, etc.) having its own container in every step. The collaboration of the containers is defined in Apache Airflow, which works on the "Configuration as Code" principle. Airflow allows the definition of infrastructure using DAGs (Directed Acyclic Graphs). It is also worth mentioning that the Docker containers are actually managed by Kubernetes, which in turn is defined by Airflow. Finally, the transformed data is persisted in the form of CSV files. They are processed by the AI to detect anomalies, which are displayed with the data in the form of image files through an interactive web interface.

Table 1. Results of the baseline models

MODEL            F1 SCORE   PRECISION   RECALL   MACRO AVG F1
always true      1.00       0.30        0.47     0.23
always false     0.00       0.00        0.00     0.41
uniform random   0.54       0.40        0.46     0.58
stratified       0.17       0.21        0.19     0.44

Implementation of the AI model

OceAIn includes an AI model that detects anomalies in ocean data. The sensor
data for the AI model are highly variable in the type of information as well as in their duration. In addition, some types of anomalies are only detectable if the data points are considered as interconnected from the beginning. The initial idea was to focus on individual models that could handle different types of data and to later combine the separate models. Due to the variety of data, this approach showed some disadvantages, such as increased complexity in aggregating the individual models. The idea that ultimately prevailed was to use a deep-learning neural network that analyzes all the data in its entirety for anomalies. At first, unsupervised models were supposed to be used, but this approach turned out not to be feasible. The use of unsupervised AI models has the advantage that the training of those models does not rely on the presence of
labeled training data (anomaly vs. normal). However, those models are extremely vulnerable to data noise and corruption [1]. The underlying ocean measurement data are also subject to these characteristics, which are further exacerbated by their large-scale nature and the high data dimensionality that accompanies them. The contribution of labeled data by the researchers has enabled the use of supervised AI models. In general, supervised AI models outperform their unsupervised counterparts in anomaly detection, as they are particularly capable of detecting application-specific anomalies [2]. To check the performance of the unsupervised models later, four baseline models were created first; their results can be seen in Table 1. Baseline models are a common tool in the evaluation of machine learning models and are naive solutions to a classification problem. By comparing the results of a real model with those of the baseline models, conclusions about the correctness of the real model can be drawn. Of the baseline models used here, one always classifies false, one always classifies true, one splits the datasets in half between true and false, and the last stratifies the data based on their labels.

During the model development, two types of models were created: reconstruction-based models and predictive models. For the former, three different microarchitectures were created. The first uses one-dimensional convolutions in each of the hidden layers, where the number of filters is determined by a hyperparameter. The second architecture uses so-called LSTM layers, and the third is a fully connected autoencoder. Predictive models also fall into three categories: these likewise use LSTM and fully connected hidden layers, as well as an architecture that uses convolution and max-pooling.

Table 2. Results of all reconstruction models (model identifiers; dense, LSTM and convolutional architectures, evaluated on clean data and on the full data set):
dense W32 HL400 40; dense W32 HL400 40 v2; dense W8 HL400 40 20 4; lstm W32 HL100 40 20 4; lstm W32 HL100 40 20 4 v2; dense W16 HL100 40 20 4; dense W32 HL100 40 20 4; dense W32 HL2000 1000 100 40 20 4; dense W32 HL400 40; dense W32 HL400 40 v3; lstm W32 HL100 40 20 4; lstm W32 HL100 40 20 4 v2; lstm W32 HL40; lstm W32 HL400 100 10; lstm W32 HL64 32 16; lstm W8 HL100 40 20 4; conv W16 HL1000 100 10; conv W16 HL1000 400 100 4; conv W16 HL100 40 20 4; conv W32 HL100 40 20 4; conv W64 HL100 40 20 4

The naming of the models follows a fixed pattern. First, it is specified which architectural scheme a model corresponds to. Then a "W" and a number are given, representing the window size; "W32" thus describes a neural network with a window size of 32. This is followed by further numbers ("HL"), which indicate the sizes of the hidden layers. Optionally, there is also a version number at the end, which marks models that had delivered promising results in the first instance and were therefore run several times. Since the models described above do not in themselves detect anomalies, there is another component that is responsible for exactly that.

The results of these models were sobering. Hardly any of the
models could exceed an F1 score of 0.5, which means that they were no better than the baseline models with a random division of the values. In fact, there was only one model, "lstm W32 HL128 128 128", which could (minimally) exceed this limit. All tested models can be seen in Tables 2 and 3, while Table 4 shows the average results of the different types of models. There are several reasons for the comparatively poor results of the unsupervised models. The data itself is rather unsuitable for an unsupervised model: gaps, noise, and data corruption greatly degrade the results of unsupervised models.
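The four baseline strategies described above (always true, always false, uniform random, stratified) can be reproduced with standard tooling. The sketch below uses scikit-learn's DummyClassifier; the library choice and the synthetic labels are our assumptions for illustration, not the project's actual evaluation code:

```python
# Sketch of the four baseline comparisons using scikit-learn's
# DummyClassifier. Labels here are synthetic stand-ins for the
# manually marked anomaly windows.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score, precision_score, recall_score

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 8))     # stand-in for windowed sensor features
y = rng.random(1000) < 0.3         # ~30% of windows labeled anomalous

strategies = {
    "always true": DummyClassifier(strategy="constant", constant=True),
    "always false": DummyClassifier(strategy="constant", constant=False),
    "uniform random": DummyClassifier(strategy="uniform", random_state=0),
    "stratified": DummyClassifier(strategy="stratified", random_state=0),
}

for name, clf in strategies.items():
    y_pred = clf.fit(X, y).predict(X)
    print(name,
          round(f1_score(y, y_pred, zero_division=0), 2),
          round(precision_score(y, y_pred, zero_division=0), 2),
          round(recall_score(y, y_pred, zero_division=0), 2),
          round(f1_score(y, y_pred, average="macro", zero_division=0), 2))
```

Comparing a real model's scores against such naive predictors is exactly the comparison that Table 1 summarizes for the actual ocean data.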
Table 3. Results of all prediction models (model identifiers; evaluated on clean data and on the full data set):
dense W32 HL256*5 v2; lstm W32 HL128 128 128; conv W32 HL128 128 128; conv W32 HL128 128 128 v2; conv W32 HL32 32 32; lstm W32 HL128 128 128; conv W32 HL32 32 32

Table 4. Average results of all model types

MODEL                            QUANTILE   F1 SCORE   PRECISION   RECALL   MACRO AVG F1
reconstruction dense clean data  0.73       0.43       0.34        0.73     0.41
reconstruction lstm clean data   0.85       0.41       0.41        0.67     0.40
reconstruction conv full data    0.85       0.46       0.38        0.68     0.50
reconstruction dense full data   0.85       0.28       0.32        0.39     0.46
reconstruction lstm full data    0.85       0.40       0.33        0.62     0.46
forecast conv clean data         0.85       0.47       0.32        0.90     0.35
forecast dense clean data        0.73       0.48       0.32        0.95     0.33
forecast lstm clean data         0.73       0.48       0.32        0.98     0.30
forecast lstm full data          0.73       0.47       0.31        0.94     0.31
forecast conv full data          0.85       0.47       0.32        0.95     0.33

In addition, the
datasets are highly dimensional, with a large number of sensors collecting information simultaneously. The more complex a dataset is, the harder it is to train AI models to produce correct results. Presumably, our models would produce far better results if the data were less complex. Another issue is our limited computational capacity. The compression ratio of the models is high, because adding more layers would have required significantly more computation time, which was not feasible. Lastly, it should be mentioned that extensive hyperparameter optimization has not taken place yet. This could mean that the models themselves are actually better than assumed. For all these reasons, it can be concluded that unsupervised models are unsuitable for anomaly detection in ocean data as it is generated by LoVe. Thus, OceAIn shall use a model that learns in a supervised manner. This model will later be continuously retrained based on data generated by the researchers during operation. A snippet of code for the model is shown in Listing 1. Here, the datasets that deviate significantly from the expected normal are detected. This is done by first iterating over all the datasets, while all datapoints that are
detected as deviations are stored in the array anomalous_data_indices.

    # Detect all the samples which are anomalies.
    anomalies = test_mae_loss > threshold
    print("Number of anomaly samples:", np.sum(anomalies))
    print("Indices of anomaly samples:", np.where(anomalies))

    plt.plot(x_test[0])
    plt.plot(x_test_pred[0], alpha=0.7)
    plt.show()

    anomalous_data_indices = []
    for data_idx in range(TIME_STEPS - 1, len(x_test) - TIME_STEPS + 1):
        if np.all(anomalies[data_idx - TIME_STEPS + 1 : data_idx]):
            anomalous_data_indices.append(data_idx)
    anomalous_data_indices

Listing 1. Code snippet of the anomaly detection module
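The variable threshold is not defined within Listing 1. A plausible choice, consistent with the QUANTILE column in Table 4, is a quantile of the mean absolute reconstruction error on training windows. The sketch below uses synthetic arrays, and all names apart from those appearing in Listing 1 are hypothetical:

```python
# Hypothetical threshold selection for Listing 1: take a high quantile
# (e.g. 0.85, cf. the QUANTILE column in Table 4) of the mean absolute
# reconstruction error over the training windows.
import numpy as np

def mae_loss(windows, reconstructions):
    # Mean absolute error per window, averaged over time steps and channels.
    return np.mean(np.abs(windows - reconstructions), axis=(1, 2))

rng = np.random.default_rng(0)
x_train = rng.normal(size=(500, 32, 4))     # 500 windows, W32, 4 sensor channels
x_train_pred = x_train + rng.normal(scale=0.1, size=x_train.shape)

train_mae_loss = mae_loss(x_train, x_train_pred)
threshold = np.quantile(train_mae_loss, 0.85)

x_test = rng.normal(size=(100, 32, 4))
x_test_pred = x_test + rng.normal(scale=0.1, size=x_test.shape)
test_mae_loss = mae_loss(x_test, x_test_pred)

anomalies = test_mae_loss > threshold       # as in Listing 1
print("Number of anomaly samples:", np.sum(anomalies))
```

In practice the reconstructions would of course come from the trained model rather than from added noise; the quantile then controls how many windows are flagged for the researchers to inspect.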
Visualization of the vast amounts of data

The visualization of the results generated by the AI models and of the associated raw data is done in the form of a web application, whose structure can be seen as a class diagram in Figure 1. The Visualizer class depends on the external framework ThreeJS and is responsible for generating the image files from the CSV files. The Visualizer is also extended with a separate component for each type of sensor data. The class App, in turn, is responsible for initializing the sensor classes and processing user input. On the server side, Python and Docker are used, and various packages are provided that contain the functionality for data synchronization and the creation of measurement data representations. On the client side, when executing the web documents sent by the server, control elements are created in the browser, which convert the user's interactions into corresponding requests. JavaScript and the library Three.js are used for this purpose.

The actual visualization of the measurement data on the client side is done as follows. There are four canvas elements, arranged as shown in Figure 2. They contain the visualized data of different sensors, where each element can be moved on the x-axis, and thus through time. With movement on the y-axis, on the other hand, you move in a sensor-specific space. At the top left, the data from the EK60 sensors are displayed. These sensors can detect how much biomass is present at a certain location. This is visible through the color gradient of the generated image: the brighter an area is, the more biomass was detected at the corresponding location. Vertical movement adjusts the depth of the captured data. At the top right, acoustic signals recorded with hydrophones are displayed; here, only a shift on the x-axis is possible. At the bottom left, from top to bottom, flow velocity, strength and direction are displayed (see also Figure 3). Just like for the EK60 sensors, the vertical position determines the depth of the acquired data, while the brightness of the colors indicates the speed or intensity of the currents. Finally, at the bottom right are the point sensors. These capture different data, such as water composition, salinity, or the presence of certain chemicals, and display them as graphs.

Figure 1. Overview of the main components of the client-side application (classes App, UserinputHandler, Visualizer, EK60, Hydrophon, ADCP and Pointsensors, plus the external framework ThreeJS)
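As described above, the Visualizer produces image files from the sensor CSV files. A minimal sketch of such a rendering step for an EK60-style echogram is shown below, where a brighter pixel corresponds to more detected biomass; the CSV layout, the file names and the helper function are assumptions for illustration, not the project's actual code:

```python
# Hypothetical sketch: render an EK60-style echogram CSV (rows = depth
# bins, columns = time steps, values = biomass/backscatter intensity)
# into a grayscale PNG, so that brighter pixels mean more biomass.
import numpy as np
import matplotlib
matplotlib.use("Agg")               # headless rendering, as on a server
import matplotlib.pyplot as plt

def csv_to_echogram_png(csv_path, png_path):
    grid = np.loadtxt(csv_path, delimiter=",")
    fig, ax = plt.subplots()
    ax.imshow(grid, cmap="gray", aspect="auto", origin="upper")
    ax.set_xlabel("time step")
    ax.set_ylabel("depth bin")
    fig.savefig(png_path, bbox_inches="tight", pad_inches=0.0)
    plt.close(fig)
    return png_path

# Tiny synthetic example: 200 depth bins x 300 time steps.
rng = np.random.default_rng(1)
np.savetxt("echogram_demo.csv", rng.random((200, 300)), delimiter=",")
csv_to_echogram_png("echogram_demo.csv", "echogram_demo.png")
```

Serving pre-rendered images like this, instead of raw measurements, is what keeps the payload to the browser small.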
On the server side, the measurement data is converted from CSV first into JSON and then into image formats (such as PNG and JPEG). In this process, only the truly important data is visualized; the unimportant part is ignored from the start, which means that only a fraction actually needs to be displayed. The conversion of the measurement data into image formats reduces the amount of data sent to the client, so that large time intervals can be visualized and all measurement data information can be displayed. On the client side, the representations created on the server side are retrieved and displayed by the client-side components.

Figure 2 shows the implemented user interface, which displays a section of the measurement data of the EK60 sensors, hydrophones, ADCP sensors and point sensors for a period of three days. Each of the four areas can be controlled by a control bar. The x- and y-positions of the mouse pointer in the visualization area are interpreted by the client without making any requests to the server: the x-position leads to the display of the timestamp for the measurement data displayed below, and the y-position leads to the display of the
y-axis values of the measurement data. For example, the EK60 and ADCP views display the depth values. The ADCP measurement data is also used to calculate the currents, which makes it possible to establish correlations between the water currents and the distribution and movement of biomass from the EK60 measurement data. This is illustrated in Figure 3.

Figure 2. The GUI for visualizing a section of the measurement data of all sensors

Figure 3. Visualization of ADCP measurement data in fullscreen mode

Listing 2 shows the function to_streamplot, which is responsible for the calculation of the flow data. The variable plt represents the external
Python library Matplotlib. The ADCP datapoints are supplied to this function as parameters. With the help of the Matplotlib function streamplot, a flow diagram is generated, which shows the calculated ocean currents. The diagram is saved as an image file, from where it is later displayed on the web interface.

    def to_streamplot(X, Y, u, v, target_file_path, config):
        _, ax = get_fig_ax()
        ax.streamplot(X, Y, u, v, linewidth=config["linewidth"],
                      color=u, density=config["density"],
                      cmap=config["cmap"],
                      arrowstyle=config["arrowstyle"])
        plt.savefig(target_file_path,
                    format=str(target_file_path.suffix).replace(".", ""),
                    bbox_inches="tight", pad_inches=0.0, transparent=True)
        return target_file_path

Listing 2. Code for the calculation of the currents for the visualization

Conclusion

The OceAIn platform allows marine biologists at the LoVe Ocean Observatory to directly access subject-
relevant events from the wealth of collected data. This saves a large amount of manual data analysis, and the time saved can instead be used to drive research.

Outlook

The OceAIn project shows how collaboration between marine biology and IT can help to protect species and increase our understanding of the environment. However, our work on this project is far from complete, and there are several planned features on our roadmap. First, the entire structure, from data collection to processed results, will be automated using containers, as described at the beginning of this article. Another goal is to enable continuous and automated retraining of the AI based on the manually marked anomalies. The visualization itself is also not yet complete; here, for example, there are possibilities for extending the display of the data or the UI features. In addition,
the completed website should soon be continuously active, so that it can be used by the researchers at the LoVe Institute.

References

[1] Raghavendra Chalapathy and Sanjay Chawla. Deep learning for anomaly detection: A survey. arXiv preprint arXiv:1901.03407, 2019.
[2] Charu C. Aggarwal. Time series and multidimensional streaming outlier detection. In Outlier Analysis, pages 273-310. Springer, 2017.

MAJED ALAITWNI, Capgemini, Leipziger Straße 12, Hannover, Deutschland, Majed.Alaitwni
SHIRIN MIAN, Organization, Erzbergerstraße 117, 76133 Karlsruhe, Deutschland, Shirin.M
ELDAR SULTANOW, Capgemini, Bahnhofstraße 30, 90402 Nürnberg, Deutschland, Eldar.S

Copyright 2022 Capgemini. All rights reserved.

About Capgemini

Capgemini is a global leader in partnering with companies to transform and manage their business by harnessing the power of technology. The Group is guided every day by its purpose of unleashing human energy through technology for an inclusive and sustainable future. It is a responsible and diverse organization of over 325,000 team members in more than 50 countries. With its strong 55-year heritage and deep industry expertise, Capgemini is trusted by its clients to address the entire breadth of their business needs, from strategy and design to operations, fueled by the fast evolving and innovative world of cloud, data, AI, connectivity, software, digital engineering and platforms. The Group reported 2021 global revenues of €18 billion.

Get the Future You Want