Data is a fundamental resource in the modern banking sector.
Open Data allows organizations to improve the quality of their data sets and to build a competitive advantage based on new data-driven business models. The banking sector is no exception, since many different scenarios for the use of Open Data collections can be identified. For instance, macroeconomic data, demographic statistics, market indicators, or the characteristics and history of entities such as companies can be very useful in assessing credit risk or in supporting decision-making processes in investment banks.
Automatic data retrieval tools (crawlers) for online sources are an excellent means of satisfying the information needs of many different entities in the banking sector. Research (https://theodi.org/article/using-data-to-take-an-open-approach-to-investment-banking/) shows that the current use of Open Data in the banking sector is insufficient and inefficient. Open Data resources are provided by public entities, communities (so-called crowdsourcing), companies and academic circles. As a result, it is possible to identify Open Data collections with various purposes, structures, contexts and scopes, to compare the quality of collections, or to combine many different open data sources together.
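Combining heterogeneous open data sources typically amounts to a join on a shared key. A minimal sketch follows; the CSV extracts, column names and the country-code join key are hypothetical illustrations, not taken from any specific Open Data collection:

```python
import csv
import io

# Hypothetical Open Data extracts: a company registry and
# macroeconomic statistics, sharing a country code column.
REGISTRY_CSV = """company_id,name,country
C1,Alpha Sp. z o.o.,PL
C2,Beta GmbH,DE
"""

MACRO_CSV = """country,gdp_growth_pct,unemployment_pct
PL,4.1,5.2
DE,1.5,3.1
"""

def load_rows(text):
    """Parse a CSV string into a list of dicts keyed by header."""
    return list(csv.DictReader(io.StringIO(text)))

def enrich_companies(registry_csv, macro_csv):
    """Attach macroeconomic indicators to each company by country."""
    macro_by_country = {r["country"]: r for r in load_rows(macro_csv)}
    enriched = []
    for company in load_rows(registry_csv):
        indicators = macro_by_country.get(company["country"], {})
        enriched.append({**company, **indicators})
    return enriched

for row in enrich_companies(REGISTRY_CSV, MACRO_CSV):
    print(row["name"], row.get("gdp_growth_pct"))
```

In practice each source would first be normalized (country codes, identifiers, units) before such a join, which is where differences in purpose, structure and scope between collections become visible.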
There are many different frameworks for building crawlers, as well as many different source characteristics. For example, data might be accessed via an API, published as a structured file (e.g., CSV) or an unstructured file (e.g., PDF), or presented directly on an HTML website. Each of these methods requires a different crawler. An important challenge is finding an optimal interval for re-crawling the same source: it should reflect the source's variability (how often it is updated) as well as the priority of the information needs fulfilled by that source.
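The re-crawl interval idea can be sketched as a simple adaptive rule: shorten the interval when a visit finds the source changed, back off when it does not, and clamp the result within priority-driven bounds. The multiplicative factors and the default bounds below are assumptions for illustration, not values from the abstract:

```python
def next_interval(current_hours, changed, min_hours=1, max_hours=168):
    """Adapt the re-crawl interval to observed source variability.

    Halve the interval when the source changed since the last visit,
    otherwise back off multiplicatively; min/max bounds can encode the
    priority of the information need (high priority -> tighter bounds).
    """
    if changed:
        interval = current_hours / 2
    else:
        interval = current_hours * 1.5
    return max(min_hours, min(interval, max_hours))

# Example: a source checked daily that turned out to have changed
# is revisited sooner; an unchanged one is revisited later.
print(next_interval(24, changed=True))    # shorter interval
print(next_interval(24, changed=False))   # longer interval
```

This is the same backoff intuition used by polite crawling schedulers; a production system would additionally honor HTTP caching headers (e.g., Last-Modified/ETag) where the source provides them.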
Crawlers and Open Data thus provide the means to develop new business models in the banking sector, e.g., in corporate or investment banking. Examples of such business models will be discussed during the presentation.
The presenters are authors of the report entitled “New material – Open Data resources for the Polish economy”, prepared for the Ministry of Economic Development (https://www.gov.pl/web/rozwoj/raport-nowy-surowiec-otwarte-zasoby-danych-dla-polskiej-gospodarki). The results of a project conducted by the Aigocracy Institute and the Polish Bank Association will also be discussed during the presentation. The project’s goal was to identify possibilities for using methods from the areas of Data Science and Artificial Intelligence in the development of tools for cybersecurity in the banking sector.
With data being the new oil, implementing Data Science has become critical for any organization. To successfully create value from data, it is necessary to systematically support the entire Data Science Lifecycle, from discovery through development to operations, in an integrated and applicable way.
A broad spectrum of modern methods and tools has been suggested in recent years, but Data Science projects still fail. We identified two problems. First, existing methods suggest high-level activities, but what people actually do in daily business is different; thus, methods lack applicability. Second, methods are incomplete and do not cover the entire Data Science Lifecycle. To tackle these problems, we suggest tightly interconnecting methods with roles and tools. Furthermore, we propose adding techniques to methods that establish practice and increase applicability for practitioners. To close these gaps, we will first present our ongoing effort to design a comprehensive method that supports the entire Data Science Lifecycle.
Subsequently, we will demonstrate our work in progress on roles in Data Science. One omnipresent role in the equally named discipline is the Data Scientist. Praised as “the sexiest job of the 21st century” (Harvard Business Review, 2012), the Data Scientist is treated as a unicorn who can solve all problems on his or her own. We believe that the expectations regarding the competencies of a Data Scientist are unrealistic. With the ever-growing velocity, volume and variety of data, the complexity and the need for parallel processing increase. Data Science is shifting from an individual to a team effort which needs to be orchestrated along the Data Science Lifecycle. Therefore, we argue for the existence of different specializations (called roles) among data scientists. In an interview series with industry partners, we identified the following roles: Data Engineer, Data and Business Analyst, Software Developer, Operations Engineer, Product Owner, Data Analytics Architect and related governance roles. We distinguish between executing and enabling roles. For example, roles like the Data Analytics Architect, Product Owner or Software Developer support and ensure the long-term return on Data Science activities in organizations instead of executing the actual analytics process themselves. Enabling roles further align Data Science teams with the business and IT by understanding the needs for collaboration along the Data Science Lifecycle. This scattered set of specializations means that companies do not know whom to hire and how to integrate these roles effectively into the business. Additionally, there are not enough Data Scientists on the market, and business users in organizations lack Data Science knowledge. In summary, roles in Data Science and their forms of collaboration are not holistically understood along the Data Science Lifecycle.
To address this lack of understanding, we present a comprehensive role concept interconnected with our method for the discovery, development and operation of the Data Science Lifecycle. With this concept, we provide guidelines for staffing, connecting and building up Data Science teams.
Finally, tools should be interconnected with methods and roles. For example, Microsoft proposed the Team Data Science Process, a method extending CRISP-DM. To increase applicability, the Microsoft Azure stack is tightly coupled with this process and offers a broad spectrum of specific tools for the different activities performed within the method. With regard to project management, Scrum is implemented using Azure Boards, which can directly link code with features for implementing entire epics. For data exploration or machine learning in the cloud, the Azure Data Science Virtual Machine is a pre-configured image providing dedicated capabilities for data scientists in a team. However, this is a rather Microsoft-centric approach; many other vendors, such as Amazon and Google, also offer Data Science stacks. Thus, by benchmarking the Data Science tool market, we explore existing tool capabilities on a more generic and conceptual level and spot possible gaps in required capabilities in tight relationship to methods and roles.