Upstream oil and gas is the segment of the petroleum industry that finds and extracts crude oil and natural gas via a network of wells and pumps. Two major challenges for this industry are the inefficiencies and equipment failures that occur when wells are operated in a less than optimal way, and the production that is temporarily stalled while failed parts and equipment await repair. A successful predictive maintenance strategy can boost a well’s oil and gas production rate and revenue by alerting well operators to the opportune moments for a graceful shutdown for maintenance and repairs before catastrophic equipment failure happens.

In this article, I develop a toy model simulation to assess the potential benefits of a predictive maintenance strategy applied to upstream oil and gas.

Predictive Maintenance

Predictive maintenance (PdM) is the use of data and algorithms to optimize high-value machinery that manufactures, transports, generates, or refines products. Typically, one outfits the high-value machines (robots on the factory floor, a network of oil/gas wells, a fleet of vehicles, etc.) with numerous sensors that emit telemetry, which is the collective stream of measurements that each machine’s many sensors emit over time. That telemetry characterizes each machine’s state (e.g., temperature, pressure, rpm, operational settings) at all moments during its operational history. The expectation is that a machine learning algorithm, when trained on historical telemetry emitted by many related machines over time, can alert the machine operator with sufficient time to send a repair technician to effect a repair that averts catastrophic equipment failure later. That ML model, known as a remaining useful lifetime (RUL) model, is a textbook classic, quite easy to build, and can achieve high accuracy when the historical telemetry is both abundant and diagnostic. However, an RUL model built only on telemetry data is still fairly useless, as it merely tells a machine operator when a given machine is likely to fail. It does not indicate which particular failure mode is looming (e.g., broken valve, coolant leak, or seized bearing), so the machine operator does not know how to mitigate the pending failure.

A successful PdM strategy also requires an additional key piece of information: the machines’ repair logs. When a high-value machine suffers a failure, that machine’s production is halted until a repair technician arrives at the machine, diagnoses its failure mode, repairs the machine, and returns the machine to production. PdM also requires two key facts to be recorded in that machine’s repair log: the failure timestamp, as well as the repair technician’s diagnosis, which is an abbreviated label for a widely-understood sequence of repair steps that result in specific parts being cleaned/repaired/replaced/tested. Examples include a "brake job" which on an automobile would replace a wheel’s pads, rotors, and caliper, while a "tune up" replaces any worn elements in an auto’s fuel system, spark plugs, various valves and filters, and coolant as needed.

The data scientist who wishes to develop a PdM strategy for a suite of machines would use the timestamp columns in the above data to join the machines’ historical telemetry to their repair logs. With that joined dataset, which now includes historical timestamped telemetry as well as a labelled diagnosis column, the data scientist can compute, for each machine at every moment, the time until that machine will suffer any of the noteworthy failure modes listed in the diagnosis column. For each of those noteworthy failure modes (which are designated as such due to frequency, hazardousness, or expense), the data scientist could then build an RUL regression model, one per failure mode, to predict when a given machine will suffer that particular failure. But even simpler models are possible, such as a binary classifier trained to predict whether a machine will or will not experience a given failure mode some time interval Δt into the future. And because a simpler binary classifier tends to be easier to build/test/deploy than a more complex regressor, a suite of binary classifiers will be used in the PdM simulations that are detailed below.
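The labelling step described above can be sketched as follows. This is a minimal standard-library illustration for one machine and one failure mode, not the code used in this article's notebooks; the function names and the sample timestamps are hypothetical.

```python
from bisect import bisect_left

def time_to_next_failure(telemetry_times, failure_times):
    """For each telemetry timestamp, return the time until the next logged
    failure of one failure mode, or None if no later failure was logged."""
    failures = sorted(failure_times)
    result = []
    for t in telemetry_times:
        i = bisect_left(failures, t)  # index of the first failure at or after t
        result.append(failures[i] - t if i < len(failures) else None)
    return result

def will_fail_within(time_to_failure, delta_t):
    """Binary label: True when a failure occurs within delta_t of the record."""
    return [ttf is not None and ttf <= delta_t for ttf in time_to_failure]

# Hypothetical data for a single machine and a single failure mode.
telemetry_times = [0, 100, 200, 300]
failure_times = [250]

ttf = time_to_next_failure(telemetry_times, failure_times)
labels = will_fail_within(ttf, delta_t=100)
print(ttf)     # [250, 150, 50, None]
print(labels)  # [False, False, True, False]
```

The `ttf` column would be the target of an RUL regressor, while `labels` would be the target of the simpler binary classifier.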

Toy Model

One challenge facing PdM practitioners is a lack of data. Although there are a handful of telemetry samples available on the internet, there are none (that I’m aware of) that also include repair data. Additionally, petroleum firms are not motivated to share any of their telemetry and repair data, so unless you are a data scientist with access to your firm’s in-house telemetry/repair data, the only other recourse is to generate mock data on which to prototype a PdM strategy. This is done below with a toy model for upstream oil & gas.

A toy model is a simplified set of rules or equations that are sufficient to describe and understand the key properties of a more complex object. A significant drawback of the toy model approach developed below is that it will be ignorant of the fluid dynamics, engineering, and geophysical principles that a real oil/gas well operates by. But a toy model’s great advantage is that, by altering model parameters and rerunning, one can swiftly explore a suite of PdM strategies, whereas a huge amount of time, money, and effort would be required to build and execute a physics-based model of oil/gas wells.

RTF Simulation

The toy-model oil and gas simulator and Jupyter notebooks discussed below will soon be open-sourced and available online, and this posting will link to that code when available. The initial toy-model simulation of upstream oil and gas is executed at the command line via


which passes the simulation’s various input parameters stored in over to the script which uses those parameters to generate the simulation’s mock telemetry and repair data. Take a look at for a brief description of the simulation’s many parameters. The most important input parameters are

N_devices = 1000
N_timesteps = 50000
strategy = 'rtf'
N_technicians = 100
repair_duration = 100
sensor_sigma = 0.01
pdm_threshold_time = 400
maintenance_duration = 25

which indicates that the PdM simulator will simulate the evolution of 1,000 virtual devices (aka wells) for 50,000 timesteps. A timestep is a single tick of the simulation’s virtual clock, and all well properties are updated every timestep. The setting strategy='rtf' indicates that the PdM simulator is executed in run-to-fail (RTF) mode, which means that all wells operate and produce petroleum until they suffer a fatal issue, at which point production ceases until one of the N_technicians=100 available technicians arrives at the well, diagnoses the issue, and then repairs the well, all of which takes repair_duration=100 timesteps before the well resumes production. The pool of maintenance technicians is a limited and valuable resource; if too many wells fail too rapidly, then the pool of technicians becomes oversubscribed and failed wells experience greater production losses while waiting for the next available service technician.
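To make the run-to-fail loop and the limited technician pool concrete, here is a heavily simplified sketch. It is not the simulator discussed in this article: it assumes a constant per-timestep failure probability (the hypothetical `fail_prob`), ignores crud and the (P,T,L) settings, and tracks only uptime and technician utilization.

```python
import random
from collections import deque

def simulate_rtf(n_devices=50, n_timesteps=2000, n_technicians=5,
                 repair_duration=100, fail_prob=0.001, seed=0):
    """Toy run-to-fail loop; returns (mean uptime fraction, mean technician utilization)."""
    rng = random.Random(seed)
    up = set(range(n_devices))   # devices currently producing
    waiting = deque()            # failed devices queued for a technician
    repair_done_at = {}          # device -> timestep at which its repair finishes
    up_ticks = 0
    busy_ticks = 0
    for t in range(n_timesteps):
        # Technicians finish repairs and the devices resume production.
        for dev in [d for d, t_done in repair_done_at.items() if t_done <= t]:
            del repair_done_at[dev]
            up.add(dev)
        # Producing devices may fail and join the repair queue.
        for dev in sorted(up):
            if rng.random() < fail_prob:
                up.discard(dev)
                waiting.append(dev)
        # Free technicians take the devices that have waited longest.
        while waiting and len(repair_done_at) < n_technicians:
            repair_done_at[waiting.popleft()] = t + repair_duration
        up_ticks += len(up)
        busy_ticks += len(repair_done_at)
    return (up_ticks / (n_devices * n_timesteps),
            busy_ticks / (n_technicians * n_timesteps))

uptime, utilization = simulate_rtf()
print(f"mean uptime fraction: {uptime:.3f}, technician utilization: {utilization:.3f}")
```

Rerunning with a smaller `n_technicians` shows the oversubscription effect: failed devices spend longer in the `waiting` queue and the mean uptime drops.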

The following shows the production efficiency experienced by one simulated well, plotted versus time:

Figure 1: Plot of simulated well #123’s production efficiency over time. Small drops in production are due to the accumulation of virtual crud, while outages are due to the indicated fatal issue.

This simulated well’s production efficiency starts at unity but slowly degrades over time until production is interrupted by one of three mock issues: cracked_valve, broken_gear, and jammed_rotor. A real oil/gas well is of course prone to suffer a much wider variety of fatal and non-fatal issues, but the toy-model approach used here only attempts to simulate the consequences of three make-believe fatal issues. The rate at which the virtual wells suffer those mock issues is controlled by the coefficient parameter found in the issues dictionary in,

issues = {
    'crud':         {'ID':0, 'coefficient':0.100000,   'fatal':False},
    'jammed_rotor': {'ID':1, 'coefficient':0.000080,   'fatal':True },
    'cracked_valve':{'ID':2, 'coefficient':0.000010,   'fatal':True },
    'broken_gear':  {'ID':3, 'coefficient':0.000002,   'fatal':True },
}

and any changes to any coefficient will increase or decrease the frequency at which a well experiences the corresponding issue. Also note that the crud issue is non-fatal. Rather, it is a virtual substance that a well accumulates while operating and over time causes the modest production drops seen in Fig 1.

Execution of this RTF simulation takes about 15 minutes to complete, and two compressed output files are written: the telemetry emitted by the wells' sensors, and the repair logs recorded by the service technicians. To inspect the first few lines of the 25 million-record telemetry data,

gunzip -c data/telemetry_rtf.csv.gz | head -20

which will yield something like


which tells us that well #1 had a temperature T=0.02476 and a production_efficiency=0.9971612 at time=1. Each virtual well has three virtual sensors that measure its pressure P, temperature T, and load L, and at every simulation timestep these quantities random-walk away from their optimum setting where (P,T,L)=(0,0,0), which is referred to as the well’s sweet spot since those are a well’s optimum settings where production is maximal and issues are rarest. The rate at which a well’s (P,T,L) settings vary over time is controlled by the sensor_sigma model parameter that is set in, which is 0.01 in this simulation, and changes to that parameter alter the rate at which each well random-walks. That random walk is illustrated in the pseudo-3D plot below showing the trajectories of three wells as their settings random-walk across the (P,T,L) parameter space and away from their sweet spot at +.


Figure 2: Trajectories of 3 simulated wells as they random-walk across the (P,T,L) parameter space.
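A random walk of this kind can be sketched in a few lines. The sketch below assumes that each of P, T, and L takes an independent Gaussian step of standard deviation sensor_sigma at every timestep; the article does not state the step distribution, so that choice and the function name are assumptions.

```python
import random

def random_walk(n_timesteps, sensor_sigma=0.01, seed=0):
    """Return the list of (P, T, L) settings visited, starting at the sweet spot."""
    rng = random.Random(seed)
    p = t = l = 0.0
    path = [(p, t, l)]
    for _ in range(n_timesteps):
        # Assumed step rule: independent Gaussian nudges to each setting.
        p += rng.gauss(0.0, sensor_sigma)
        t += rng.gauss(0.0, sensor_sigma)
        l += rng.gauss(0.0, sensor_sigma)
        path.append((p, t, l))
    return path

# Three hypothetical wells, each with its own random seed.
for well_id in range(3):
    final_p, final_t, final_l = random_walk(1000, seed=well_id)[-1]
    print(f"well {well_id}: P={final_p:+.3f} T={final_t:+.3f} L={final_l:+.3f}")
```

For an unbiased walk like this one, the typical distance from the sweet spot along each axis grows as sensor_sigma times the square root of the number of timesteps, which is why a larger sensor_sigma makes wells drift faster.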

The other output generated by the RTF simulation is the virtual wells’ repair log; that compressed file contains about 40 thousand records that can be inspected via

gunzip -c data/repairs_rtf.csv.gz | head

which will yield something like

141| 46|jammed_rotor|29|-0.023390762068|-0.22793915479|-0.074207414842|0.97591
149|941|jammed_rotor|77| 0.073194796329| 0.17941953011|-0.084087349562|0.97887
161|882| broken_gear|33|-0.427060000973| 0.02374669927| 0.271239440230|0.94935

which tells us that at time=141, deviceID=46 failed, and that technicianID=29 arrived at that well and diagnosed its failure as being due to a jammed_rotor, with the remaining fields recording that well's P, T, L, and production_efficiency at the moment of failure. When that technician’s repairs are completed repair_duration=100 timesteps later, that well’s settings are returned to its sweet spot with any crud removed, and production resumes at 100% efficiency.
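The records shown above can be parsed with Python's standard csv module. This sketch parses the pipe-delimited rendering printed above; the field order is inferred from the description in the text, and the delimiter used in the actual compressed file may differ, so treat both as assumptions.

```python
import csv
import io

# Field names taken from the text above; their order is inferred from the sample rows.
FIELDS = ["time", "deviceID", "issue", "technicianID",
          "P", "T", "L", "production_efficiency"]

SAMPLE = """141| 46|jammed_rotor|29|-0.023390762068|-0.22793915479|-0.074207414842|0.97591
149|941|jammed_rotor|77| 0.073194796329| 0.17941953011|-0.084087349562|0.97887
161|882| broken_gear|33|-0.427060000973| 0.02374669927| 0.271239440230|0.94935"""

def parse_repairs(text):
    """Parse pipe-delimited repair records into a list of typed dictionaries."""
    records = []
    for row in csv.reader(io.StringIO(text), delimiter="|"):
        if not row:
            continue  # skip blank lines
        record = dict(zip(FIELDS, (value.strip() for value in row)))
        for key in ("time", "deviceID", "technicianID"):
            record[key] = int(record[key])
        for key in ("P", "T", "L", "production_efficiency"):
            record[key] = float(record[key])
        records.append(record)
    return records

repairs = parse_repairs(SAMPLE)
print(repairs[0]["deviceID"], repairs[0]["issue"])  # 46 jammed_rotor
```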

It should also be noted that this toy model treatment of well failures does not track the parts and labor costs incurred at each repair. Equipment and labor costs are likely very significant, but they are ignored here, for now at least, in order to keep this first-draft toy model of PdM as simple as possible so that we can swiftly assess its main findings.

Aside: develop, debug, and visualize with Jupyter

All simulation visuals are generated via python code executing inside a Jupyter notebook, and to see those plus many more plots of the simulated RTF output, start Jupyter and execute the inspect_rtf.ipynb notebook. Jupyter is an excellent tool for developing and debugging code, especially when one spot-checks each paragraph of code by displaying variables or plotting arrays inline, which accelerates code development by shortening the debugging time.

Assessing RTF output

The probability that a well suffers any of the three fatal mock issues is controlled by the well’s location in the (P,T,L) parameter space, with wells by design being more likely to suffer the cracked_valve issue as P becomes large, more likely to experience a jammed_rotor when T becomes large, and more likely to suffer a broken_gear when L, |P|, or |T| become large. Another quantity of great interest to a well operator will be the production efficiency averaged across all 1,000 simulated wells:

Figure 3: Average production efficiency for 1000 simulated wells.

which quickly settles down to an equilibrium value of 87.5% (green line). That the wells’ average efficiency is significantly short of 100% is partly due to the mock crud that accumulates as wells operate, as well as the downtime that results when wells suffer intermittent fatal issues that require time to repair. Another quantity of interest is the maintenance technicians’ mean utilization,

Figure 4: Fraction of the 100 virtual technicians that are performing repairs on failed wells, versus time.

with the orange curve telling us that on average 81.6% of the 100 virtual repair technicians are performing repairs on failed oil/gas wells.
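That utilization figure can be roughly cross-checked against the numbers quoted earlier. The back-of-envelope sketch below uses the approximate count of 40 thousand repair records together with the simulation parameters; because that count is only approximate, the result is an estimate rather than an exact reproduction.

```python
n_devices = 1000
n_timesteps = 50_000
n_technicians = 100
repair_duration = 100
n_repairs = 40_000  # approximate number of records in the repair log

# Technician-timesteps spent repairing, divided by technician-timesteps available.
estimated_utilization = (n_repairs * repair_duration) / (n_technicians * n_timesteps)

# Average number of timesteps between successive failures of a single well.
timesteps_between_failures = n_devices * n_timesteps / n_repairs

print(f"estimated utilization: {estimated_utilization:.1%}")               # 80.0%
print(f"timesteps between failures: {timesteps_between_failures:.0f}")     # 1250
```

The estimate of about 80% is close to the 81.6% measured from the simulation output.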

Binary Classifiers for PdM

The next task is to build a binary classifier model, one for each fatal issue, with those models trained on the virtual wells’ telemetry and repair logs that were generated during the RTF simulation. To build, train, and test those binary classifiers, execute the build_models.ipynb notebook, and scan the notebook comments for details on intermediate steps (e.g. joining the telemetry and repair log data on timestamp & well-ID, computing each well’s time-to-next-issue, splitting the data into training and testing samples, training a random forest classifier on the training data, using the testing sample to assess the model accuracies, and then storing the PdM models). Each PdM classifier is designed to output a True or False to indicate whether the model thinks that a given well will experience the corresponding fatal issue during the next pdm_threshold_time=400 timesteps, as well as a confidence score that scales with the model's internally-assessed accuracy; that confidence score ranges from 0.5 (i.e., the model prediction is just an educated guess) to 1.0 (i.e., the model is extremely confident that its prediction is correct). To quantify model accuracy, the following shows the models' False positive rate versus the models' confidence scores:


Figure 5: The three PdM classifiers’ effective false positive rate versus model confidence score.

which tells us that when a prediction's confidence score exceeds 0.5, then the model will incorrectly flag a healthy well as being in danger of failing almost 14% of the time, but note that that error rate drops to less than 0.2% when the model confidence is near unity. These False positives negatively impact the wells' average production efficiency, since a False positive sends a well into maintenance prematurely, which also increases the load on the limited pool of virtual technicians.

The other possible erroneous model prediction is a False negative, and the models' False negative rate versus model confidence score is

Figure 6: The PdM classifiers’ effective false negative rate versus model confidence score.

False negatives also reduce the benefit of PdM, since the model does not detect a well's imminent failure, and if that persists across subsequent simulation timesteps then the well will suffer a catastrophic failure, which reduces the well’s production efficiency due to outage.
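One way to compute error rates as a function of confidence is sketched below. It assumes that the confidence score is the larger of the two class probabilities reported by a binary classifier, and it uses the standard definitions FP/(FP+TN) and FN/(FN+TP); the article's "effective" rates may be defined differently, and the sample probabilities are made up.

```python
def confidence(p_true):
    """Assumed confidence score: the probability of the predicted class (0.5 to 1.0)."""
    return max(p_true, 1.0 - p_true)

def error_rates(p_true_values, actual, min_confidence=0.5):
    """False positive and false negative rates among predictions whose
    confidence is at least min_confidence. Returns (fpr, fnr); an entry is
    None when its denominator is zero."""
    fp = tn = fn = tp = 0
    for p, truth in zip(p_true_values, actual):
        if confidence(p) < min_confidence:
            continue
        predicted = p >= 0.5
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif not predicted and truth:
            fn += 1
        else:
            tn += 1
    fpr = fp / (fp + tn) if (fp + tn) else None
    fnr = fn / (fn + tp) if (fn + tp) else None
    return fpr, fnr

# Made-up predicted probabilities of failure and the true outcomes.
p_true_values = [0.9, 0.6, 0.2, 0.55, 0.95, 0.1]
actual = [True, False, False, True, False, True]
for cutoff in (0.5, 0.85):
    print(cutoff, error_rates(p_true_values, actual, min_confidence=cutoff))
```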

PdM Simulation

Now that the PdM models have been built, execute the simulation again but in PdM mode:


which calls the same script, which instead reads a parameter file whose settings are identical to those used earlier except that strategy='pdm', which tells the simulator to use the three PdM classifiers to predict over time whether a given well will suffer any of the fatal issues during the subsequent pdm_threshold_time=400 timesteps. Those wells whose predictions indicate that a fatal issue is pending are then sent to preventative maintenance, where they are serviced by the next available technician. This PdM simulation also sets maintenance_duration=25, so wells receiving preventative maintenance spend only one-quarter as much time undergoing repairs as any well that suffers a fatal failure. This setting is intended to mimic the expected benefit of an orderly shutdown and repair, whereas wells that do suffer an uncontrolled fatal failure are assumed to experience greater damage that requires spending 4x longer being unproductive while being repaired.
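The dispatch rule described above might look something like the following sketch. The function name, the shape of the predictions dictionary, and the optional min_confidence cutoff are all hypothetical; the article does not say whether the simulator applies any confidence cutoff.

```python
def pending_issues(predictions, min_confidence=0.5):
    """Return the issues flagged as pending for one well.

    predictions maps issue name -> (will_fail, confidence), one entry per classifier.
    """
    return [issue for issue, (will_fail, conf) in predictions.items()
            if will_fail and conf >= min_confidence]

# Made-up classifier outputs for a single well at a single timestep.
predictions = {
    "jammed_rotor": (True, 0.93),
    "cracked_valve": (False, 0.88),
    "broken_gear": (True, 0.55),
}

flagged = pending_issues(predictions, min_confidence=0.9)
send_to_maintenance = bool(flagged)
print(flagged, send_to_maintenance)  # ['jammed_rotor'] True
```

Raising min_confidence trades false positives (premature maintenance) for false negatives (missed failures).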

The PdM simulation takes about 15 minutes to complete, and the simulated wells' sensor telemetry and repair logs are stored in telemetry_pdm.csv.gz  and repairs_pdm.csv.gz . To visualize that output and to assess the benefits of the PdM strategy simulated here, execute the inspect_pdm.ipynb notebook to generate this plot

Figure 7: The wells' mean production efficiency (blue) versus time, and the technician utilization fraction (green).

showing the wells’ mean production efficiency (blue) whose time-average is 89.1%, and the maintenance technicians’ utilization (green) whose time-average is 73.9%. Comparing these PdM numbers to those obtained earlier during the run-to-fail simulation (87.5% and 81.6%) shows that the PdM strategy developed here boosted the virtual wells’ output by only 1.6 percentage points. This in fact is the main lesson to be drawn from this simulation: that PdM for upstream oil and gas is difficult and, after much effort, might only generate a seemingly meager boost in productivity. Interestingly, the technicians' utilization was about 8 percentage points lower when PdM was used, which suggests that the greatest benefit of PdM for upstream oil and gas might instead be recognized as a significant reduction of the workload that the repair technicians experience, rather than as a boost in production.
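The comparison between the two runs is simple arithmetic on the time-averaged values quoted in this article, worked through here as both absolute differences (percentage points) and relative changes.

```python
# Time-averaged percentages quoted for the run-to-fail and PdM simulations.
rtf = {"production": 87.5, "utilization": 81.6}
pdm = {"production": 89.1, "utilization": 73.9}

production_gain_points = pdm["production"] - rtf["production"]
production_gain_relative = 100 * production_gain_points / rtf["production"]
utilization_drop_points = rtf["utilization"] - pdm["utilization"]
utilization_drop_relative = 100 * utilization_drop_points / rtf["utilization"]

# production: +1.6 points (1.8% relative)
print(f"production: +{production_gain_points:.1f} points "
      f"({production_gain_relative:.1f}% relative)")
# utilization: -7.7 points (9.4% relative)
print(f"utilization: -{utilization_drop_points:.1f} points "
      f"({utilization_drop_relative:.1f}% relative)")
```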

Sensitivity to Model Parameters

This toy-model simulation of PdM has many adjustable parameters, and a survey of a suite of various simulation parameters shows that the ratio of the time that a well spends in preventative maintenance (the maintenance_duration parameter) to the time spent recovering from a catastrophic fatal issue (repair_duration parameter) has the greatest impact on PdM outcomes. PdM sensitivity to the ratio of those two parameters is explored in the simulations/maintenance_duration/production.ipynb notebook that generates this plot of the PdM-managed well’s production efficiency versus that ratio:

Figure 8: Wells' mean production efficiency versus the ratio of the maintenance_duration/repair_duration parameters

with the above showing that a well operator only sees production boosted by PdM when maintenance_duration < 45% of repair_duration. But note that even in an extreme case, when downtime due to preventative maintenance is only 5% of that due to catastrophic failure, the resulting production boost is still only about 3%. Nonetheless, this simulation survey also shows that the technicians' workload can be significantly lessened when maintenance_duration/repair_duration gets small:


Figure 9: Mean technician utilization versus the ratio of the maintenance_duration/repair_duration parameters

with technician utilization as low as 50% when maintenance_duration = 2% of repair_duration.
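A parameter survey like this one can be set up by generating one parameter set per ratio. The sketch below only builds the parameter dictionaries; how each run is launched is not shown in this article, so that step is omitted, and the helper name and the particular ratios chosen are hypothetical.

```python
# Baseline PdM parameters, using the values listed earlier in this article.
base_params = {
    "N_devices": 1000,
    "N_timesteps": 50000,
    "strategy": "pdm",
    "N_technicians": 100,
    "repair_duration": 100,
    "sensor_sigma": 0.01,
    "pdm_threshold_time": 400,
    "maintenance_duration": 25,
}

def sweep_maintenance_ratio(base, ratios):
    """Return one parameter dictionary per maintenance_duration/repair_duration ratio."""
    runs = []
    for ratio in ratios:
        params = dict(base)  # copy so the baseline is left untouched
        params["maintenance_duration"] = max(1, round(ratio * base["repair_duration"]))
        runs.append(params)
    return runs

runs = sweep_maintenance_ratio(base_params, [0.02, 0.05, 0.25, 0.45, 1.0])
print([run["maintenance_duration"] for run in runs])  # [2, 5, 25, 45, 100]
```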

Main Findings

  • A toy-model simulation is developed and used to assess the impact of a predictive maintenance strategy applied to upstream oil and gas. The model's key features include code that generates mock sensor telemetry plus repair logs emitted over time by numerous virtual oil & gas wells that produce petroleum while also suffering occasional failures, with productivity losses accounted for as failed wells wait to be repaired by a pool of virtual repair technicians. The simulated wells are first evolved using a run-to-fail maintenance strategy, and machine learning models are then trained on the RTF telemetry + repair output to predict whether a given well will suffer a particular fatal issue some time interval Δt hence. Then the simulated wells are evolved again in PdM mode, which uses those ML models to preferentially repair those wells likely to fail within time Δt. Jupyter notebooks are also used to inspect simulation output and to quantify the production boost that results from PdM.
  • When the PdM strategy is applied to the toy-model simulation of upstream oil and gas, only very modest gains in well productivity are achieved, about 1-3%.
  • Note that a firm producing a million barrels of oil/day worth $50/barrel that achieves a 2% production boost will still see its revenue boosted by roughly a million dollars/day, so PdM for upstream oil and gas is nonetheless worthwhile.
  • Additional modest boosts in well productivity can likely be achieved with additional improvements in the accuracy of the ML models.
  • These toy-model simulations also show that a PdM strategy can significantly reduce the workload experienced by the pool of virtual technicians that maintain and service those wells, with simulations showing workload reductions of 10-30% being possible.
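The revenue estimate in the findings above is straightforward to reproduce. The sketch below uses the production volume and price quoted there and evaluates the 1-3% range of production boosts; the function name is hypothetical.

```python
def extra_revenue_per_day(boost, barrels_per_day=1_000_000, price_per_barrel=50.0):
    """Additional daily revenue from a fractional production boost."""
    return boost * barrels_per_day * price_per_barrel

for boost in (0.01, 0.02, 0.03):
    print(f"{boost:.0%} boost -> ${extra_revenue_per_day(boost):,.0f} per day")
```

At a 2% boost this gives about one million dollars per day, matching the figure quoted above.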
Joe Hahn

Joe is a data scientist at Oracle’s Information Management (IM) team, and he specializes in delivering machine learning, analytics, and data visualization for customers in the Oracle Cloud. Joe received a PhD in physics from the University of Notre Dame, and also has many years of experience performing astronomy research and scientific computing.