Since wrapping up my series on Virtual Markets, I’ve had an opportunity to review and rewrite some of the codebase used to maintain and generate the reports published on the Projects page.
The target of my most recent rewrite, ‘FetchConvertJSONDataProd.py’, has long been responsible for gathering and maintaining a local copy of market data from the OSRS Wiki API. The responsibilities of this script have grown over time to include some light web scraping as well as running statistical reports on aggregate market data.
Despite numerous small revisions and refactoring attempts, it became clear to me after writing the data-fetching routine for ESI that this old script deserved a full rewrite. The result, ‘OSRSBuildDatabase.py’, has yielded substantial operational improvements, including but not limited to:
- Faster execution (2-3x per process)
- Effective performance scaling through multiprocessing
- Lower memory overhead (1/10th)
- Lower I/O overhead
- Permanent resolution of unwanted exceptions
- Improved resilience against injection vectors
- Greater flexibility in report generation options
Throughout this article, we’ll walk through how these improvements were achieved, drawing comparisons between our legacy and new code.
CSVs Are Not Enough
When I started my very first data analytics projects, most of the source data I worked with was published in comma-separated values (CSV) format. Between that and the format’s long-standing ubiquity in the workplace, I naturally found myself using it as my default for storing and working with tabular data.
This habit extended into the original iteration of this script, where the storage schema involved saving the results of each API call as a timestamped CSV file within directories named after their source endpoints. While this certainly ‘worked’, in retrospect this schema effectively emulated the form and shape of a database… without any of the real benefits.
Beyond the obvious storage I/O penalties of repeatedly parsing hundreds of CSV files, this approach also spawned some strange anti-patterns, such as passing data between scripts via pickled pandas dataframes (not recommended!). It became very clear to me that for this rewrite I needed to scrap the old approach entirely and commit to storing everything in a single (SQLite) database, as I have with recent projects.
To its credit, CSV is an astonishingly simple, readable, and portable format. For projects involving small datasets running on modern hardware, any difference in performance between CSVs and a proper database may be imperceptible. However, just because you can save everything as CSV doesn’t mean that you should; by storing everything in a single SQLite database for this rewrite, I gained:
- Lower I/O and memory overhead, resulting from the elimination of unnecessary parsing and conversions.
- Easy multiprocessing and the complete elimination of exceptions caused by improvisational pickling shenanigans, largely thanks to SQLite’s native support for concurrency (one writer, many readers with write-ahead logging enabled; see the sketch after this list).
- Far fewer lines of code and improved readability through the elimination of clunky routines.
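As a minimal sketch of that concurrency model (assuming the osrsmarketdata.sqlite database and marketdata table described in the next section already exist; the worker function here is purely illustrative), WAL lets several read-only connections query the database while a single writer appends new rows:
import sqlite3
from multiprocessing import Pool

DB_PATH = "osrsmarketdata.sqlite"

def count_rows(typeid):
    # each reader opens its own connection; with WAL enabled these reads can
    # proceed while a single writer elsewhere appends new rows
    con = sqlite3.connect(DB_PATH)
    try:
        return typeid, con.execute(
            "SELECT COUNT(*) FROM marketdata WHERE typeid = ?;", (typeid,)
        ).fetchone()[0]
    finally:
        con.close()

if __name__ == "__main__":
    writer = sqlite3.connect(DB_PATH)
    writer.execute("pragma journal_mode=wal")  # one writer, many readers
    with Pool(4) as pool:
        counts = pool.map(count_rows, [2, 6, 8])  # placeholder typeids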
With a single, simple database now at our disposal, how do we squeeze as much performance out of it as possible?
Building Better Indices
For all but the smallest of databases, appropriately constructed indices provide such a huge performance boost when running queries that they’re practically required. In the case of SQLite, nowhere is this better described than in the query planning documentation. So, for this rewrite, one crucial goal was determining an optimal indexing schema.
Let’s start by taking a look at the schema of our main table:
# create database file if it does not exist, enable write-ahead logging, create cursor/tables/columns/indices
marketdatadb = sqlite3.connect("osrsmarketdata.sqlite")
marketdatadb.execute('pragma journal_mode=wal')
cur = marketdatadb.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS marketdata (interval INT, timestamp INT, typeid INT, avgHighPrice INT, highPriceVolume INT, avgLowPrice INT, lowPriceVolume INT);")
- interval represents the number of seconds of market activity represented in the current row. The default settings include 300, 3600, and 86400, which correspond to the /5m, /1h, and /24h endpoints respectively.
- timestamp represents the UNIX timestamp demarcating the end of the current interval.
- typeid represents a unique item type.
- avgHighPrice, highPriceVolume, avgLowPrice, and lowPriceVolume all represent different market price/volume statistics, corresponding with the given interval, timestamp, and typeid.
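For context, here is roughly how rows could land in this table, assuming a timeseries payload shaped like {"data": {typeid: {avgHighPrice, ...}}, "timestamp": ...}; this is a simplified sketch, not the actual fetch routine from OSRSBuildDatabase.py:
# hypothetical helper: flatten one timeseries payload into rows and bulk-insert them;
# the real response handling in OSRSBuildDatabase.py is more involved than this
def store_payload(cur, payload, interval):
    rows = [
        (interval, payload["timestamp"], int(typeid),
         stats.get("avgHighPrice"), stats.get("highPriceVolume"),
         stats.get("avgLowPrice"), stats.get("lowPriceVolume"))
        for typeid, stats in payload["data"].items()
    ]
    cur.executemany("INSERT INTO marketdata VALUES (?, ?, ?, ?, ?, ?, ?);", rows)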
We can determine which values to index by following the WHERE clauses in our SQLite queries. Throughout the script, typeid, timestamp, and interval all appear in these clauses, making them great candidates for indexing. The latter four values, on the other hand, probably aren’t worth indexing since they never appear in a WHERE clause.
Next, we’ll get started on some benchmarking by indexing interval, timestamp, and typeid individually, in order to see how long the nine default ReportParameters take to run:
(preliminary test with no index)
.... (1hr+, terminated before completion)
cur.execute("CREATE INDEX IF NOT EXISTS marketdataindex ON marketdata (interval);")
... (5+ minutes, terminated before completion)
cur.execute("CREATE INDEX IF NOT EXISTS marketdataindex ON marketdata (timestamp);")
... (5+ minutes, terminated before completion)
cur.execute("CREATE INDEX IF NOT EXISTS marketdataindex ON marketdata (typeid);")
... 128.302 seconds
Based on this first series of tests, it appears that the single greatest leap in performance comes from indexing typeid. The script went from an indefinite (1h+) runtime with no index to completing in a little over two minutes with this single-column index.
When we consider the broader context of our data, this makes sense; given the default script settings, there are far more unique typeids (4000+) than unique intervals (3) or timestamps (~1000). Furthermore, the only WHERE clause involving typeid resides within the most-iterated block in the entire script (~4000 typeids * 9 reports). What would otherwise require over 36,000 full table scans without an index now only requires the equivalent of 9 full table scans, as the query planner knows precisely where each unique typeid lives without needing to scan the entire table to verify.
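If you’d rather confirm what the planner is doing than infer it from runtimes, SQLite’s EXPLAIN QUERY PLAN is a quick check (the value below is just a placeholder typeid):
# with the typeid index in place, the plan should report a SEARCH using that
# index rather than a full SCAN of marketdata
for row in cur.execute(
    "EXPLAIN QUERY PLAN SELECT avgHighPrice FROM marketdata WHERE typeid = ?;",
    (4151,),
):
    print(row)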
We can still do a lot better than this. Let’s continue by adding another value to our index:
cur.execute("CREATE INDEX IF NOT EXISTS marketdataindex ON marketdata (typeid, interval);")
... 48.184 seconds
cur.execute("CREATE INDEX IF NOT EXISTS marketdataindex ON marketdata (typeid, timestamp);")
... 60.732 seconds
While the improvement is not quite as dramatic, we still see a ~2.5x improvement in execution time moving from the single-column index to the two-column (typeid, interval) index. Despite there being more unique values for timestamp than for interval, we don’t see the same performance gains when indexing timestamp, mostly due to how the WHERE clauses in our script handle timestamps:
cur.execute("CREATE TABLE temptable AS SELECT timestamp, avgHighPrice, highPriceVolume, avgLowPrice, lowPriceVolume FROM marketdata WHERE typeid = ? AND interval = ? AND timestamp >= ? AND timestamp <= ?;", (typeid, Report.get("IntervalSize"), Report.get("FirstTimestamp"), Report.get("LastTimestamp")))
While this query always selects individual values for typeid and interval, timestamps are always selected as a range, which may include as few as one or as many as all available entries. In much the same way that you don’t need to consult a book’s index to read through a whole chapter once you know where the chapter starts, the SQLite query planner gains little from an indexed timestamp when it is already going to read every row within the range it has located.
That isn’t to say that we shouldn’t also include timestamp in our index. Including it as the third column provides some additional speed, as there are many reports which don’t include every available timestamp:
cur.execute("CREATE INDEX IF NOT EXISTS marketdataindex ON marketdata (typeid, interval, timestamp);")
... 29.442 seconds
While this might seem like a reasonable place to stop, we can actually do a tiny bit better. There are many WHERE clauses throughout the new script which only reference interval and timestamp. These types of queries can run faster against an index that doesn’t lead with typeid, eliminating a very small amount of binary-search overhead (an example of such a query follows the index definitions below):
cur.execute("CREATE INDEX IF NOT EXISTS marketdataindex ON marketdata (typeid, interval, timestamp);")
cur.execute("CREATE INDEX IF NOT EXISTS marketdataintervalindex ON marketdata (interval, timestamp);")
... 25.877 seconds
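For illustration, here is a hypothetical interval/timestamp-only query of the kind served by the second index (the column list and literal values are placeholders, not the script’s exact query):
# count a single interval's rows across a timestamp range; the (interval, timestamp)
# index lets the planner satisfy this WHERE clause without involving typeid at all
cur.execute(
    "SELECT COUNT(*) FROM marketdata WHERE interval = ? AND timestamp >= ? AND timestamp <= ?;",
    (86400, 1700000000, 1702592000),  # placeholder interval and UNIX timestamp bounds
)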
And this is where we’ll stop; while it might be tempting to index timestamp on its own, this won’t result in any performance gains since there are no queries in the script which only reference timestamp in the WHERE clause.
While maintaining multiple indices is a fairly niche technique, it can be employed in some scenarios to squeeze a few extra drops of performance out of queries. These methods may not apply to all workloads, and it is important to consider your use case, since excessive indexing adds write overhead and results in a slightly larger database.
Query Optimization
While a well-indexed database is a great place to start, our legacy script contained a lot of sub-optimal SQLite queries and data handling routines. In order to achieve the speeds recorded earlier, we need to consolidate and rework some of our code.
Let’s take the following line from our original script as an example, adapted to our new database schema:
# Mean calculations for volume.
tempcur.executescript('''CREATE TEMP TABLE MeanLowVolumeTable AS SELECT AVG(lowPriceVolume) AS MeanVolumeLow FROM temptable; CREATE TEMP TABLE MeanHighVolumeTable AS SELECT AVG(highPriceVolume) AS MeanVolumeHigh FROM temptable;''')
The cumulative time spent iterating over this line across the whole script amounts to about 3 seconds. Let’s compare this with a more optimized approach:
tempcur.execute("SELECT AVG(lowPriceVolume) AS MeanVolumeLow, AVG(highPriceVolume) AS MeanVolumeHigh FROM temptable;")
rows = tempcur.fetchall()
for row in rows:
resultdict["MeanVolumeLow"] = row[0]
resultdict["MeanVolumeHigh"] = row[1]
Despite effectively doing the exact same work, this routine cumulatively takes only 0.8 seconds, for a 3.75x improvement in execution speed.
So, what changes are responsible for this huge performance improvement? First, we should recognize that the original example actually contains two distinct queries which are each executed individually. Since they both reference the same source table, we can consolidate them into a single query:
tempcur.execute('''CREATE TEMP TABLE MeanLowVolumeTable AS SELECT AVG(lowPriceVolume) AS MeanVolumeLow, AVG(highPriceVolume) AS MeanVolumeHigh FROM temptable;''')
This seemingly simple consolidation reduces the cumulative runtime from 3 to 1.8 seconds. Since the block containing this query is iterated over tens of thousands of times, halving the number of times the cursor is invoked eliminates a lot of redundant read operations against the temporary table.
The remaining second of improvement comes from how the resulting data is handled. Instead of saving the query output to another temporary table, we capture it in a Python dictionary. This is not only more efficient, but also streamlines our multiprocessing implementation, since processes in a worker pool can easily pass these dictionaries back to the main process to be saved to a new table in our main SQLite database once all calculations are complete.
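As a rough sketch of that pattern (the function, table, and column names here are simplified stand-ins, not the rewrite’s actual ones), each worker returns a plain dictionary and the parent process performs a single batched write once every report is finished:
import sqlite3
from multiprocessing import Pool

def run_report(typeid):
    # hypothetical worker: the real per-typeid statistics would be computed here
    return {"typeid": typeid, "Type": "Weekly", "MeanLow": None}

if __name__ == "__main__":
    with Pool() as pool:
        ResultList = pool.map(run_report, [2, 6, 8])  # placeholder typeids
    con = sqlite3.connect("osrsmarketdata.sqlite")
    con.execute("CREATE TABLE IF NOT EXISTS marketstats (typeid INT, Type TEXT, MeanLow INT);")
    # one batched write back to the main database after all workers finish
    con.executemany(
        "INSERT OR REPLACE INTO marketstats (typeid, Type, MeanLow) VALUES (:typeid, :Type, :MeanLow);",
        ResultList,
    )
    con.commit()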
While the sum of all of these improvements turned out to be quite significant, we still have to contend with the cost of calculating median values. For reference, this is an example of what the median formula looks like in plain SQLite:
SELECT AVG(avgLowPrice) AS MedianLow FROM (SELECT avgLowPrice FROM temptable ORDER BY avgLowPrice LIMIT 2 - (SELECT COUNT(*) FROM temptable) % 2 OFFSET (SELECT (COUNT(*) - 1) / 2 FROM temptable));
It is as slow as it is ugly, and despite my best attempts at finding more efficient solutions using pandas and other tools, this appears to be what peak performance looks like. That doesn’t mean I’ll stop searching for median calculation optimization strategies, but that will be a battle for another day. What I did manage was a small but satisfying removal of a few seconds of runtime by incorporating a heuristic which skips calculating median volume in circumstances where, given a lack of records, the median must be equal to 0.
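For what it’s worth, here is one shape that shortcut could take, based on my reading of the heuristic (assuming intervals with no recorded trades count as zero volume; the threshold and names are illustrative, not the script’s exact condition):
# hypothetical shortcut: if fewer than half of the intervals covered by the report
# have any recorded trades, the missing (zero-volume) intervals outnumber the
# recorded ones, so the median volume must be 0 and the expensive query is skipped
rowcount = tempcur.execute("SELECT COUNT(*) FROM temptable;").fetchone()[0]
expected_rows = 2016  # placeholder: e.g. one week of /5m intervals
if rowcount * 2 < expected_rows:
    resultdict["MedianVolumeLow"] = 0
    resultdict["MedianVolumeHigh"] = 0
else:
    pass  # fall through to the median queries shown above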
All in all, while there are many different ways to generate output from SQLite, some approaches are far more efficient than others. Just as indices are necessary to read data efficiently, queries must be deliberately written with appropriate scope and frequency in order to avoid wasting system resources on unnecessary read operations. Additionally, while it might be convenient to store and manipulate data entirely within SQLite via temporary tables, it is worth identifying and benchmarking the points where handing data back to Python is cheaper, if maximizing performance is the goal.
Parameterize Everything!
A direct and naive way of incorporating Python variables into SQLite queries is to use f-strings (formatted string literals). This also happens to be one of the worst possible methods due to the inherent risk of SQL injection it presents. While it might be easy to discount this risk if your code doesn’t use user-submitted fields, there is still the risk of vulnerable code being copied into a project that does.
Instead, Python values should be passed to SQLite using parameterization. This is best described in the Python documentation for the sqlite3 module, with further details provided in the SQLite syntax documentation. This approach shrinks the attack surface for SQL injection by preventing passed values from modifying the query syntax.
Parameterization with Python/SQLite also provides quite a few practical advantages over using f-strings. Let’s take a list of dictionaries named ResultList and walk through two methods we could employ to insert specific dictionary values into a SQLite table.
First, an example using f-strings:
for reportresult in ResultList:
    cur.execute(f"INSERT OR REPLACE INTO marketstatstest (Type, typeid) VALUES ('{reportresult.get('Type')}', '{reportresult.get('typeid')}')")
Then, an example using parameterization:
cur.executemany("INSERT OR REPLACE INTO marketstatstest VALUES (:typeid, :Type);", ResultList)
While both of these approaches accomplish the same goal, the second example:
- Is easier to read
- Executes three times as quickly
- Handles iteration over the entire list natively via executemany
- Is more resilient to SQL injection
Despite these advantages, there are some limitations when setting parameters. First, parameterization is not a silver bullet against user input, which should still be sanitized in order to prevent malevolent data from being saved to databases and potentially used downstream.
Secondly, parameters cannot be used for column or table names. While this limitation did not present an issue for this rewrite, my ESI data collection routine for EvE Online does unfortunately require the use of f-strings in SQLite queries; since the database schema for that project involves naming tables based on a list of starting parameters (region name), parameterization is not currently possible. Despite not accepting user input, my long-term goal is to change the database schema of that project in order to adhere to best practice. Until then, I’ve included integer/string sanitization routines in all of my scripts that still require the use of f-strings for SQLite.
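As a rough sketch of that kind of guard (illustrative only; the exact checks in my scripts differ), a table-name fragment can be validated against a strict pattern before it ever reaches an f-string:
import re
import sqlite3

def safe_table_name(name):
    # hypothetical guard: only allow simple alphanumeric/underscore identifiers
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"refusing suspicious table name: {name!r}")
    return name

con = sqlite3.connect("esimarketdata.sqlite")  # placeholder database name
region = safe_table_name("TheForge")           # placeholder region-derived name
# the table name is validated above; row values should still be parameterized
con.execute(f"CREATE TABLE IF NOT EXISTS {region} (typeid INT, price INT);")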
Regardless, parameterization should be treated as the default method for passing Python variables to SQLite queries. While f-strings are an extremely useful tool in Python, their use within SQL queries should generally be avoided.
Accounting for Schema Change
As part of this rewrite, I decided to break the original reporting schema. While both the old and new schemas contain effectively the same data, the old schema is notably ‘flatter’, containing an extremely large number of columns unique to each statistical report. I decided to make this change mainly because tables with more rows and fewer columns are easier to read, understand, and index, but also because downstream functionality was already going to break due to the silly picklefile routine getting axed.
This puts it at odds with my downstream script, OSRSReportGen.py, which takes the statistical reports generated by OSRSBuildDatabase.py and uses that data along with the /latest endpoint to generate the published reports seen on the Projects page. While I could rewrite these downstream reports to accept the new schema, each of them has a corresponding article which would also require revision, and that would take a long time. So, I decided instead to simply re-incorporate the original ‘flattening’ into OSRSReportGen.py by way of a huge, ugly block of SQLite queries:
cur.execute("CREATE TABLE MasterTable(id TEXT, mappinglimit INT, mappingname TEXT, mappinghighalch INT, high INT, low INT, WeeklyMeanLow INT, WeeklyMeanHigh INT, WeeklyMeanVolumeLow TEXT, WeeklyMeanVolumeHigh TEXT, WeeklyMedianLow INT, WeeklyMedianHigh INT, WeeklyMedianVolumeLow INT, WeeklyMedianVolumeHigh INT, WeeklyMinLow INT, WeeklyMinHigh INT, WeeklyMaxLow INT, WeeklyMaxHigh INT, MonthlyMeanLow INT, MonthlyMeanHigh INT, MonthlyMeanVolumeLow TEXT, MonthlyMeanVolumeHigh TEXT, MonthlyMedianLow INT, MonthlyMedianHigh INT, MonthlyMedianVolumeLow INT, MonthlyMedianVolumeHigh INT, MonthlyMinLow INT, MonthlyMinHigh INT, MonthlyMaxLow INT, MonthlyMaxHigh INT, YearlyMeanLow INT, YearlyMeanHigh INT, YearlyMeanVolumeLow TEXT, YearlyMeanVolumeHigh TEXT, YearlyMedianLow INT, YearlyMedianHigh INT, YearlyMedianVolumeLow INT, YearlyMedianVolumeHigh INT, YearlyMinLow INT, YearlyMinHigh INT, YearlyMaxLow INT, YearlyMaxHigh INT, GranularDailyMeanLow INT, GranularDailyMeanHigh INT, GranularDailyMeanVolumeLow TEXT, GranularDailyMeanVolumeHigh TEXT, GranularDailyMedianLow INT, GranularDailyMedianHigh INT, GranularDailyMedianVolumeLow INT, GranularDailyMedianVolumeHigh INT, GranularDailyMinLow INT, GranularDailyMinHigh INT, GranularDailyMaxLow INT, GranularDailyMaxHigh INT, GranularBiweeklyMeanLow INT, GranularBiweeklyMeanHigh INT, GranularBiweeklyMeanVolumeLow TEXT, GranularBiweeklyMeanVolumeHigh TEXT, GranularBiweeklyMedianLow INT, GranularBiweeklyMedianHigh INT, GranularBiweeklyMedianVolumeLow INT, GranularBiweeklyMedianVolumeHigh INT, GranularBiweeklyMinLow INT, GranularBiweeklyMinHigh INT, GranularBiweeklyMaxLow INT, GranularBiweeklyMaxHigh INT, GranularMonthlyMeanLow INT, GranularMonthlyMeanHigh INT, GranularMonthlyMeanVolumeLow TEXT, GranularMonthlyMeanVolumeHigh TEXT, GranularMonthlyMedianLow INT, GranularMonthlyMedianHigh INT, GranularMonthlyMedianVolumeLow INT, GranularMonthlyMedianVolumeHigh INT, GranularMonthlyMinLow INT, GranularMonthlyMinHigh INT, GranularMonthlyMaxLow INT, GranularMonthlyMaxHigh INT, VeryGranularFiveMinuteMeanLow INT, VeryGranularFiveMinuteMeanHigh INT, VeryGranularFiveMinuteMeanVolumeLow TEXT, VeryGranularFiveMinuteMeanVolumeHigh TEXT, VeryGranularFiveMinuteMedianLow INT, VeryGranularFiveMinuteMedianHigh INT, VeryGranularFiveMinuteMedianVolumeLow INT, VeryGranularFiveMinuteMedianVolumeHigh INT, VeryGranularFiveMinuteMinLow INT, VeryGranularFiveMinuteMinHigh INT, VeryGranularFiveMinuteMaxLow INT, VeryGranularFiveMinuteMaxHigh INT, VeryGranularHourlyMeanLow INT, VeryGranularHourlyMeanHigh INT, VeryGranularHourlyMeanVolumeLow TEXT, VeryGranularHourlyMeanVolumeHigh TEXT, VeryGranularHourlyMedianLow INT, VeryGranularHourlyMedianHigh INT, VeryGranularHourlyMedianVolumeLow INT, VeryGranularHourlyMedianVolumeHigh INT, VeryGranularHourlyMinLow INT, VeryGranularHourlyMinHigh INT, VeryGranularHourlyMaxLow INT, VeryGranularHourlyMaxHigh INT, VeryGranularDailyMeanLow INT, VeryGranularDailyMeanHigh INT, VeryGranularDailyMeanVolumeLow TEXT, VeryGranularDailyMeanVolumeHigh TEXT, VeryGranularDailyMedianLow INT, VeryGranularDailyMedianHigh INT, VeryGranularDailyMedianVolumeLow INT, VeryGranularDailyMedianVolumeHigh INT, VeryGranularDailyMinLow INT, VeryGranularDailyMinHigh INT, VeryGranularDailyMaxLow INT, VeryGranularDailyMaxHigh INT, ProductName TEXT, RecipeType TEXT, QtyProduced INT, ProcessingCost INT, ingredient1id INT, ingredient1Qty TEXT, ingredient2id INT, ingredient2Qty TEXT, ingredient3id INT, ingredient3Qty TEXT);")
cur.execute("INSERT INTO MasterTable(id, mappinglimit, mappingname, mappinghighalch, high, low) SELECT typeid, buylimit, name, highalch, high, low FROM Mapping RIGHT JOIN latest ON Mapping.typeid = latest.id;")
cur.execute("UPDATE MasterTable SET WeeklyMeanLow = MeanLow, WeeklyMeanHigh = MeanHigh, WeeklyMeanVolumeLow = MeanVolumeLow, WeeklyMeanVolumeHigh = MeanVolumeHigh, WeeklyMedianLow = MedianLow, WeeklyMedianHigh = MedianHigh, WeeklyMedianVolumeLow = MedianVolumeLow, WeeklyMedianVolumeHigh = MedianVolumeHigh, WeeklyMinLow = MinLow, WeeklyMinHigh = MinHigh, WeeklyMaxLow = MaxLow, WeeklyMaxHigh = MaxHigh FROM marketstats WHERE Type = 'Weekly' AND id = typeid;")
cur.execute("UPDATE MasterTable SET MonthlyMeanLow = MeanLow, MonthlyMeanHigh = MeanHigh, MonthlyMeanVolumeLow = MeanVolumeLow, MonthlyMeanVolumeHigh = MeanVolumeHigh, MonthlyMedianLow = MedianLow, MonthlyMedianHigh = MedianHigh, MonthlyMedianVolumeLow = MedianVolumeLow, MonthlyMedianVolumeHigh = MedianVolumeHigh, MonthlyMinLow = MinLow, MonthlyMinHigh = MinHigh, MonthlyMaxLow = MaxLow, MonthlyMaxHigh = MaxHigh FROM marketstats WHERE Type = 'Monthly' AND id = typeid;")
cur.execute("UPDATE MasterTable SET YearlyMeanLow = MeanLow, YearlyMeanHigh = MeanHigh, YearlyMeanVolumeLow = MeanVolumeLow, YearlyMeanVolumeHigh = MeanVolumeHigh, YearlyMedianLow = MedianLow, YearlyMedianHigh = MedianHigh, YearlyMedianVolumeLow = MedianVolumeLow, YearlyMedianVolumeHigh = MedianVolumeHigh, YearlyMinLow = MinLow, YearlyMinHigh = MinHigh, YearlyMaxLow = MaxLow, YearlyMaxHigh = MaxHigh FROM marketstats WHERE Type = 'Yearly' AND id = typeid;")
cur.execute("UPDATE MasterTable SET GranularDailyMeanLow = MeanLow, GranularDailyMeanHigh = MeanHigh, GranularDailyMeanVolumeLow = MeanVolumeLow, GranularDailyMeanVolumeHigh = MeanVolumeHigh, GranularDailyMedianLow = MedianLow, GranularDailyMedianHigh = MedianHigh, GranularDailyMedianVolumeLow = MedianVolumeLow, GranularDailyMedianVolumeHigh = MedianVolumeHigh, GranularDailyMinLow = MinLow, GranularDailyMinHigh = MinHigh, GranularDailyMaxLow = MaxLow, GranularDailyMaxHigh = MaxHigh FROM marketstats WHERE Type = 'GranularDaily' AND id = typeid;")
cur.execute("UPDATE MasterTable SET GranularBiweeklyMeanLow = MeanLow, GranularBiweeklyMeanHigh = MeanHigh, GranularBiweeklyMeanVolumeLow = MeanVolumeLow, GranularBiweeklyMeanVolumeHigh = MeanVolumeHigh, GranularBiweeklyMedianLow = MedianLow, GranularBiweeklyMedianHigh = MedianHigh, GranularBiweeklyMedianVolumeLow = MedianVolumeLow, GranularBiweeklyMedianVolumeHigh = MedianVolumeHigh, GranularBiweeklyMinLow = MinLow, GranularBiweeklyMinHigh = MinHigh, GranularBiweeklyMaxLow = MaxLow, GranularBiweeklyMaxHigh = MaxHigh FROM marketstats WHERE Type = 'GranularBiweekly' AND id = typeid;")
cur.execute("UPDATE MasterTable SET GranularMonthlyMeanLow = MeanLow, GranularMonthlyMeanHigh = MeanHigh, GranularMonthlyMeanVolumeLow = MeanVolumeLow, GranularMonthlyMeanVolumeHigh = MeanVolumeHigh, GranularMonthlyMedianLow = MedianLow, GranularMonthlyMedianHigh = MedianHigh, GranularMonthlyMedianVolumeLow = MedianVolumeLow, GranularMonthlyMedianVolumeHigh = MedianVolumeHigh, GranularMonthlyMinLow = MinLow, GranularMonthlyMinHigh = MinHigh, GranularMonthlyMaxLow = MaxLow, GranularMonthlyMaxHigh = MaxHigh FROM marketstats WHERE Type = 'GranularMonthly' AND id = typeid;")
cur.execute("UPDATE MasterTable SET VeryGranularFiveMinuteMeanLow = MeanLow, VeryGranularFiveMinuteMeanHigh = MeanHigh, VeryGranularFiveMinuteMeanVolumeLow = MeanVolumeLow, VeryGranularFiveMinuteMeanVolumeHigh = MeanVolumeHigh, VeryGranularFiveMinuteMedianLow = MedianLow, VeryGranularFiveMinuteMedianHigh = MedianHigh, VeryGranularFiveMinuteMedianVolumeLow = MedianVolumeLow, VeryGranularFiveMinuteMedianVolumeHigh = MedianVolumeHigh, VeryGranularFiveMinuteMinLow = MinLow, VeryGranularFiveMinuteMinHigh = MinHigh, VeryGranularFiveMinuteMaxLow = MaxLow, VeryGranularFiveMinuteMaxHigh = MaxHigh FROM marketstats WHERE Type = 'VeryGranularFiveMinute' AND id = typeid;")
cur.execute("UPDATE MasterTable SET VeryGranularHourlyMeanLow = MeanLow, VeryGranularHourlyMeanHigh = MeanHigh, VeryGranularHourlyMeanVolumeLow = MeanVolumeLow, VeryGranularHourlyMeanVolumeHigh = MeanVolumeHigh, VeryGranularHourlyMedianLow = MedianLow, VeryGranularHourlyMedianHigh = MedianHigh, VeryGranularHourlyMedianVolumeLow = MedianVolumeLow, VeryGranularHourlyMedianVolumeHigh = MedianVolumeHigh, VeryGranularHourlyMinLow = MinLow, VeryGranularHourlyMinHigh = MinHigh, VeryGranularHourlyMaxLow = MaxLow, VeryGranularHourlyMaxHigh = MaxHigh FROM marketstats WHERE Type = 'VeryGranularHourly' AND id = typeid;")
cur.execute("UPDATE MasterTable SET VeryGranularDailyMeanLow = MeanLow, VeryGranularDailyMeanHigh = MeanHigh, VeryGranularDailyMeanVolumeLow = MeanVolumeLow, VeryGranularDailyMeanVolumeHigh = MeanVolumeHigh, VeryGranularDailyMedianLow = MedianLow, VeryGranularDailyMedianHigh = MedianHigh, VeryGranularDailyMedianVolumeLow = MedianVolumeLow, VeryGranularDailyMedianVolumeHigh = MedianVolumeHigh, VeryGranularDailyMinLow = MinLow, VeryGranularDailyMinHigh = MinHigh, VeryGranularDailyMaxLow = MaxLow, VeryGranularDailyMaxHigh = MaxHigh FROM marketstats WHERE Type = 'VeryGranularDaily' AND id = typeid;")
cur.execute("UPDATE MasterTable SET ProductName = Recipes.ProductName, RecipeType = Recipes.RecipeType, QtyProduced = Recipes.QtyProduced, ProcessingCost = Recipes.ProcessingCost, ingredient1id = Recipes.ingredient1id, ingredient1Qty = Recipes.ingredient1Qty, ingredient1Qty = Recipes.ingredient1Qty, ingredient2id = Recipes.ingredient2id, ingredient2Qty = Recipes.ingredient2Qty, ingredient3id = Recipes.ingredient3id, ingredient3Qty = Recipes.ingredient3Qty FROM Recipes WHERE MasterTable.id = Recipes.id;")
While extremely ugly, this took virtually no time to hobble together, with the whole script still executing in about a second or two. Since this reporting script works against a data set that is orders of magnitude smaller than the raw market data churned by the script we just spent the better part of this article enhancing, we can take some liberties with inefficient code if it means getting it deployed fast and to the original specification.
The lesson of this section is to always take downstream changes into account. Anticipate when and whether downstream tools will break following a schema change, and have a plan. While I will gladly break and rebuild my own workflows, people working in professional environments will not take kindly to reports suddenly failing because you wanted to save some money on cloud compute. Ancient, inefficient, ugly code that accomplishes the task is worth infinitely more than great code that does the wrong thing, even when the ‘wrong thing’ amounts to little more than a difference in sorting.
Closing Notes
Rewriting legacy code can not only provide operational improvements to your runbook; it also serves as an excellent practice exercise and an opportunity to challenge old habits and apply new knowledge to old routines.
Thank you for reading!