Virtual Markets, Part Eight: Station Trading

With our database now accruing EvE Online market data from ESI, it’s time to start putting it to work.

In this article, we will:

  • Look back at some strategies we’ve employed in the past for Old School RuneScape and determine how well they translate to the markets of EvE Online
  • Pick a trade strategy and build a reporting workflow using our new database of ESI market data
  • Test our strategy in-game and assess potential for improvement
  • Implement more advanced reporting techniques, including linear regression analysis

Assessing Past Techniques

So far, we’ve applied a handful of different reporting strategies against OSRS market data. Let’s run through and assess how they might apply to EvE Online:

Bid-Ask Spread

In our second article on virtual markets, we developed a reporting strategy which capitalizes on the Bid-Ask, or “high-low” spread. As a recap, the Bid price is the highest price that a trader is currently willing to pay to acquire a commodity, while the Ask price is the lowest price that a trader is currently willing to accept in exchange for the same commodity. The difference between these prices is referred to as the spread, and by buying low and selling high within this spread, a player can obtain the spread, sans fees, as profit.

The Bid-Ask spread is an elementary market inefficiency present in most real and virtual exchanges, with the markets of EvE Online being no exception. Trading on the Bid-Ask spread in EvE Online is colloquially referred to as station trading; by placing buy (bid) and sell (ask) orders at the same station, players can participate in the game’s economy with a minimal understanding of the game’s mechanics beyond the market trading interface. As a result, station trading presents a highly-accessible and low-effort market opportunity featuring variable risk and reward.

Price Floors

In our third article on virtual markets, we developed a reporting strategy capitalizing on implicit price floors resulting from currency faucets. In the case of OSRS, we developed a proof illustrating how the game mechanic of alchemy enforces price floors due to the direct conversion of commodities into currency, functioning similarly to agricultural price subsidies in real-world economies.

While EvE Online has no direct equivalent to alchemy, there are plenty of faucets and sinks governing the flow of currency and commodities into and out of the game, resulting in the enforcement of price floors and ceilings which may be worth investigating and reporting against.

Supply Shocks

In our fourth article on virtual markets, we developed a reporting tool which detects sudden dips in commodity prices. This strategy capitalizes on various Grand Exchange mechanics and restrictions which impose a supply-side bias on the market, resulting in frequent temporary dips in commodity prices.

In EvE Online, players are not subject to buy limits and enjoy the freedom to place a far greater quantity of concurrent market orders. Furthermore, in-contrast to the opaque dark pool presented by the Grand Exchange of OSRS, players in EvE have access to a transparent and open order book. The result is that markets in EvE Online do not possess the same structural market inefficiencies and resulting biases as the Grand Exchange. While monitoring price spikes and dips in EvE Online might be interesting, the different game mechanics preclude this strategy being used for the same purposes as in OSRS.

Low-Effort Processing

In our fifth article on virtual markets, we identified various low-effort processing tasks and built a reporting strategy around a formula table in-order to facilitate near-arbitrage trade.

While EvE Online possesses many manufacturing processes which could be considered low-effort in a vacuum, the industry mechanics in-aggregate present a dense mosaic of opportunity cost considerations which do not apply to OSRS. While we will spend some time in the coming articles peeling back some layers, the vastly different manufacturing mechanics between the two games preclude drawing any direct parallel here.

Starting-Off Simple: Station Trading

Between these four trade strategies, we’ve identified two methods which present direct or close parallels between OSRS and EvE Online: trading against the bid-ask spread (or station trading), and trading against price floors/ceilings.

Between these two trading strategies, station trading presents a better starting point; it can be performed with a minimal understanding of game mechanics and a narrow set of core requisite in-game trading skills. It will also help establish some trading fundamentals before we move into more advanced trade strategies, including trading against price floors and ceilings or performing interregional trade which we will cover in future articles.

Prerequisites

Trading effectively in EvE Online requires that you train your character’s skills. The basic mechanics of Skill training can be succinctly summarized as follows:

  • “Skill books” are purchased and consumed by characters which unlock corresponding skills
  • Skills are comprised of 5 levels denoted by roman numerals, which are trained passively in real-time whether or not you are signed into the game
  • Different skills have different multipliers which dictate how long they take to train; the largest multipliers can result in skills needing months of time to reach level V.
  • Skill training duration can be reduced or eliminated by various means and/or expense.

Effective station trading in EvE Online requires training a small handful of skills. We’ll start by reviewing the trade skills which influence market fees:

  • Accounting: This skill determines the “sales tax” players pay whenever a unit of a commodity is sold on a market. This starts at 8% at level 0 and can be reduced to as low as 3.6% at level V.
  • Broker Relations: This skill determines the “brokerage fees” players pay whenever an order (whether Buy or Sell) is originated and not instantly executed at an NPC-aligned station. This starts at 3% at level 0 and can be reduced to as low as 1.5% at level V.
  • Advanced Broker Relations: This skill determines the “relist fees” players pay whenever an existing order is changed. Generally speaking, players will pay between 20-50% of an order’s brokerage fee as a relist fee any time an order is changed, with the percentage being reduced by 6% per level trained.

Training each of these to level V should be a primary goal for any trade-oriented character, since doing so cuts base fees by more than half and directly improves a trader’s bottom line. Accounting is in most circumstances the most important of the three, since sales taxes comprise the largest portion of all trading fees applied against orders which successfully clear. Broker Relations is less impactful but still important to train when using NPC-aligned stations, since brokerage fees are never refunded even if an originated order never clears. Advanced Broker Relations is especially important when trading items competitively, since repeatedly relisting an item in-order to under or over-cut the competition will eat into profit margins.

While it is possible for the brokerage fee portion of this rate to be marginally reduced further through other game mechanics, including improving NPC standings or leveraging player-owned upwell structures to originate orders, for the purposes of all of our future analysis and reporting we will assume a base level of V in Accounting and Broker Relations and use the following fee schedule when trading on markets (a short sketch showing how these figures are derived follows the list):

  • Originating and clearing a Sell order: 5.1%
  • Originating and clearing a Buy order: 1.5%
  • Clearing an existing Buy order or “insta-selling”: 3.6%
  • Clearing an existing Sell order or “insta-buying”: 0%
  • Relisting an order: 0.3% (base) + 1.5% (any increase in price)
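
As a quick sanity check, here is a minimal sketch (not part of the article’s scripts; the names are mine) showing how this schedule falls out of the level V rates quoted above, 3.6% sales tax and 1.5% brokerage:

# assumed level V rates from the skills above
SALES_TAX = 0.036    # Accounting V
BROKER_FEE = 0.015   # Broker Relations V, NPC-aligned station

fee_schedule = {
	"originate_and_clear_sell": BROKER_FEE + SALES_TAX,  # 5.1%: brokerage on origination plus sales tax on clearing
	"originate_and_clear_buy": BROKER_FEE,               # 1.5%: brokerage only, no sales tax on purchases
	"insta_sell": SALES_TAX,                             # 3.6%: clearing an existing Buy order, no brokerage
	"insta_buy": 0.0,                                    # 0%: clearing an existing Sell order
}
print(fee_schedule)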

Onto more trade-related skills: Trade, Retail, Wholesale and Tycoon all increase the number of concurrent orders a character can have open on the market, which is crucial for scaling trade operations.

Finally, Marketing, Procurement, Daytrading and Visibility all allow remote origination and modification of orders. In the context of station trading they’re not very important; however, having a few levels in these skills can be useful in some circumstances, namely if traders want to leverage lower brokerage fees offered by player-owned upwell structures.

Besides having these skills trained, trade characters will need some starting capital and will need to be docked at or near a trading hub. For the purpose of our exercise, we will be reporting against the region The Forge.

Reporting Goals

As a recap, our basic goal when trading against the bid-ask spread is to identify goods which do the following (a short code sketch follows the list):

  • Maximize profit: Volume traded * (Ask Price – Bid Price – Fees)
  • Maximize return on investment (ROI): (Ask Price – Bid Price – Fees) / Bid Price
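
Expressed as plain Python (hypothetical helper functions, not part of the article’s scripts), those two goals look like this:

def potential_profit(volume, ask, bid, fees):
	# absolute ISK obtainable per day: units moved times per-unit margin
	return volume * (ask - bid - fees)

def roi(ask, bid, fees):
	# per-unit margin relative to the capital tied up in the Buy order
	return (ask - bid - fees) / bid

# arbitrary example: 500 units/day, ask 100 ISK, bid 80 ISK, 6.3 ISK in fees per unit
print(potential_profit(500, 100, 80, 6.3))  # 6850.0 ISK per day
print(roi(100, 80, 6.3) * 100)              # ~17.1% ROI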

The importance of each of these values depends on a player’s individual context; players with limited capital and lower risk tolerance will benefit more by maximizing ROI, while players with more capital than they can spend will benefit more by maximizing absolute profit. Calculating both of these will be a primary reporting goal for this article since both values are important in determining the value proposition of a trade opportunity.

Certain considerations for OSRS such as buy limits and limited open order slots no longer apply in the same way to EvE Online, however as we start building our queries we will discover new factors which will need to be taken into account.

With these key goals in-mind, let’s get started on some scripting examples and see what trade opportunities we can find.

Starting our Script: Calculating Profit and ROI

We’ll start by importing some modules and opening a read-only connection to the database we started building in our last article:

# modules
import sqlite3
import pandas as pd
import os

# change working directory to script location
pathtorunfile = os.path.dirname(__file__)
os.chdir(pathtorunfile)

# runtime
if __name__ == '__main__':

	## connect to primary database (READ-ONLY)
	marketdatadb = sqlite3.connect("file:marketdata.sqlite?mode=ro", uri=True)
	cur = marketdatadb.cursor()

Next, we’ll build two temporary tables. One will contain the most current order book data, and another will contain historical price and volume data from the past day for the region The Forge:

	## populate temporary table with current order book for The Forge
	cur.execute("CREATE TEMPORARY TABLE TheForge_now AS SELECT * FROM TheForge WHERE timestamp = (SELECT max(timestamp) FROM TheForge);")

	## populate temporary table with the most recent market history for The Forge
	cur.execute("CREATE TEMPORARY TABLE TheForgeHistory_now AS SELECT * FROM TheForgeHistory WHERE date = (SELECT DISTINCT date FROM TheForgeHistory ORDER BY date DESC LIMIT 1);")

Some additional queries reveal the prevailing highest Buy and lowest Sell orders on the current order book for each commodity:

	# obtain max buy price, min sell price for all type_ids
	cur.execute("CREATE TEMPORARY TABLE MaxBuyJita AS SELECT max(price) AS max_buy_price, * FROM TheForge_now WHERE is_buy_order = 1 GROUP BY type_id;")
	cur.execute("CREATE TEMPORARY TABLE MinSellJita AS SELECT min(price) AS min_sell_price, * FROM TheForge_now WHERE is_buy_order = 0 GROUP BY type_id;")

The difference between the highest Buy and lowest Sell order is our nominal spread. In-order to obtain our effective real spread, we need to apply the aforementioned trade fees. We’ll also use this opportunity to complete some additional housekeeping:

	# concatenate tables and calculate real spread
	cur.execute("CREATE TEMPORARY TABLE TheForge_now_concat AS SELECT (min_sell_price - max_buy_price - (0.051 * min_sell_price) - (0.015 * max_buy_price)) AS spread, * FROM TheForgeHistory_now, MaxBuyJita, MinSellJita, InvTypes WHERE MaxBuyJita.type_id = MinSellJita.type_id AND MaxBuyJita.type_id = TheForgeHistory_now.type_id AND MaxBuyJita.type_id = InvTypes.typeid AND MinSellJita.type_id = TheForgeHistory_now.type_id AND MinSellJita.type_id = InvTypes.typeid AND TheForgeHistory_now.type_id = InvTypes.typeid AND spread > 0;")

(0.015 * max_buy_price) represents the fee associated with originating a Buy order at the prevailing maximum Buy price, while (0.051 * min_sell_price) represents the fee associated with originating and successfully clearing a Sell order at the prevailing minimum Sell price. The WHERE conditional statement at the end serves to filter out any items whose real spread is not positive.

We’ve also used this opportunity to combine different table elements; InvTypes provides English item names and by concatenating all of our other temporary tables we eliminate any items which either had no trade volume in the past day or which aren’t presently represented by the order book.

Finally, we can now take the product of our spread and daily volume to obtain values for potential profit and ROI. Since a successful trade requires making both a purchase and a sale, our calculation for potential profit needs to be divided by two:

	# calculate potential profit and ROI percentage for all type_ids
	cur.execute("CREATE TEMPORARY TABLE TheForge_now_PotentialProfit AS SELECT volume * spread / 2 AS PotentialDailyProfit, (spread / max_buy_price) * 100 AS PctROI, type_id, typeName AS name FROM TheForge_now_concat ORDER BY PotentialDailyProfit DESC;")

	# print results to terminal
	testprint = pd.read_sql("SELECT * FROM TheForge_now_PotentialProfit;", marketdatadb)
	print(testprint)

At this point, we have effectively reached parity with our OSRS reporting, so it is time to start reviewing our report results to see if we need to make any tweaks to account for any EvE Online market idiosyncrasies.

Other Factors: Trade Bias and Competition

Executing our script reveals the following top results sorted by potential profitability:

      PotentialDailyProfit  ...                                         name
0             3.559500e+10  ...                 Metenox Moon Drill Blueprint
1             1.041767e+10  ...              Hakim's Modified Gyrostabilizer
2             8.885902e+09  ...  Estamel's Modified Ballistic Control System
3             6.711450e+09  ...                                    Avalanche
4             6.181000e+09  ...                          Avalanche Blueprint

These results are largely comprised of blueprints which are reusable components required in manufacturing. If we take a closer look at the highest-ranked item, “Metenox Moon Drill Blueprint”, we can better understand their position in our report results:

  • The minimum Sell order is listed at 6B ISK, while the maximum Buy order is listed at 600M ISK, resulting in a massive real spread, after fees, of roughly 5.1B ISK.
  • Volume is also substantial at this price, with 14 units exchanged in the past day
  • The product of these values places potential profitability at 35B ISK per day.

Like all things, if something seems too good to be true, it probably is. Station trading typically requires originating and clearing both Buy and Sell orders, however many blueprints feature item faucet mechanics which result in only Sell orders being cleared. This is due to the fact that these blueprints are exclusively brought into the game by NPC traders who sell them at fixed prices. If a player no longer needs their blueprints, they can be quickly sold back to the market at a price marginally below the fixed NPC rate. While some players might find success placing Buy orders at low-ball prices in-order to “fish” for blueprints as an extremely high-ROI, low absolute return trade strategy, this approach fails to maximize potential profit since Buy orders are seldom, if ever, cleared at prices far below the fixed list price.

So, how do we ensure that our report weighs results against the trade balance between both ends of the spread?

Building Heuristic Methods for Detecting Trade Bias

If you’ve read our reporting strategies for OSRS, you might remember that all of the historical market data endpoints provided by the OSRS Wiki API provide both ‘High’ and ‘Low’ volume and price data, representing the clearing of Sell and Buy orders respectively at each end of the spread. With this information, we can directly infer trade bias and adjust profit expectations accordingly.

The market history endpoint provided by ESI functions very similarly, however price and volume data is represented as an aggregate of both Buy and Sell order activity. Without any differentiation between the two, it is impossible to directly read trade bias.

Fortunately, we can overcome this limitation by leveraging the strengths of both the market history endpoint and the more granular, detailed, and recent order book in-order to infer trade bias. One way this can be achieved is by measuring the minimum distance between the average price value provided by the market history endpoint and the prevailing maximum Buy and minimum Sell orders defining the current spread. Trade which is heavily biased toward Sell orders will have an average price close or equal to the prevailing minimum Sell price at the top edge of the spread; this is explicitly evident in the case of “Metenox Moon Drill Blueprint”, which features an average price exactly matching the minimum Sell price of 6B ISK. Inversely, items featuring balanced trade will have an average price equidistant from the Buy and Sell ends of the spread. This weighting mechanism can be represented by a simple formula, returning the maximum nominal quantity of ISK a trader can capture as profit per unit:

min((min_sell_price - average), (average - max_buy_price))
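
To make this concrete, here the formula is evaluated in plain Python against the sell-biased “Metenox Moon Drill Blueprint” described above and against a hypothetical balanced item (the balanced-case numbers are mine):

# sell-biased case from above: min sell 6B, max buy 600M, daily average equal to the min sell price
print(min(6e9 - 6e9, 6e9 - 600e6))                 # 0.0 -> effectively nothing capturable per unit

# hypothetical balanced item: the daily average sits mid-spread
min_sell_price, max_buy_price, average = 110.0, 90.0, 100.0
print(min(min_sell_price - average, average - max_buy_price))  # 10.0 -> half of the 20 ISK nominal spread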

Implementing the aforementioned formula into our potential profit query is a pretty straightforward procedure since all of the necessary data is already contained in our concatenated table. “spread / 2” has been replaced by the bias-weighted expression shown below:

	# calculate potential profit and ROI percentage for all type_ids using historical volume from past day, weighed against Buy vs Sell bias
	cur.execute("CREATE TEMPORARY TABLE TheForge_now_PotentialProfit AS SELECT volume * (spread / (min_sell_price - max_buy_price)) * min((min_sell_price - average), (average - max_buy_price)) AS PotentialDailyProfit, (spread / max_buy_price) * 100 AS PctROI, type_id, typeName AS name FROM TheForge_now_concat ORDER BY PotentialDailyProfit DESC;")

We no longer need to divide by two, since the min function captures at most one half of the nominal spread. We do, however, need to multiply this result by the ratio of our real spread to the nominal spread in-order to appropriately account for the cost of fees.
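
Continuing the hypothetical balanced-item numbers from the sketch above, the fee adjustment works out as follows:

nominal_spread = 110.0 - 90.0                                    # min_sell_price - max_buy_price
real_spread = nominal_spread - 0.051 * 110.0 - 0.015 * 90.0      # spread after origination and clearing fees
capture = min(110.0 - 100.0, 100.0 - 90.0)                       # bias-weighted capture per unit
print(capture * (real_spread / nominal_spread))                  # ~6.52 ISK of fee-adjusted profit per unit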

Executing our modified script now returns the following top five results:

...
PotentialDailyProfit     PctROI              type_id   name
10245011321.5973         108.545662933055    14538     Hakim's Modified Gyrostabilizer
7554784096.3323          75.3585421702074    14690     Estamel's Modified Ballistic Control System
3704275957.0784          366.181318681319    83382     Uncompromising Scarlet Gloss - Unlimited
3466308006.99379         23.475374732334     18698     Core X-Type 100MN Afterburner
1764970778.21687         87.1925696594427    28804     Mid-grade Harvest Delta

Reviewing the trade history for each of these items confirms that this basic heuristic method appears to have worked; the top-ranked items possess not only great profit potential per unit and trade volume, but also feature more balanced trade on both ends of the spread, providing station traders with the latitude to successfully originate and clear Buy and Sell orders within the spread.

Competition is Tough!

However, there is another devil hiding in the details; if we take a close look at the order book data for “Core X-Type 100MN Afterburner”, we will find that there are many market participants vying for control over each end of the spread. At the time of this writing, there are 16 Sell orders for this item in The Forge, 7 of which have been updated within the past day. There is likewise an even greater quantity of Buy orders.

In OSRS, competition was never a prominent trade factor due to a number of constraints including buy limits, limited open order slots, and the opaque order book of the Grand Exchange. In-contrast, competition is a huge factor when trading in EvE Online; there are no buy limits, more order slots are available, and the order book is completely transparent, permitting players to directly see spread magnitude and view each others’ orders. This results in emergent trade behavior commonly referred to as “0.01-ISKing”, whereby players compete for access to the edge of the spread through repeated marginal over and under-cutting.

The phenomenon of 0.01-ISKing has been mitigated by the game developers in recent years with the introduction of relist fees and price precision restrictions, however it still plays a substantial role in the market, as we can see with “Core X-Type 100MN Afterburner” and many other items where traders constantly step over each other for access to the edge of the spread.

So, how does competition impact the efficacy of station trading, and how do we create and apply a weight against our reporting with this information?

Since all successful trade activity occurs within the spread, and since the spread is always defined at one point in-time by both the single highest Buy order and the single lowest Sell order, only one player may control one side of the spread at a time. As a result, as more participants enter a market and compete for access to the spread, they necessarily crowd each other out, splitting a finite quantity of time into smaller pieces. Assuming that all active traders spend an equal amount of time monitoring and updating their orders, this means that the potential trade volume available to a single trader on either side of the spread is divided by the total number of active participants.

Counting the total number of trade participants seems like the most straightforward approach for defining who our competition is; however, not all traders are made equal. As we discussed earlier, some players set low-ball, seldom-updated Buy orders which don’t pose any meaningful competition in a market.

So, we should filter out non-competitive market participants by excluding any orders which have either not been recently placed or updated, or which aren’t remotely trying to meet the current spread on price. We can start by excluding orders which have not been placed or updated in the past two days, while also excluding any orders which are at least 20% lower or higher in price than the respective Buy and Sell orders sitting at the edges of the current spread. Defining these thresholds is admittedly somewhat arbitrary and might be worth tweaking later, however for now they will serve as conservative metrics which will prevent the most lackadaisical trade behavior from influencing our reporting.

We now need to translate these ideas into queries; our calculation for potential profit needs to be weighed against competition, with competition being defined through order age and price thresholds.

Filtering the Competition

Starting with our order age threshold, the “issued” field in the order book data represents when an order was placed or most recently updated and is presented in the following format: 2024-06-17T03:07:57Z. Since this format is ISO 8601-compliant, we can natively use comparison operators within SQLite to filter chronologically by date.
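
As a quick illustration (arbitrary timestamps), lexicographic string comparison on this format agrees with chronological order:

# ISO 8601 timestamps with a fixed 'Z' suffix sort lexicographically in chronological order
print('2024-06-15T03:07:57Z' < '2024-06-17T03:07:57Z')  # True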

Since we need to filter any orders placed or updated more than two days ago, we can generate a corresponding timestamp using the datetime and timedelta modules in our runtime, which can be placed after the creation of our concat table and before we calculate potential profit:

	from datetime import datetime, timedelta
	TimeThreshold = datetime.now() - timedelta(days=2)

We can modify this a bit further so that it precisely matches the ISO 8601 formatting of the “issued” field, with trailing microseconds removed. Since ESI timestamps are expressed in UTC, we also generate our threshold in UTC:

	from datetime import datetime, timedelta, timezone
	# build the threshold in UTC to match the "issued" field, drop microseconds and the offset, then append 'Z'
	TimeThreshold = (datetime.now(timezone.utc) - timedelta(days=2)).replace(microsecond=0, tzinfo=None).isoformat() + 'Z'

Next, we can plug this value into two queries which will count all of the orders which we consider to be competitive for each type_id, taking into account order age and price relative to the current edges of the spread:

	cur.execute(f"CREATE TEMPORARY TABLE TheForge_now_buycompetitors AS SELECT count(TheForge_now.order_id) AS buycompetition_qty, TheForge_now.type_id AS type_id FROM TheForge_now, MaxBuyJita WHERE '{TimeThreshold}' < TheForge_now.issued AND TheForge_now.price > max_buy_price * 0.8 AND TheForge_now.is_buy_order = 1 AND TheForge_now.type_id = MaxBuyJita.type_id GROUP BY MaxBuyJita.type_id;")
	cur.execute(f"CREATE TEMPORARY TABLE TheForge_now_sellcompetitors AS SELECT count(TheForge_now.order_id) AS sellcompetition_qty, TheForge_now.type_id AS type_id FROM TheForge_now, MinSellJita WHERE '{TimeThreshold}' < TheForge_now.issued AND TheForge_now.price < min_sell_price * 1.2 AND TheForge_now.is_buy_order = 0 AND TheForge_now.type_id = MinSellJita.type_id GROUP BY MinSellJita.type_id;")

We can then use some LEFT JOIN clauses to create a new table, appending our competitor counts to our existing concat table:

	cur.execute("CREATE TEMPORARY TABLE TheForge_now_concat_competition AS SELECT * FROM TheForge_now_concat LEFT JOIN TheForge_now_buycompetitors ON TheForge_now_concat.type_id = TheForge_now_buycompetitors.type_id LEFT JOIN TheForge_now_sellcompetitors ON TheForge_now_concat.type_id = TheForge_now_sellcompetitors.type_id;")

There are some fields in our table which will have null values for Buy or Sell competition; this occurs when none of the existing orders at that end of the spread meet our threshold for being considered competitive. Since we ourselves are about to enter the market as a competitor, it is appropriate to increase the total count of competitors by 1, including in all of these null fields:

	cur.executescript("UPDATE TheForge_now_concat_competition SET buycompetition_qty = buycompetition_qty + 1 WHERE buycompetition_qty IS NOT NULL; UPDATE TheForge_now_concat_competition SET buycompetition_qty = 1 WHERE buycompetition_qty IS NULL;")
	cur.executescript("UPDATE TheForge_now_concat_competition SET sellcompetition_qty = sellcompetition_qty + 1 WHERE sellcompetition_qty IS NOT NULL; UPDATE TheForge_now_concat_competition SET sellcompetition_qty = 1 WHERE sellcompetition_qty IS NULL;")

Finally, we can now calculate potential profit, adjusted to now also weigh against competition:

	# calculate potential profit and ROI percentage for all type_ids using historical volume from past day, weighed against Buy vs Sell bias and competition
	cur.execute("CREATE TEMPORARY TABLE TheForge_PotentialProfit AS SELECT volume * (spread / (min_sell_price - max_buy_price)) * min((min_sell_price - average) / buycompetition_qty, (average - max_buy_price) / sellcompetition_qty) AS PotentialDailyProfit, (spread / max_buy_price) * 100 AS PctROI, type_id, typeName AS name FROM TheForge_now_concat_competition ORDER BY PotentialDailyProfit DESC;")

As a result, items which feature more fierce competition will fall further down in our results, with our potential profit figures more accurately representing what a trader can hope to capture in a market given variable participation by competitors:

PotentialDailyProfit     PctROI              type_id   name
3530109559.46758         108.545662933055    14538     Hakim's Modified Gyrostabilizer
1510956819.26646         75.3585421702074    14690     Estamel's Modified Ballistic Control System
372108094.811055         366.181318681319    83382     Uncompromising Scarlet Gloss - Unlimited
361105106.237149         102.551832873473    83065     Sefrim Gold Metallic - Unlimited
345795541.254125         63.3627737226277    52244     Veles Entropic Radiation Sink

After taking some time to review subsequent report results, there are no longer any obvious issues; our report results are now appropriately weighed against all of the factors we’ve identified, including spread, trade volume, Buy versus Sell trade bias, and competition, with the application of basic heuristic methods to infer otherwise missing information.

At this point, since there is nothing obvious left for us to do to improve our reporting schema, our next step will be in-game testing.

Game Testing Observations

No matter how hard we stare at our results, nothing beats real-world testing. So, I signed into EvE Online for the first time in five years to do some station trading in Jita 4-4 to put our methods to work.

After a few weeks of testing and making small reporting tweaks, my experiences were quite mixed; while I managed to find many profitable trade positions, I experienced just as many losses and also found a small portion of my in-game wealth locked into illiquid assets which will take me months or years to offload.

Throughout this testing, I identified three key errors and weaknesses in our reporting schema:

Order Book Spread != Profit

Our reporting uses the magnitude of the spread to calculate potential profit, which we’ve defined as the difference between the minimum Sell and maximum Buy price on the current order book, sans fees. While it is true that an adequate spread on the order book is necessary to ensure that profit can be attained, the magnitude of this spread beyond a certain point has very little to no direct causal link with profitability.

This makes a bit more sense when we consider two hypothetical spread scenarios, assuming the same equilibrium (average) price for an identical good:

  • Scenario A: Ask Price is $10, Bid Price is $5
  • Scenario B: Ask Price is $20, Bid Price is $2

According to our reporting, Scenario B will rank higher than Scenario A as it possesses a substantially larger spread. However, in reality there is far less trade going on at both ends of the spread in Scenario B, since fewer traders will capitulate and ‘bite’ at either end as the spread grows wider, resulting in lower trade volume and thus lower potential profit over time.

Players don’t want to get ripped off. While the game mechanics and singular monolithic dark pool presented by the Grand Exchange of OSRS allow us to directly use the spread when calculating profitability in that context, this approach doesn’t translate to EvE; players looking to buy or sell a good can see all open market orders and immediately identify large, predatory spreads. Even if a player isn’t looking at the regional market window when purchasing items, they are prompted with a window displaying the percentage above or below the ethereal “estimated price” the player will receive or spend if a trade is executed at current market prices. While this doesn’t always stop players from accepting the current spread, it will increase their likelihood of engaging in divergent behavior, including seeking better prices in other regions, purchasing alternative goods, or outright destroying the spread by splitting the difference and waiting for a matching offer; none of these outcomes are conducive to profiting from station trading.

Thus, a large spread on the order book does just as much to increase profit potential on a per-unit basis as it does to decrease profit potential by reducing favorable engagement and trade volume.

So, how can we obtain a better gauge of potential profitability if not by the spread? The answer lies in the market history endpoint; in-addition to providing a daily average, there are also values for “lowest” and “highest” indicating the lowest and highest prices executed against the target type_id for the reported day. Since these values represent trades which have successfully cleared, they provide a far better gauge for potential profitability than open market orders which have not and may never clear.

I was able to address this shortcoming in my reporting by slightly redefining the “real spread”. Instead of it simply being the difference between the maximum Buy and minimum Sell price on the current order book, I also calculated the difference between the “lowest” and “highest” prices in recent market history. The minimum between these two values then became our “real spread”, which helped moderate potential profit calculations by preventing over-estimation.
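
The article’s final query isn’t reproduced here, but as a rough sketch, the spread expression in our earlier concat query could be capped by the cleared-trade spread along these lines (assuming the history table’s “lowest” and “highest” columns are carried through the join, as they are in the original query):

	# hypothetical revision: take the smaller of the order book spread and the cleared-trade spread, both net of fees
	cur.execute("CREATE TEMPORARY TABLE TheForge_now_concat AS SELECT min((min_sell_price - max_buy_price - (0.051 * min_sell_price) - (0.015 * max_buy_price)), (highest - lowest - (0.051 * highest) - (0.015 * lowest))) AS spread, * FROM TheForgeHistory_now, MaxBuyJita, MinSellJita, InvTypes WHERE MaxBuyJita.type_id = MinSellJita.type_id AND MaxBuyJita.type_id = TheForgeHistory_now.type_id AND MaxBuyJita.type_id = InvTypes.typeid AND spread > 0;")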

Competition is Complicated

While trading, I found that the dynamics of competition are far more complicated than our model leads us to believe. Our heuristic is better than nothing but is woefully deficient in articulating how competing players impact the profitability of station trading.

First, I noted that the direct impact of competition on station trading is not actually a function of the number of market participants, but rather of the frequency with which orders are added and relisted. It doesn’t matter if there are 2 or 20 competitors; a single determined trader who cuts their way to the edge of the spread every 5 minutes is going to have a far greater impact on profitability than any number of lazy traders who update their orders every other day.

Next, in circumstances where repeated re-listing might be necessary to stay in the game, relist fees need to be taken into account. While these fees are small, if competition is fierce they can add up and severely cut into profit margins. In one circumstance I accrued hundreds of millions of ISK worth of relist fees engaging with another trader in a prolonged spat over access to the spread for a high-value asset.
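
As a rough back-of-the-envelope illustration using the relist fee from the schedule earlier (the numbers here are hypothetical):

# hypothetical: repeatedly relisting a 1-billion-ISK order during a price war, with no price increases
order_value = 1_000_000_000
relist_fee = 0.003 * order_value       # 0.3% base relist fee per update
updates_per_day = 20                   # a determined competitor forcing frequent under-cuts
print(relist_fee * updates_per_day)    # 60,000,000 ISK per day lost to relists alone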

An additional factor to consider is shrinkage of the spread resulting from competition; an item which possesses profit potential now can become totally unprofitable in a few hours if fierce competition destroys profit margins faster than anybody can actually move volume.

While I attempted to address some of these shortcomings within our existing report framework, I quickly found that a substantial overhaul of our script would be required to adequately cover all of these factors related to competition.

Recency Bias

Through my testing, I entered positions on many high-cost assets which had recently experienced a surge in trade volume. I later determined, after checking historical data, that many of these assets suffer from poor longer-term trade volume. While I may still make a decent return on investment on these items, the timeline is far longer than what my reporting originally led me to believe. During this prolonged waiting period, competitors have a large window of time to under-cut my sell orders, placing me and my illiquid assets in a vulnerable position.

The underlying phenomenon in our reporting causing this undesirable outcome is a form of recency bias. Since our current reporting schema uses an extremely limited scope (1 day) of historical market data to generate results, it fails to account for longer-term trends in trade volume. While it is important for our reporting to maintain some degree of recency, balance needs to be struck to avoid generating results which possess an extreme bias favoring only very recent trade activity.

While I was able to address this somewhat by interpolating results from multiple days of trade history, it became clear to me that, like accounting for competition, a broader overhaul would be needed to systemically address this concern in our reporting.

Game Testing Conclusions

Despite appearing adequate at a surface-level, conducting in-game testing revealed some major issues with our reporting assumptions and technique. As a result, we will start from scratch and use what we’ve learned to come-up with a plan which addresses our new scope of concerns.

Station Trading: 2nd Attempt

Setting New Goals

Based on the failures of our first attempt, our new approach needs to take all of the following into account:

  • Profit (Volume * Spread)
  • Trade bias (Buy vs. Sell)
  • More factors of competition
  • Recency bias and consideration of trade factors over a broader span of time

The main difference between our new and old reporting schemas will involve the use of time; while our first attempt looked at history for a single day or a single snapshot of the current order book, we now need to tap into the historical data we’ve been collecting in-order to consider changes over-time in our analysis.

In the past when reporting for OSRS, we leveraged historical data to generate mean/median statistics to establish baseline price and volume expectations. While this worked well for us in that context, for EvE Online we’re going to take a different approach.

Linear Regression Analysis

Trading in EvE Online takes time; between the time we place a Buy offer and see it cleared, things will have changed. This is likewise the case when we place and wait for a corresponding Sell order to clear. The problem with this latency is that it comes with any number of market changes; an increase in order change frequency or an extreme narrowing of the spread can put us upside-down on our speculative trading. A reduction in overall trade volume over-time can likewise see us holding onto a position far longer than desired.

We can address these concerns involving latency and recency by using linear regression analysis. With linear regression, we can take any specific variable such as price, volume or order age and plot it over time. The resulting least-squares slope provides a means for predicting where the variable will go in the future. While this approach has its limitations and cannot be fully relied upon to influence our trade activity, it will help us better validate our results and aid us in filtering-out potentially problematic trade scenarios.

We haven’t used linear regression modeling up-until now, mostly because it wasn’t necessary when developing strategies for OSRS. In that game, there is a larger body of players vying for a smaller variety of goods all within a single, monolithic dark pool. Such a market lends itself better to reactive strategies which capitalize on the detection of market deviations. In the case of EvE where transaction costs are far higher and markets are slower to react to change, mean and median values are less useful than the monitoring and analysis of trends and change over-time.

With the data we have at our disposal, we should find no trouble performing regression analysis on any number of trade factors in-order to build a more comprehensive station trading strategy.

Outlining our Steps

Before we get started building a new script, let’s draw a rough outline of what our new approach needs to incorporate:

For Volume, we need to use the market history endpoint to determine:

  • Proportion of orders Bought vs orders Sold (Buy vs. Sell bias)
  • Regression analysis of order count, providing a metric for trade frequency
  • Regression analysis of total volume, providing one component for determining profit per day

For Price, we need to use the market history endpoint to determine:

  • Regression analysis of ‘lowest’ and ‘highest’ prices, capturing their corresponding values at the time where we expect each transaction to occur
  • The difference between these prices, sans fees is our ‘real spread’ which represents profit per unit, serving as the second component after volume in determining profit per day

For Competition, we need to use the order book endpoint to determine:

  • Regression analysis of the maximum ‘issued’ timestamp for both Buy and Sell orders, indicating the frequency of order placement and change. This will be used to define our relist fees and moderate our projected trade frequency given the impact of competition.
  • Regression analysis of the maximum Buy order ‘price’ and minimum Sell order ‘price’, indicating movement of the listed prices placed on the open order book. This will be used to moderate the profit potential expectations derived from the market history endpoint and ensure that our projections take into account movement on the market.

The data resulting from these steps can then be combined to provide a more holistic estimation of potential profit per day.

So, let’s get started!

Data Normalization

We’ll start by importing some modules and setting some global variables. These values will help define the quality of data we want from our reports and will be incorporated into our future queries:


### modules
import sqlite3
import pandas as pd
import os
from datetime import date, datetime, timedelta, timezone
import numpy as np
import statistics
import time
import pytz
from sklearn.linear_model import LinearRegression

### define regions to report against
regionlist = ['TheForge', 'Domain', 'Essence', 'Metropolis', 'SinqLaison', 'Heimatar', 'TashMurkon']

### define database location where source data is stored
databaselocation = 'marketdata.sqlite'

### change working directory to script location
pathtorunfile = os.path.dirname(__file__)
os.chdir(pathtorunfile)

### specify in days the span of time used to build our linear regressions
HistoryRegressionSpan = 30
OrderbookRegressionSpan = 1

### specify maximum transaction delta (days). Type_ids will be filtered from our results if the time to clear a Buy and Sell order exceeds this value
MaxTransactionDays = 7

### specify minimum ROI (%). Type_ids will be filtered from our results if ROI falls below this percentage
MinROI = 10

### specify minimum ISK profit per day. Type_ids will be filtered from our results if daily potential profit falls below this quantity
MinProfitPerDay = 1000000

...

### runtime
if __name__ == '__main__':

	for region in regionlist:

Next, in our runtime, we’ll establish our database connections, including an in-memory database for queries performed during this session, as well as a read-only connection to the full database of ESI data we’ve been building using the script from our last article:

## create in-memory database to temporarily store tables used for report generation	
tempdb = sqlite3.connect(":memory:")
cur = tempdb.cursor()

## attach primary database (READ-ONLY)
cur.execute(f"ATTACH 'file:{databaselocation}' AS marketdatadb;")

Our next set of queries include some data normalization routines. We need timestamp data to serve as the X-axis for our regression analysis, however our historical endpoint data makes use of simple dates which need to be converted. Furthermore, dates which have experienced no cleared orders are not represented in our sample and will need to be populated:

## extract unix time from historical data
HistoryUnixTime = pd.read_sql(f"SELECT max(timestamp) as timestamp from {region}History;", tempdb)
HistoryUnixTime = int((HistoryUnixTime['timestamp'].to_list())[0])

## extract unix time from order book data
OrderbookUnixTime = pd.read_sql(f"SELECT DISTINCT max(timestamp) AS maxtimestamp FROM {region};", tempdb)
OrderbookUnixTime = int((OrderbookUnixTime['maxtimestamp'].to_list())[0])

## generate a table containing all unique type_ids and their corresponding minimum date
cur.execute(f"CREATE TABLE TypeIDMinDates AS SELECT type_id, min(date) AS MinDate FROM {region}History GROUP BY type_id;")

## populate HistorySpanDates with dates indicated by HistoryRegressionSpan
cur.execute(f"CREATE TABLE HistorySpanDates AS SELECT DISTINCT date FROM marketdatadb.{region}History ORDER BY date DESC LIMIT {HistoryRegressionSpan};")

## add unix timestamps to HistorySpanDates, which will later serve as the X axis for our linear regressions
HistorySpanDates = pd.read_sql(f"SELECT date FROM HistorySpanDates", tempdb)
HistorySpanDates = list(HistorySpanDates.date)
DateTimestampList = []
DateTimestamp = HistoryUnixTime
for HistoryDate in HistorySpanDates:
	DateTimestampList.append([HistoryDate, DateTimestamp])
	DateTimestamp = DateTimestamp - 86400
DateTimestampList = pd.DataFrame(DateTimestampList, columns = ['date', 'timestamp'])
DateTimestampList.to_sql(f"HistorySpanDates", tempdb, if_exists="replace", index=True)

## generate a cartesian product of all unique type_ids and their span of dates greater than MinDate. Dates less than MinDate are omitted to remove bias against items which have been very recently introduced into the game. Likewise, since we captured MinDate for all history, we're preventing bias favoring items which have been in the game for a long time and were traded recently, but which are otherwise seldomly traded.
cur.execute(f"CREATE TABLE TypeIDAllDates AS SELECT * FROM TypeIDMinDates CROSS JOIN HistorySpanDates WHERE HistorySpanDates.date >= TypeIDMinDates.MinDate;")

## populate HistorySpan with data corresponding with the dates indicated by HistorySpanDates
cur.execute(f"CREATE TABLE HistorySpan AS SELECT average, {region}History.date AS date, highest, lowest, volume, order_count, type_id FROM {region}History, HistorySpanDates WHERE {region}History.date = HistorySpanDates.date;")

## left join our HistorySpan table to the TypeIDdates cartesian product. Dates greater than MinDate containing no data will carry null values for now.
cur.execute(f"CREATE TABLE HistorySpanAllDates AS SELECT TypeIDAllDates.type_id AS type_id, TypeIDAllDates.date AS date, TypeIDAllDates.timestamp AS timestamp, average, highest, lowest, volume, order_count, MinDate FROM TypeIDAllDates LEFT JOIN HistorySpan ON TypeIDAllDates.type_id = HistorySpan.type_id AND TypeIDAllDates.date = HistorySpan.date ORDER BY TypeIDAllDates.type_id;")	

The next set of data normalization routines all serve to correct Buy vs Sell bias, ultimately providing us with BuyVolume, BuyOrders, SellVolume and SellOrders values which will later be used in the creation of linear regressions:

## extrapolate BuyVolume, BuyOrders, SellVolume and SellOrders fields by checking the relative distance between our highest/average/lowest prices. We cannot infer the direction of bias in circumstances where highest = lowest, so in these circumstances volume is counted as BiasedVolume
cur.executescript(f"DROP TABLE IF EXISTS HistorySpan; ALTER TABLE HistorySpanAllDates RENAME TO HistorySpan; ALTER TABLE HistorySpan ADD COLUMN BiasedVolume; ALTER TABLE HistorySpan ADD COLUMN BuyVolume; ALTER TABLE HistorySpan ADD COLUMN SellVolume; ALTER TABLE HistorySpan ADD COLUMN BuyOrders; ALTER TABLE HistorySpan ADD COLUMN SellOrders;")
cur.execute(f"UPDATE HistorySpan SET BuyVolume = volume * (highest - average) / (highest - lowest), BuyOrders = order_count * (highest - average) / (highest - lowest), SellVolume = volume * (average - lowest) / (highest - lowest), SellOrders = order_count * (average - lowest) / (highest - lowest) WHERE (highest > average OR average > lowest);")
cur.execute(f"UPDATE HistorySpan SET BiasedVolume = volume WHERE highest = lowest;")

## infill any volume-related NULL values with 0
cur.executescript(f"UPDATE HistorySpan SET BiasedVolume = 0 WHERE BiasedVolume IS NULL; UPDATE HistorySpan SET BuyVolume = 0 WHERE BuyVolume IS NULL; UPDATE HistorySpan SET SellVolume = 0 WHERE SellVolume IS NULL; UPDATE HistorySpan SET BuyOrders = 0 WHERE BuyOrders IS NULL; UPDATE HistorySpan SET SellOrders = 0 WHERE SellOrders IS NULL; UPDATE HistorySpan SET order_count = 0 WHERE order_count IS NULL;")

## create HistorySpanSum table containing sums for all of our volume types
cur.execute(f"CREATE TABLE HistorySpanSum AS SELECT type_id, sum(BiasedVolume) AS BiasedVolumeSum, sum(BuyVolume) AS BuyVolumeSum, sum(SellVolume) AS SellVolumeSum FROM HistorySpan GROUP BY type_id;")
	
## purge any type_ids which are 100% biased in either Buy or Sell direction
cur.execute(f"DELETE FROM HistorySpanSum WHERE BuyVolumeSum = 0 OR SellVolumeSum = 0;")

## calculate the ratio between buying and selling activity for all days where bias can be inferred. 
cur.executescript(f"ALTER TABLE HistorySpanSum ADD COLUMN BuySellRatio; UPDATE HistorySpanSum SET BuySellRatio = BuyVolumeSum / (BuyVolumeSum + SellVolumeSum);")

## use BuySellRatio to extrapolate volume and order count for all days where bias could not be inferred. The result is our normalized set of historical market data
cur.execute(f"CREATE TABLE HistorySpanNrml AS SELECT BuySellRatio, HistorySpanSum.type_id AS type_id, date, timestamp, average, highest, lowest, volume, order_count, Mindate, BiasedVolume, BuyVolume, SellVolume, BuyOrders, SellOrders FROM HistorySpanSum LEFT JOIN HistorySpan ON HistorySpanSum.type_id = HistorySpan.type_id;")
cur.executescript(f"DROP TABLE HistorySpan; DROP TABLE HistorySpanSum;")
cur.executescript(f"UPDATE HistorySpanNrml SET BuyVolume = BuySellRatio * BiasedVolume WHERE BiasedVolume > 0; UPDATE HistorySpanNrml SET BuyOrders = BuySellRatio * (BiasedVolume / order_count) WHERE BiasedVolume > 0; UPDATE HistorySpanNrml SET SellVolume = (1 - BuySellRatio) * BiasedVolume WHERE BiasedVolume > 0; UPDATE HistorySpanNrml SET SellOrders = (1 - BuySellRatio) * (BiasedVolume / order_count) WHERE BiasedVolume > 0;")

The resulting table is comprised of about 334,000 rows or about 50MB of normalized historical market endpoint data. The quantity of data contained in this table is directly influenced by the HistoryRegressionSpan variable defined at the start of our script which specifies the number of days used for regression analysis against this data set. Currently this is set to 30, which strikes a balance between containing enough data to be statistically useful for regression analysis, while not being so broad as to make our regression calculations slow to react to recent market changes.

With a solid set of normalized historical market data, we can get started churning-out some linear regressions.

Calculating Regressions

While it is theoretically possible to natively use SQLite to calculate the sum of least squares and perform linear regression analysis, in my opinion it isn’t the best tool for the job.

For this exercise, we’ll be using the LinearRegression method provided by the scikit-learn library:

from sklearn.linear_model import LinearRegression

We’ll start by creating some Pandas dataframes and a list derived from our SQLite tables:

## save current set of unique type_ids as a dataframe and list from HistorySpanNrml
TypeID_df = pd.read_sql("SELECT DISTINCT type_id FROM HistorySpanNrml", tempdb)
TypeID_list = list(TypeID_df.type_id)

## save all of our normalized data as a HistoryData dataframe from HistorySpanNrml
HistoryData = pd.read_sql("SELECT * FROM HistorySpanNrml;", tempdb)

We will then feed these into some linear regression functions, with the results being saved to a new ‘regressions’ dataframe:

## linear regression function; returns a LinearRegression() object given a dataframe "HistoryData", X-value and Y-value.
def lrmodel(HistoryData, yvalue):
	y = HistoryData[[yvalue]].values
	X = HistoryData[['timestamp']].values
	return LinearRegression().fit(X, y)

## stages/groups data before feeding it into lrmodel function.	
def lrstaging(HistoryData, yvalue):
	NewRegression = HistoryData.dropna().groupby('type_id').apply(lrmodel, yvalue, include_groups=False).reset_index()
	NewRegression.columns = ['type_id', yvalue]
	return(NewRegression)

...
### runtime
...

## calculate regression slopes for BuyOrders and SellOrders, save results to regressions dataframe
regressions = lrstaging(HistoryData, 'BuyOrders')
regressions = pd.merge(regressions, lrstaging(HistoryData, 'SellOrders'), how="right", on=["type_id", "type_id"])

The resulting ‘regressions’ dataframe looks like this:

       type_id           BuyOrders          SellOrders
0           18  LinearRegression()  LinearRegression()
1           19  LinearRegression()  LinearRegression()
2           20  LinearRegression()  LinearRegression()
3           21  LinearRegression()  LinearRegression()
4           22  LinearRegression()  LinearRegression()
...        ...                 ...                 ...
11207    84009  LinearRegression()  LinearRegression()
11208    84012  LinearRegression()  LinearRegression()
11209    84014  LinearRegression()  LinearRegression()
11210    84015  LinearRegression()  LinearRegression()
11211    84016  LinearRegression()  LinearRegression()

[11212 rows x 3 columns]

## execution time: 14.65 seconds

Each LinearRegression() object represents a fitted slope which can be plotted against the source data in a familiar-looking chart.
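
As a quick visual sanity check, here is a hypothetical plotting sketch (matplotlib is assumed here; it isn’t used elsewhere in our script) for a single type_id:

## hypothetical: scatter the observed daily BuyOrders for one type_id and overlay its fitted regression line
import matplotlib.pyplot as plt

TypeID = 84016  # the same type_id used in the prediction example below
sample = HistoryData[HistoryData['type_id'] == TypeID].dropna()
model = regressions.loc[regressions['type_id'] == TypeID, ['BuyOrders']].values[0][0]

plt.scatter(sample['timestamp'], sample['BuyOrders'], label='observed BuyOrders per day')
plt.plot(sample['timestamp'], model.predict(sample[['timestamp']].values), color='red', label='least-squares fit')
plt.xlabel('unix timestamp')
plt.ylabel('BuyOrders')
plt.legend()
plt.show()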

More importantly, these regression objects can be used to facilitate Y-axis predictions given a target timestamp on the X axis:

## Estimate 'BuyOrders' given current timestamp
BuyOrderFreq = regressions.loc[regressions["type_id"] == 84016, ['BuyOrders']].values[0][0].predict([[OrderbookUnixTime]])[0][0]
print(BuyOrderFreq)
... 3.8284987

## Estimate 'BuyOrders' given current timestamp, plus one day
BuyOrderFreq = regressions.loc[regressions["type_id"] == 84016, ['BuyOrders']].values[0][0].predict([[OrderbookUnixTime + 86400]])[0][0]
print(BuyOrderFreq)
... 2.64221781

By locating a LinearRegression() object on the regressions dataframe corresponding with a specific type_id and column, we can use .predict given a specified unix timestamp to deliver a prediction. In the case of type_id 84016, our regression indicates 3.8 Buy orders being executed per day as-of today. This is then projected to decrease to 2.64 Buy orders at this time tomorrow. The inference we can then make is that it is becoming more difficult to successfully obtain units of this commodity through a Buy order as time continues.

Calculating roughly 22,000 regressions is fairly computationally expensive, at about 15 seconds of runtime on my machine. As we progress through our script, we’ll use this predictive analysis to filter-out type_ids possessing undesirable qualities per our starting variables, permitting subsequent regression and other data modeling computations to consume fewer CPU cycles.

Buy/Sell Delta Modeling

With regressions now calculated and stored for BuyOrders and SellOrders, we can get started modeling trade frequency. Our goal here will be to calculate the approximate time (in seconds) it will take to clear both a Buy and a Sell order, which will be referred to as our ‘BuyOrderDelta’ and ‘SellOrderDelta’ respectively, with the sum of these values providing the total station trading transaction time. Since profit is a function of time, having these values handy will prove invaluable in weighing the profit potential between TypeIDs.

Let’s start with a single line in our runtime:

## feed TypeID_list and regressions data into the Buy/Sell Delta model for data processing
ResultData = pd.DataFrame((miscstaging(TypeID_list, regressions, bsdeltamodel)), columns=['type_id', 'BuyOrderDelta', 'SellOrderDelta'])

The purpose of this line is simple: feed our regression data into a function, and receive deltas in return. Let’s take a look next at the first function it calls, ‘miscstaging’:

## stages/groups regression data before feeding it into the other modeling functions
def miscstaging(TypeID_list, InputData, ModelType):
	IterResult_list = []
	for TypeID in TypeID_list:
		IterResult = ModelType(TypeID, InputData)
		if IterResult != 'none':
			IterResult_list = IterResult_list + IterResult
	return(IterResult_list)

This is a pretty generic function which feeds a single TypeID at a time into a subsequent data modeling function, returning the list of TypeIDs which provided valid results. Let’s move-on to ‘bsdeltamodel’ next, with the important steps explained in its inline comments:

## calculate and return BuyOrder and Sellorder deltas
def bsdeltamodel(TypeID, InputData):
	# calculating the expected frequency of Buy orders at this moment
	BuyOrderFreq = InputData.loc[InputData["type_id"] == TypeID, ['BuyOrders']].values[0][0].predict([[OrderbookUnixTime]])[0][0]
	# If the frequency is greater than zero, dividing 1 by the frequency will give us the current number of days it will take for a Buy order to be executed, converted into a unixtime delta by multiplying by the number of seconds in a day.
	if BuyOrderFreq > 0:
		BuyOrderDelta = (1 / (BuyOrderFreq)) * 86400
	# Since the frequency changes over-time according to the slope of the regression, we also need to take this change into account when calculating our BuyOrder delta. We can do this by taking the average between our frequency now vs the expected frequency in the future. (This isn't a perfect mechanism, but provides a better approximation of our anticipated buy order delta than simply using the result from the current timestamp) 
		FutureBuyOrderFreq = InputData.loc[InputData["type_id"] == TypeID, ['BuyOrders']].values[0][0].predict([[OrderbookUnixTime + BuyOrderDelta]])[0][0]
	else:
		return('none')
	if FutureBuyOrderFreq > 0:
		BuyOrderDelta = (2 / (BuyOrderFreq + FutureBuyOrderFreq)) * 86400
	# This value can then be used to calculate the Sell Order frequency and initial delta...
		SellOrderFreq = InputData.loc[InputData["type_id"] == TypeID, ['SellOrders']].values[0][0].predict([[OrderbookUnixTime + BuyOrderDelta]])[0][0]
	else:
		return('none')
	if SellOrderFreq > 0:
		SellOrderDelta = (1 / (SellOrderFreq)) * 86400
	# ... which also needs to account for its frequency changing over-time
		FutureSellOrderFreq = InputData.loc[InputData["type_id"] == TypeID, ['SellOrders']].values[0][0].predict([[OrderbookUnixTime + BuyOrderDelta + SellOrderDelta]])[0][0]
	else:
		return('none')
	if FutureSellOrderFreq > 0:
		SellOrderDelta = (2 / (SellOrderFreq + FutureSellOrderFreq)) * 86400
	else:
		return('none')
	# If the combined BuyOrderDelta and SellOrderDelta is within our specified MaxTransactionDays, it will be returned to the parent thread.
	if BuyOrderDelta + SellOrderDelta < MaxTransactionDays * 86400:
		return[[TypeID, BuyOrderDelta, SellOrderDelta]]
	return('none')

This function not only serves to calculate our BuyOrder and SellOrder deltas for each TypeID, but it also effectively filters any items experiencing collapsed supply from the returned results. In our case, if a particular item type will take more than 7 days to be purchased and re-sold as-defined by MaxTransactionDays, it is discarded from the results. We can see this in our ResultData, which features about 900 fewer TypeIDs than the source data:

       type_id  BuyOrderDelta  SellOrderDelta
0           18      22.332095       25.151608
1           19     437.609298     7669.424030
2           20     188.135661       86.650253
3           21    1660.470506       40.557074
4           22     473.664503      594.775586
...        ...            ...             ...

[10344 rows x 3 columns]

Setting a time threshold is important given the limitations of linear regression analysis; as time moves forward, our forecasting suffers greater error and it becomes prudent to omit results which depend on projections reaching too far into the future. While we could remediate this by expanding our seed data from the market history endpoint beyond 30 days, this would come at the expense of recency which we need to maintain for effective station trading; striking a balance between these different factors is something to consider with all data modeling.

Since our ResultData now contains fewer TypeIDs, we will continue progressing through our runtime by purging the omitted TypeIDs from our other data sets before moving-on to our next set of regressions:

## refresh TypeID list and purge unwanted TypeIDs from HistoryData and regressions
TypeID_df = pd.DataFrame(ResultData.type_id.unique(), columns=['type_id'])
TypeID_list = list(TypeID_df.type_id)
HistoryData = pd.merge(HistoryData, TypeID_df, how="right", on=["type_id", "type_id"])
regressions = pd.merge(regressions, TypeID_df, how="right", on=["type_id", "type_id"])

Price Regressions and Modeling

The next regressions will be made against the ‘lowest’ and ‘highest’ prices indicated in our market history. We can calculate these by specifying the right variables and calling our lrmodel function again:

## calculate regression slopes for highest and lowest prices present in HistoryData
regressions = pd.merge(regressions, lrstaging(HistoryData, 'lowest'), how="right", on=["type_id", "type_id"])
regressions = pd.merge(regressions, lrstaging(HistoryData, 'highest'), how="right", on=["type_id", "type_id"])

Merging the results allows our ‘regressions’ dataframe to serve as a growing repository of regression slopes to service any future use. After calculating our ‘lowest’ and ‘highest’ regressions, we can perform some more data modeling:

## calculate per-unit BuyPrice/SellPrice/spread using our regression data and BuyOrder/SellOrder deltas
InputData = pd.merge(ResultData[['type_id','BuyOrderDelta','SellOrderDelta']],regressions[['type_id','lowest','highest']],on='type_id', how='left')
OutputData = pd.DataFrame((miscstaging(TypeID_list, InputData, bsspreadmodel)), columns=['type_id', 'BuyPrice', 'SellPrice', 'RealSpread'])
ResultData = pd.merge(ResultData, OutputData, how="right", on=["type_id", "type_id"])

Which calls the ‘bsspreadmodel’ function:

## calculate and return SellPrice and BuyPrice, given BuyOrder and SellOrder deltas
def bsspreadmodel(TypeID, InputData):
	# Grab BuyOrderDelta and SellOrderDelta values 
	BuyOrderDelta = InputData.loc[InputData["type_id"] == TypeID, ['BuyOrderDelta']].values[0][0]
	SellOrderDelta = InputData.loc[InputData["type_id"] == TypeID, ['SellOrderDelta']].values[0][0]
	# Plug these values into regression slope for BuyPrice (lowest) and SellPrice (highest)
	BuyPrice = InputData.loc[InputData["type_id"] == TypeID, ['lowest']].values[0][0].predict([[OrderbookUnixTime + BuyOrderDelta]])[0][0]
	SellPrice = InputData.loc[InputData["type_id"] == TypeID, ['highest']].values[0][0].predict([[OrderbookUnixTime + BuyOrderDelta + SellOrderDelta]])[0][0]
	# Spread is Sell - Buy - Fees
	RealSpread = SellPrice - BuyPrice - ((BuyPrice * 0.015) + (SellPrice * 0.051))
	# Only return result if there is more than zero profit potential and if ROI is greater than our set global threshold
	if RealSpread > 0 and SellPrice > 0 and (RealSpread / ((BuyPrice * 1.015) + (SellPrice * 0.051))) * 100 > MinROI:
		return[[TypeID, BuyPrice, SellPrice, RealSpread]]
	else:
		return('none')

If we print ResultData now, we’ll see that our scope of TypeIDs has narrowed even further, with about 7,000 items having been filtered-out for providing no returns or inadequate ROI:

      type_id  BuyOrderDelta  SellOrderDelta      BuyPrice     SellPrice    RealSpread
0          41      25.753123       21.042844  3.237464e+01  6.761021e+01  3.130184e+01
1          45    6243.415450    14385.714468  4.261559e+01  1.102067e+02  6.133130e+01
2         185     172.553692       55.942094  4.112389e+01  6.182883e+01  1.693481e+01
3         189      71.734504       59.853684  5.811641e+01  7.104128e+01  8.430010e+00
4         200      27.026451       32.162636  2.168323e+02  2.613782e+02  2.796315e+01
...       ...            ...             ...           ...           ...           ...

[3324 rows x 6 columns]

And with some more clean-up, we’re ready to move onto the next set of regression calculations and data modeling routines:

## refresh TypeID list and purge unwanted TypeIDs from HistoryData and regressions
TypeID_df = pd.DataFrame(ResultData.type_id.unique(), columns=['type_id'])
TypeID_list = list(TypeID_df.type_id)
HistoryData = pd.merge(HistoryData, TypeID_df, how="right", on=["type_id", "type_id"])
regressions = pd.merge(regressions, TypeID_df, how="right", on=["type_id", "type_id"])

Volume Regression and Modeling

Where the order count value provides us with a means of measuring trade frequency, volume provides unit quantity. While there are many items which feature nearly-matching order count and volume, many in-game commodities such as ammunition are typically bought and sold by the tens, hundreds, or thousands of units per order. Taking a regression of volume per order will allow us to estimate how many units we can expect to exchange in a typical trade, which will serve as an important factor when calculating potential profit.

We’ll start by creating a new column, ‘VolPerOrder’, defined as volume divided by order count, which we will then feed into our linear regression function. The resulting regression slopes, along with some of our past result data, will then be fed into our next modeling function:

## calculate VolPerOrder and generate a regression against this value
HistoryData['VolPerOrder'] = HistoryData['volume'] / HistoryData['order_count']
regressions = pd.merge(regressions, lrstaging(HistoryData, 'VolPerOrder'), how="right", on=["type_id", "type_id"])

## use VolPerOrder regression to calculate potential daily profit and filter low-yield type_ids
InputData = pd.merge(ResultData,regressions[['type_id','VolPerOrder']],on='type_id', how='left')
OutputData = pd.DataFrame((miscstaging(TypeID_list, InputData, potentialprofitmodel)), columns=['type_id', 'PotentialDailyProfit', 'VolPerOrder'])
ResultData = pd.merge(ResultData, OutputData, how="right", on=["type_id", "type_id"])

The ‘potentialprofitmodel’ function looks like this:

## calculate and return ProfitPerDay given anticipated unit volume traded per order
def potentialprofitmodel(TypeID, InputData):
	# Grab BuyOrderDelta, SellOrderDelta, and RealSpread values.
	BuyOrderDelta = InputData.loc[InputData["type_id"] == TypeID, ['BuyOrderDelta']].values[0][0]
	SellOrderDelta = InputData.loc[InputData["type_id"] == TypeID, ['SellOrderDelta']].values[0][0]
	RealSpread = InputData.loc[InputData["type_id"] == TypeID, ['RealSpread']].values[0][0]
	# Use VolPerOrder regression to determine unit volume obtained at the time of purchase
	VolPerOrder = InputData.loc[InputData["type_id"] == TypeID, ['VolPerOrder']].values[0][0].predict([[OrderbookUnixTime + BuyOrderDelta]])[0][0]
	# Calculate profit per order and profit per day.
	ProfitPerOrder = RealSpread * VolPerOrder
	PotentialDailyProfit = (ProfitPerOrder / (BuyOrderDelta + SellOrderDelta)) * 86400
	# If profit per day exceeds our global threshold, pass the result.
	if PotentialDailyProfit > MinProfitPerDay:
		return[[TypeID, PotentialDailyProfit, VolPerOrder]]
	else:
		return('none')

The resulting data, once again, has narrowed further:

      type_id  BuyOrderDelta  SellOrderDelta  ...     SellPrice    RealSpread    PotentialDailyProfit
0          41      25.753123       21.042844  ...  6.761021e+01  3.130184e+01   575285720.3007993
1         185     172.553692       55.942094  ...  6.182883e+01  1.693481e+01  26568971.437229034
2         189      71.734504       59.853684  ...  7.104128e+01  8.430010e+00  18369579.268516026
3         200      27.026451       32.162636  ...  2.613782e+02  2.796315e+01   380200032.8467175
4         206     295.960658      136.549795  ...  5.653936e+01  1.076786e+01   8890505.339150857
...       ...            ...             ...  ...           ...           ...                     ...

[1778 rows x 7 columns]

With some additional clean-up, we are now ready to get started moving-on to our orderbook data in-order to continue our regression analysis and modeling:

## refresh TypeID list and purge unwanted TypeIDs from HistoryData and regressions
TypeID_df = pd.DataFrame(ResultData.type_id.unique(), columns=['type_id'])
TypeID_list = list(TypeID_df.type_id)
HistoryData = pd.merge(HistoryData, TypeID_df, how="right", on=["type_id", "type_id"])
regressions = pd.merge(regressions, TypeID_df, how="right", on=["type_id", "type_id"])

Competition Regressions and Modeling

Up-until this point, we have been working solely with the historical market endpoint data. There has been good reason for this; with our variable OrderbookRegressionSpan set to 1 (day) and with 20GB of order book data spanning 10 days at our disposal, accessing the order book for all TypeIDs would involve parsing 2GB of data. This is in-contrast to the historical market endpoint data which only comprises 50MB of data given the 30 days defined by the variable HistoryRegressionSpan. The result is that it is far more efficient to start our data modeling process with the smaller dataset in-order to narrow our scope of desired TypeIDs before moving-on to the more granular and detailed data provided by the order book endpoint. Instead of parsing 2GB of data, since our scope has narrowed to just 1/10th the total TypeIDs, we now only need to parse approximately 200MB of data. By adhering to this order of operations, we can help cut-down our script runtime, permitting more rapid delivery of new reports.

We rely on the ‘issued’ field provided by the order book data in-order to model the impact of competition on station trading; ‘issued’ indicates the precise date/time a unique order ID on the order book was either first issued or re-listed on the market. The difference between ‘OrderbookUnixTime’ (current time) and ‘issued’ is the ‘IssuedDelta’. A small IssuedDelta is indicative of strong competition, while a large IssuedDelta is indicative of weak competition.

Furthermore, the orderbook differentiates between Buy and Sell orders, so these deltas need to be separated into ‘BuyIssuedDelta’ and ‘SellIssuedDelta’ respectively. The following set of routines accomplishes all of our desired data handling, returning two dataframes containing our IssuedDeltas:

## save TypeID list as sql temp table
TypeID_df.to_sql(f"typeid_list", tempdb, if_exists="replace", index=True)
## create dataframe containing the most recently issued order for each unique combination of timestamp, type_id and is_buy_order. Range of timestamps is dictated by OrderbookRegressionSpan (in days)
OrderbookMaxIssued = pd.read_sql(f"SELECT timestamp, {region}.type_id AS type_id, max(issued) AS maxissued, is_buy_order FROM {region}, typeid_list WHERE typeid_list.type_id = {region}.type_id AND timestamp >= {OrderbookUnixTime} - ({OrderbookRegressionSpan} * 86400) GROUP BY timestamp, {region}.type_id, is_buy_order", tempdb)
## convert datetime to unix timestamp
OrderbookMaxIssued['maxissued'] = pd.to_datetime(OrderbookMaxIssued['maxissued'])
OrderbookMaxIssued['maxissued'] = OrderbookMaxIssued['maxissued'].astype(int)
OrderbookMaxIssued['maxissued'] = OrderbookMaxIssued['maxissued'].div(10**9)
## IssuedDelta is the approximate time (in seconds) since the most recent order was placed, which we will use for the Y axis coordinate in our competition regression.
OrderbookMaxIssued['maxissued'] = OrderbookMaxIssued['timestamp'] - OrderbookMaxIssued['maxissued']
		
## split, rename IssuedDelta columns
OrderbookMaxIssuedSell = OrderbookMaxIssued[OrderbookMaxIssued.is_buy_order == 0]
OrderbookMaxIssuedSell.columns = OrderbookMaxIssuedSell.columns.str.replace('maxissued', 'SellIssuedDelta')
OrderbookMaxIssuedBuy = OrderbookMaxIssued[OrderbookMaxIssued.is_buy_order == 1]
OrderbookMaxIssuedBuy.columns = OrderbookMaxIssuedBuy.columns.str.replace('maxissued', 'BuyIssuedDelta')

Since IssuedDeltas can change over-time, with competition growing stronger or weaker as interest grows or wanes in the market of particular goods, we should build yet another set of regression slopes to use in our modeling:

## generate regressions for IssuedBuy and IssuedSell deltas
regressions = pd.merge(regressions, lrstaging(OrderbookMaxIssuedSell, 'SellIssuedDelta'), how="left", on=["type_id", "type_id"])
regressions = pd.merge(regressions, lrstaging(OrderbookMaxIssuedBuy, 'BuyIssuedDelta'), how="left", on=["type_id", "type_id"])
				
## clear OrderbookMax values from memory
del OrderbookMaxIssued
del OrderbookMaxIssuedSell
del OrderbookMaxIssuedBuy

Once the regression slopes are built, we can purge the corresponding dataframes from memory as they’ll no longer be needed. We’ll then feed these and other regressions, along with some existing data into a new modeling function:

## use BuyIssued and SellIssued regressions to recalculate BuyOrderDelta, SellOrderDelta, lowest and highest prices, taking into account the crowding effect of competition.
InputData = pd.merge(regressions[['type_id','SellIssuedDelta','BuyIssuedDelta','lowest','highest']],ResultData[['type_id','BuyOrderDelta','SellOrderDelta']],on='type_id', how='left')
OutputData = pd.DataFrame((miscstaging(TypeID_list, InputData, issuedmodel)), columns=['type_id', 'BuyOrderDelta', 'SellOrderDelta', 'BuyPrice', 'SellPrice', 'RealSpread', 'BuyRelistCount', 'SellRelistCount'])

The ‘issuedmodel’ function called by this block looks like this:

## calculate and return revised BuyOrderDelta, SellOrderDelta, BuyPrice, SellPrice and RealSpread, given incumbent competition vying for the spread.
def issuedmodel(TypeID, InputData):
	# Grab OrderDeltas and IssuedReg values
	BuyOrderDelta = InputData.loc[InputData["type_id"] == TypeID, ['BuyOrderDelta']].values[0][0]
	SellOrderDelta = InputData.loc[InputData["type_id"] == TypeID, ['SellOrderDelta']].values[0][0]
	SellIssuedReg = InputData.loc[InputData["type_id"] == TypeID, ['SellIssuedDelta']].values[0][0]
	BuyIssuedReg = InputData.loc[InputData["type_id"] == TypeID, ['BuyIssuedDelta']].values[0][0]
	# If there is no Buy competition, BuyIssuedReg will return null. In this case, BuyOrderDelta will remain unchanged.
	if str(BuyIssuedReg) == 'nan':
		BuyIssuedDelta = BuyOrderDelta
	else:
	# If BuyIssuedReg is not null, we will predict BuyIssuedDelta given our existing BuyOrderDelta
		BuyIssuedDelta = InputData.loc[InputData["type_id"] == TypeID, ['BuyIssuedDelta']].values[0][0].predict([[OrderbookUnixTime + BuyOrderDelta]])[0][0]
	# If there is no Sell competition, SellIssuedReg will return null. In this case, SellOrderDelta will remain unchanged.
	if str(SellIssuedReg) == 'nan':
		SellIssuedDelta = SellOrderDelta
	else:
	# If SellIssuedReg is not null, we will predict SellIssuedDelta given our existing SellOrderDelta
		SellIssuedDelta = InputData.loc[InputData["type_id"] == TypeID, ['SellIssuedDelta']].values[0][0].predict([[OrderbookUnixTime + BuyOrderDelta + SellOrderDelta]])[0][0]
	# If either SellIssuedDelta or BuyIssuedDelta is projected to be less than 0, this indicates that competition is tending toward completely stifling any participation and this TypeID should be dropped
	if SellIssuedDelta > 0 and BuyIssuedDelta > 0:
		# If OrderDelta exceeds IssuedDelta, this indicates that relisting will be necessary in-order to complete a sale.
		BuyRelistCount = max(0, (BuyOrderDelta / BuyIssuedDelta) - 1)
		SellRelistCount = max(0, (SellOrderDelta / SellIssuedDelta) - 1)
		# Likewise, if OrderDelta exceeds IssuedDelta, our OrderDelta will grow in proportion to the ratio of OrderDelta to IssuedDelta; the more relisting we have to do, the longer a successful trade takes to execute, which results in a larger effective order delta.
		BuyOrderDelta = max(1, (BuyOrderDelta / BuyIssuedDelta)) * (BuyOrderDelta)
		SellOrderDelta = max(1, (SellOrderDelta / SellIssuedDelta)) * (SellOrderDelta)
		# With our OrderDeltas having possibly been changed, we need to recalculate our BuyPrice and SellPrice because our transactions are now occurring on a different part of the timestamp X-axis.
		BuyPrice = InputData.loc[InputData["type_id"] == TypeID, ['lowest']].values[0][0].predict([[OrderbookUnixTime + BuyOrderDelta]])[0][0]
		SellPrice = InputData.loc[InputData["type_id"] == TypeID, ['highest']].values[0][0].predict([[OrderbookUnixTime + BuyOrderDelta + SellOrderDelta]])[0][0]
		# RealSpread now takes into account our RelistCounts, since each relist incurs a minimum fee of 20% of the applicable brokerage fee.
		RealSpread = SellPrice - BuyPrice - (BuyPrice * 0.015) - (SellPrice * 0.051) - (BuyRelistCount * 0.003) - (SellRelistCount * 0.003)
		# TypeIDs failing to meet our standards for potential profit, ROI, and maximum transaction days will all be filtered from our results.
		pctROI = (RealSpread / ((BuyPrice * 1.015) + (SellPrice * 0.051) + (BuyRelistCount * 0.003) + (SellRelistCount * 0.003))) * 100
		if RealSpread > 0 and SellPrice > 0 and pctROI > MinROI and BuyOrderDelta + SellOrderDelta < 86400 * MaxTransactionDays:
			return[[TypeID, BuyOrderDelta, SellOrderDelta, BuyPrice, SellPrice, RealSpread, BuyRelistCount, SellRelistCount]]
		else:
			return('none')
	else:
		return('none')

In-summary, when competition causes the orderbook to change faster than transactions can be executed, the corresponding order delta increases. This change in the delta then forces us to recalculate price, since a sale is now occurring at a later point in-time than we originally anticipated before factoring-in competition.
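
To make the relist arithmetic concrete, here is a small worked example using made-up values (in the script these come from our regressions):

## hypothetical example: our projected Buy takes 2 days to fill, but competitors refresh the top of the book roughly once per day
BuyOrderDelta = 2 * 86400                                                 # 172,800 seconds in an uncontested market
BuyIssuedDelta = 1 * 86400                                                # competing Buy orders are re-listed about daily
BuyRelistCount = max(0, (BuyOrderDelta / BuyIssuedDelta) - 1)             # 1 expected relist
BuyOrderDelta = max(1, (BuyOrderDelta / BuyIssuedDelta)) * BuyOrderDelta  # 345,600 seconds (4 days) once competition is factored-in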

Before we merge our competition-adjusted results into our ResultData dataframe, we’ll need to purge our old data first, after which we can do some clean-up before moving-on to our next section:

ResultData = ResultData.drop(columns=['BuyOrderDelta', 'SellOrderDelta', 'BuyPrice', 'SellPrice', 'RealSpread'])
ResultData = pd.merge(ResultData, OutputData, how="right", on=["type_id", "type_id"])

## refresh TypeID list and purge unwanted TypeIDs from HistoryData and regressions
TypeID_df = pd.DataFrame(ResultData.type_id.unique(), columns=['type_id'])
TypeID_list = list(TypeID_df.type_id)
HistoryData = pd.merge(HistoryData, TypeID_df, how="right", on=["type_id", "type_id"])
regressions = pd.merge(regressions, TypeID_df, how="right", on=["type_id", "type_id"])

Orderbook Price Regressions and Modeling

Up-until this point, we have been using the difference between the ‘lowest’ and ‘highest’ prices provided by the historical endpoint data to calculate the spread used when calculating profit. This is in-contrast to our original approach which exclusively used ‘price’ from the order book data, taking the difference between the maximum price from Buy orders and minimum price from Sell orders to calculate the spread.

While we’ve established through our testing that deriving the spread from the historical endpoint data is a more robust approach for calculating profit potential, price data from the order book can still provide some moderating utility. For instance, while our modeling may indicate a specific purchase and sale price, if the order book features a maximum Buy or minimum Sell price which falls inside our projected spread, we cannot take our result at face value, since the state of the order book defines the market we can presently interact with. Additionally, it is possible for the spread defined by the order book to exist entirely outside the scope defined by the historical endpoint. In circumstances where current or projected order book pricing far exceeds our historical endpoint regression predictions, it is prudent to drop said results from our final report, as the market likely reflects inflated pricing which we should not engage with.
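
As a quick illustration of this moderation using made-up prices (the maxbuyminsellmodel function later in this section applies the same clamp against both the regression-predicted and the current order book values):

## clamping regression-derived targets against the standing order book, using hypothetical prices
BuyPrice, SellPrice = 100.0, 180.0                       # targets derived from the historical endpoint regressions
CurrentMaxBuyPrice, CurrentMinSellPrice = 120.0, 170.0   # best bid / best ask currently on the book
BuyPrice = max(BuyPrice, CurrentMaxBuyPrice)             # we must outbid the best bid, so our effective BuyPrice becomes 120
SellPrice = min(SellPrice, CurrentMinSellPrice)          # we must undercut the best ask, so our effective SellPrice becomes 170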

With this in-mind, let’s get started building more regressions using the order book data:

## save TypeID list as sql temp table
TypeID_df.to_sql(f"typeid_list", tempdb, if_exists="replace", index=True)
		
## create dataframes containing maximum buy price and minimum sell price for each type_id per timestamp. Range of timestamps is dictated by OrderbookRegressionSpan (in days)
OrderbookMaxBuyPrice = pd.read_sql(f"SELECT timestamp, {region}.type_id AS type_id, max(price) AS price FROM {region}, typeid_list WHERE is_buy_order = 1 AND typeid_list.type_id = {region}.type_id AND timestamp >= {OrderbookUnixTime} - ({OrderbookRegressionSpan} * 86400) GROUP BY timestamp, {region}.type_id", tempdb)
OrderbookMinSellPrice = pd.read_sql(f"SELECT timestamp, {region}.type_id AS type_id, min(price) AS price FROM {region}, typeid_list WHERE is_buy_order = 0 AND typeid_list.type_id = {region}.type_id AND timestamp >= {OrderbookUnixTime} - ({OrderbookRegressionSpan} * 86400) GROUP BY timestamp, {region}.type_id", tempdb)
		
## generate regressions for MaxBuy and MinSell prices
regressions = pd.merge(regressions, lrstaging(OrderbookMaxBuyPrice, 'price'), how="left", on=["type_id", "type_id"])
regressions = regressions.rename(columns={"price": "MaxBuyPrice"})
regressions = pd.merge(regressions, lrstaging(OrderbookMinSellPrice, 'price'), how="left", on=["type_id", "type_id"])
regressions = regressions.rename(columns={"price": "MinSellPrice"})

## clear OrderbookMinMax values from memory
del OrderbookMaxBuyPrice
del OrderbookMinSellPrice

In-addition to taking these regressions, we’ll also grab the latest Min/Max prices from the order book in their plain form, saved to our ResultData as ‘CurrentMaxBuyPrice’ and ‘CurrentMinSellPrice’:

# obtain latest max/min prices, splice to ResultData
OrderbookMaxBuyPrice = pd.read_sql(f"SELECT timestamp, {region}.type_id AS type_id, max(price) AS price FROM {region}, typeid_list WHERE is_buy_order = 1 AND typeid_list.type_id = {region}.type_id AND timestamp = {OrderbookUnixTime} GROUP BY {region}.type_id", tempdb)
OrderbookMinSellPrice = pd.read_sql(f"SELECT timestamp, {region}.type_id AS type_id, min(price) AS price FROM {region}, typeid_list WHERE is_buy_order = 0 AND typeid_list.type_id = {region}.type_id AND timestamp = {OrderbookUnixTime} GROUP BY {region}.type_id", tempdb)
ResultData = pd.merge(ResultData,OrderbookMaxBuyPrice[['type_id','price']], how="left", on=["type_id", "type_id"])
ResultData = ResultData.rename(columns={'price': 'CurrentMaxBuyPrice'})
ResultData = pd.merge(ResultData,OrderbookMinSellPrice[['type_id','price']], how="left", on=["type_id", "type_id"])
ResultData = ResultData.rename(columns={'price': 'CurrentMinSellPrice'})
		
## clear OrderbookMinMax values from memory
del OrderbookMaxBuyPrice
del OrderbookMinSellPrice

We’ll use these new MaxBuyPrice and MinSellPrice regressions, as well as our static current price data to feed a new modeling function:

## use MaxBuyPrice and MinSellPrice regressions to adjust price values given position/movement of the spread on the order book
InputData = pd.merge(regressions[['type_id','MaxBuyPrice','MinSellPrice']],ResultData,on='type_id', how='left')
OutputData = pd.DataFrame((miscstaging(TypeID_list, InputData, maxbuyminsellmodel)), columns=['type_id', 'BuyPrice', 'SellPrice', 'RealSpread', 'pctROI'])
ResultData = ResultData.drop(columns=['BuyPrice', 'SellPrice', 'RealSpread'])
ResultData = pd.merge(ResultData, OutputData, how="right", on=["type_id", "type_id"])

The function ‘maxbuyminsellmodel’ looks like this:

## calculate and return revised BuyPrice, SellPrice and RealSpread given MinSell/MaxBuy movement/behavior on the order book.
def maxbuyminsellmodel(TypeID, InputData):
	# pull needed values from InputData
	BuyOrderDelta = InputData.loc[InputData["type_id"] == TypeID, ['BuyOrderDelta']].values[0][0]
	SellOrderDelta = InputData.loc[InputData["type_id"] == TypeID, ['SellOrderDelta']].values[0][0]
	BuyPrice = InputData.loc[InputData["type_id"] == TypeID, ['BuyPrice']].values[0][0]
	SellPrice = InputData.loc[InputData["type_id"] == TypeID, ['SellPrice']].values[0][0]
	RealSpread = InputData.loc[InputData["type_id"] == TypeID, ['RealSpread']].values[0][0]
	MaxBuyPrice = InputData.loc[InputData["type_id"] == TypeID, ['MaxBuyPrice']].values[0][0]
	BuyRelistCount = InputData.loc[InputData["type_id"] == TypeID, ['BuyRelistCount']].values[0][0]
	SellRelistCount = InputData.loc[InputData["type_id"] == TypeID, ['SellRelistCount']].values[0][0]
	CurrentMaxBuyPrice = InputData.loc[InputData["type_id"] == TypeID, ['CurrentMaxBuyPrice']].values[0][0]
	CurrentMinSellPrice = InputData.loc[InputData["type_id"] == TypeID, ['CurrentMinSellPrice']].values[0][0]
	# If there are no recent Buy orders, MaxBuyPrice will be null. In this circumstance, MaxBuyPrice is effectively 0.
	if str(MaxBuyPrice) == 'nan':
		MaxBuyPrice = 0
	# If there are Buy orders, we'll use the regression to determine MaxBuyPrice at our BuyOrderDelta
	else:
		MaxBuyPrice = InputData.loc[InputData["type_id"] == TypeID, ['MaxBuyPrice']].values[0][0].predict([[OrderbookUnixTime + BuyOrderDelta]])[0][0]
	MinSellPrice = InputData.loc[InputData["type_id"] == TypeID, ['MinSellPrice']].values[0][0]
	# Likewise, if there are no recent Sell orders, MinSellPrice will be null. In this circumstance, we'll simply set it equal to the SellPrice defined by the historical endpoint data.
	if str(MinSellPrice) == 'nan':
		MinSellPrice = SellPrice
	# If there are Sell orders, we'll use the regression to determine MinSellPrice at our BuyOrderDelta + SellOrderDelta
	else:
		MinSellPrice = InputData.loc[InputData["type_id"] == TypeID, ['MinSellPrice']].values[0][0].predict([[OrderbookUnixTime + BuyOrderDelta + SellOrderDelta]])[0][0]
	# Price moderation; if the MaxBuyPrice regression exceeds BuyPrice, we need to set BuyPrice to MaxBuyPrice since it is impossible to obtain unit quantity below the maximum value indicated by the market. By the same token, if SellPrice exceeds MinSellPrice, we need to set SellPrice to MinSellPrice since it is impossible to sell unit quantity above the minimum price set by the market.
	if MaxBuyPrice > BuyPrice:
		BuyPrice = MaxBuyPrice
	if MinSellPrice < SellPrice:
		SellPrice = MinSellPrice
	# We also need to perform the same price moderation against current order book prices; regardless of what our regressions predict, it is unhelpful if we go straight to the market and find that current prices are unfavorable.
	if CurrentMaxBuyPrice > BuyPrice:
		BuyPrice = CurrentMaxBuyPrice
	if CurrentMinSellPrice < SellPrice:
		SellPrice = CurrentMinSellPrice
	# Recalculate RealSpread and pctROI given adjusted SellPrice/BuyPrice values
	RealSpread = SellPrice - BuyPrice - (BuyPrice * 0.015) - (SellPrice * 0.051) - (BuyRelistCount * 0.003) - (SellRelistCount * 0.003)
	pctROI = (RealSpread / ((BuyPrice * 1.015) + (SellPrice * 0.051) + (BuyRelistCount * 0.003) + (SellRelistCount * 0.003))) * 100
	if RealSpread > 0 and SellPrice > 0 and pctROI > MinROI:
		return[[TypeID, BuyPrice, SellPrice, RealSpread, pctROI]]
	else:
		return('none')

The result includes new BuyPrice, SellPrice, spread, and ROI values which all take into account both the movement and present state of order book pricing.

Final Steps

Since our price data has changed and additional TypeIDs have been filtered, we can do some clean-up and run our data through the potential profit model a second time:

## refresh TypeID list and purge unwanted TypeIDs from HistoryData and regressions
TypeID_df = pd.DataFrame(ResultData.type_id.unique(), columns=['type_id'])
TypeID_list = list(TypeID_df.type_id)
HistoryData = pd.merge(HistoryData, TypeID_df, how="right", on=["type_id", "type_id"])
regressions = pd.merge(regressions, TypeID_df, how="right", on=["type_id", "type_id"])
## save TypeID list as sql temp table
TypeID_df.to_sql(f"typeid_list", tempdb, if_exists="replace", index=True)
		
## recalculate potential daily profit, given adjusted price values
ResultData = ResultData.drop(columns=['PotentialDailyProfit', 'VolPerOrder'])
InputData = pd.merge(regressions[['type_id','VolPerOrder']],ResultData,on='type_id', how='left')
OutputData = pd.DataFrame((miscstaging(TypeID_list, InputData, potentialprofitmodel)), columns=['type_id','PotentialDailyProfit','VolPerOrder'])
ResultData = pd.merge(ResultData, OutputData, how="right", on=["type_id", "type_id"])

At this point, we are effectively done with all regression building and modeling; we now just need to neatly format our results for human consumption.

First, let’s use our InvTypes table to attach English item names to our ResultData:

# attach English item names from InvTypes
cur.execute(f"CREATE TABLE typenames AS SELECT typeID as type_id, typeName AS name FROM InvTypes;")
TypeID_names = pd.read_sql("SELECT * FROM typenames", tempdb)
ResultData = pd.merge(ResultData, TypeID_names, how="inner", on=["type_id", "type_id"])

Next, we’ll run some basic arithmetic operations to consolidate some data points. ‘TotalDelta’ is our combined BuyOrder and SellOrder deltas, divided by 60 to provide a result in minutes rather than seconds, which should be more readable. Likewise, we are also combining our BuyRelistCount and SellRelistCount into a unified ‘RelistCt’ value:

# misc arithmetic for end-user consumption.
ResultData['TotalDelta'] = (ResultData['BuyOrderDelta'] + ResultData['SellOrderDelta']) / 60
ResultData['RelistCt'] = ResultData['BuyRelistCount'] + ResultData['SellRelistCount']

Next, we need to identify our desired report fields. Let’s run-through and identify all of the data points we’ll need from our ResultData dataframe for station trading:

  • The English name and type_id identify what to trade
  • PotentialDailyProfit and pctROI represent the value proposition of a trade
  • BuyPrice and SellPrice indicate the target purchase and sale price of a trade
  • VolPerOrder indicates the trade volume needed to attain our PotentialDailyProfit
  • TotalDelta indicates the trade duration
  • RelistCt indicates trade effort

While it is possible that there may be other data points we could add to this list, these fields are all perfectly adequate within the context of station trading in EvE.

So, let’s filter our table by these values. We’ll also set our default sorting to PotentialDailyProfit, since trade value is the factor we care the most about. If other users desire an alternate arrangement, our published table will provide custom sorting options:

# sort data in our desired sequence
ResultData = ResultData[['name', 'type_id', 'PotentialDailyProfit', 'pctROI', 'BuyPrice', 'SellPrice', 'VolPerOrder', 'RealSpread', 'TotalDelta', 'RelistCt']]
ResultData = ResultData.sort_values(by='PotentialDailyProfit', ascending=False)

Decluttering also involves removing trailing digits; none of our values require float precision, so all decimal values can be chopped-off to improve readability. The following lines convert our results into integers:

# convert all types to integer
ResultData['PotentialDailyProfit'] = np.floor(pd.to_numeric(ResultData['PotentialDailyProfit'], errors='coerce')).astype('Int64')
ResultData['pctROI'] = np.floor(pd.to_numeric(ResultData['pctROI'], errors='coerce')).astype('Int64')
ResultData['BuyPrice'] = np.floor(pd.to_numeric(ResultData['BuyPrice'], errors='coerce')).astype('Int64')
ResultData['SellPrice'] = np.floor(pd.to_numeric(ResultData['SellPrice'], errors='coerce')).astype('Int64') 
ResultData['RealSpread'] = np.floor(pd.to_numeric(ResultData['RealSpread'], errors='coerce')).astype('Int64')
ResultData['TotalDelta'] = np.floor(pd.to_numeric(ResultData['TotalDelta'], errors='coerce')).astype('Int64')
ResultData['RelistCt'] = np.floor(pd.to_numeric(ResultData['RelistCt'], errors='coerce')).astype('Int64')

This report is now being published on the Projects page as “Station Trading” for all regions with results being automatically refreshed approximately once every hour. For those interested in tweaking and running this script themselves, StationTradingScript.py has also been published; if any changes are made in the future, I’ll be sure to include an addendum.

More Testing Observations and Closing Notes

After testing our new reporting schema in-game, it appears to have overcome all of the limitations of the old schema, with reported items trading far more consistently and with profit projections being far more accurate.

However, there are still some issues to note:

Volatility

Despite favorable report results, I’ve noticed that some classes of commodities experience a greater degree of volatility than others, with SKINs, clothing, and other cosmetic items all suffering extreme price fluctuations. While it is possible to make a lot of ISK trading these items, it is also easy to lose a lot of ISK following a crash.

One aspect of linear regression analysis which we did not cover in this article involves the use of the error term, or residual. Many of the regression slopes we’ve calculated probably don’t ‘fit’ the provided data very well, which is to say that they have large residuals. Despite this, our modeling interprets the slope of each regression at face value without taking the error term into account. A more nuanced reporting strategy would take the variance of the residuals into account, weighing our report output against the size and heteroskedasticity of each regression’s residuals.
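
As a rough sketch of what that could look like — assuming, as the .predict() calls throughout our script imply, that each regression cell holds a fitted scikit-learn LinearRegression and that we retain access to the X/y arrays it was trained on — we could score each fit and discount or drop poorly-fitting results:

## sketch: score a regression's goodness-of-fit so report output can be weighted or filtered by it
import numpy as np

def fit_quality(model, X, y):
	# assumes 'model' is a fitted scikit-learn LinearRegression and X/y are the arrays it was fit on
	# R^2 near 1 means the line explains most of the variance; near 0 means the slope is barely better than predicting the mean, so its forecasts deserve less weight
	r_squared = model.score(X, y)
	# residual standard deviation, in the units of y, usable as an absolute error band
	residual_std = float(np.std(y - model.predict(X)))
	return(r_squared, residual_std)

## hypothetical use: discount profit projections by fit quality, or drop weak fits outright
# WeightedDailyProfit = PotentialDailyProfit * r_squared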

My adaptive approach to volatility throughout my testing has simply involved being cognizant of commodity class when placing and relisting orders; for cosmetic items, I’m unlikely to place a Buy order with a price much higher than what is indicated by the report even if there is quite a lot of room within the spread to work with. For more expensive commodities, I’m more likely to pay close attention to historical prices, ignoring opportunities which are trending far higher in their maximum Buy offers than their historical minimum. The in-game price history chart makes this step quite trivial to perform without requiring further changes to our script.

Market Manipulation

EvE Online stands apart from most MMORPGs and real-life securities exchanges in-that market manipulation is permissible. Players will not be banned or otherwise punished for engaging in market behavior which might otherwise be considered rule-breaking or illegal in other markets.

The result of this reality is that ESI market data derived from market manipulation can and will infiltrate our database and report results. This isn’t a ding against our report results as much as it is a general warning against being over-reliant on any automated report generation strategy for EvE Online; it is prudent for traders to remain vigilant and to double-check anything that might seem too good to be true. It is not outside the realm of possibility for a player to read this article and engage in market behavior which exploits player reliance on our method of report generation.

Use of Alternate Regression Methods

Our script uses linear regression, which involves fitting a straight least-squares line to a set of data. It is one of the simplest and most widely-used regression methods in statistical analysis.

There are many other regression methods, each with their own strengths and weaknesses. Polynomial regression, for instance, fits non-linear least-squares curves to the data, potentially allowing more nuanced inferences. Random forests and other machine learning algorithms can also provide more nuanced predictive analysis.
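
For reference, a polynomial fit can be wrapped in a scikit-learn Pipeline so that the fitted object still exposes the same .predict([[timestamp]]) interface our modeling functions expect. The ‘polymodel’ function below is purely a hypothetical sketch rather than a drop-in replacement for our existing lrmodel:

## sketch: a degree-2 polynomial regression built as a pipeline, stored and queried the same way as our current linear models (hypothetical helper; our real lrmodel's signature may differ)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

def polymodel(group, yvalue, degree=2):
	X = group[['timestamp']].values
	y = group[[yvalue]].values
	model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
	model.fit(X, y)
	return(model)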

While we don’t necessarily need to adopt other methods, we can keep these in-mind in-case we find linear regression to be too limiting or primitive for this or future use-cases. In the meantime, linear regression provides a computationally inexpensive method of predictive analysis, so we’ll continue using it.

Execution Duration

Running the script against seven regions with all indicated settings takes a little over 6 minutes to complete on a machine with a solid-state drive and a processor with fast single-threaded performance. On the machine I use for automated report generation, this takes closer to 30 minutes running inside a throttled LXC.

While this execution duration doesn’t currently pose a problem, if I decide in the future to increase the granularity of my locally-cached order book data from 1 hour to 5-10 minutes, this execution time could start to pose a bottleneck. This execution time could also pose a problem if I decide to adjust the OrderbookRegressionSpan variable beyond 1 (day); during my testing, doubling this value tended to roughly quadruple execution time.

Fortunately, there are a number of different ways our script’s execution time could be reduced:

  • Tightening our global variables/thresholds will reduce the total number of TypeIDs the script needs to run calculations for. This can be achieved by increasing MinROI or MinProfitPerDay, or by reducing MaxTransactionDays. While this won’t eliminate the large number of regressions calculated at the very start of the script, it will reduce the CPU and filesystem I/O needs of subsequent calculations since there will be fewer TypeIDs remaining as the script continues.
  • Multithreading/Multiprocessing: All of our regression generation and modeling functions could be trivially handed to multiple processes for concurrent calculation; the workload of each function is already split by TypeID, and with thousands of TypeIDs, this work could be spread across as many processing cores as are available (a brief sketch follows this list).
  • Better routines: It is very likely that our current procedure for calculating regression slopes and performing data modeling is not the most efficient approach, and it might be worth attempting to identify specific bottlenecks and seek refactoring approaches which could return the same output.
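
As a hint of how little change the multiprocessing route would require, below is a minimal, hypothetical parallel variant of miscstaging. It assumes the model functions and their inputs can be pickled and that the relevant globals (OrderbookUnixTime, our thresholds, and so on) are defined at module level so worker processes can see them:

## sketch: hand miscstaging's per-TypeID workload to a process pool (hypothetical helper, not part of the published script)
from multiprocessing import Pool
from functools import partial

def miscstagingparallel(TypeID_list, InputData, ModelType, workers=4):
	with Pool(workers) as pool:
		results = pool.map(partial(ModelType, InputData=InputData), TypeID_list)
	# keep the serial version's 'none' convention: skip invalid results, flatten the rest
	return([row for result in results if result != 'none' for row in result])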

For now, there is no need to improve execution time; however, these are all points to keep in-mind if we ever find that we need faster execution.

Conclusion

Throughout this first foray into report generation for EvE Online, we developed a primitive station trading report which we then vetted and transformed into a more advanced routine leveraging the power of linear regression and data modeling to provide useful results.

In our next article, we will investigate the phenomenon of price ceilings and floors as they appear in EvE Online and will develop more reporting tools leveraging our wealth of ESI-derived data.

Thank you for reading!

*Update: “Virtual Markets, Part Nine: Reprocessing” is now live!

August 27 2024 Addendum

Since publishing this article, I started to recognize some unwanted reporting behavior. Namely, some commodities which have seldom traded on the market have been appearing in reports with TotalDelta values which seem impossible given historic trade volume.

In this addendum, we’ll perform some root-cause analysis and figure out what might be causing this.

Troubleshooting our TotalDelta

Using a set of data from August 12 to investigate, one item in-particular caught my attention: type_id 14508, or “Brynn’s Modified 500MN Microwarpdrive”. This commodity, despite having only traded twice in the past month, is reporting an impossibly short TotalDelta of 2880 minutes (two days).

So, there is definitely something wrong with how the TotalDelta is being calculated for at least this particular item.

Let’s start by taking a look at our data for Type_id 14508 immediately following the data normalization routine at the start of our runtime before executing any regression or data modeling functions. Running a select query for 14508 reveals the following table:

BuySellRatio	type_id	date	timestamp	average	highest	lowest	volume	order_count	MinDate	BiasedVolume	BuyVolume	SellVolume	BuyOrders	SellOrders
0.5	14508	2024-07-14	1721042100					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-07-15	1721128500					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-07-16	1721214900					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-07-17	1721301300					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-07-18	1721387700					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-07-19	1721474100					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-07-20	1721560500					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-07-21	1721646900					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-07-22	1721733300					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-07-23	1721819700					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-07-24	1721906100					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-07-25	1721992500	3121500000	3493000000	2750000000	2	2	2023-07-28	0	1	1	1	1
0.5	14508	2024-07-26	1722078900					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-07-27	1722165300					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-07-28	1722251700					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-07-29	1722338100					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-07-30	1722424500					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-07-31	1722510900					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-08-01	1722597300					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-08-02	1722683700					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-08-03	1722770100					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-08-04	1722856500					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-08-05	1722942900					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-08-06	1723029300					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-08-07	1723115700					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-08-08	1723202100					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-08-09	1723288500					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-08-10	1723374900					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-08-11	1723461300					0	2023-07-28	0	0	0	0	0
0.5	14508	2024-08-12	1723547700					0	2023-07-28	0	0	0	0	0

Our normalized data appears unproblematic; ‘BiasedVolume’, ‘BuyVolume’, ‘SellVolume’, ‘BuyOrders’, and ‘SellOrders’ have all been normalized to display ‘0’ as their values for all days where no trade activity occurred.

Likewise, our raw values ‘average’, ‘highest’ and ‘lowest’ all present null values for every date with zero transactions, which is deliberate since price is undefined (and not necessarily 0) for any day without trade activity. The only questionable aspect of this data set is the raw null values for ‘volume’, which would be more accurately represented by 0; however, this should be a non-issue since we don’t use this value in our regression analysis.

So far so good, let’s move-on to our first set of regressions, ‘BuyOrders’, which plays a role in calculating our TotalDelta:

...
regressions = lrstaging(HistoryData, 'BuyOrders')
...

This line calls the lrstaging and lrmodel functions we use to generate regression slopes for all type_ids. We can borrow and modify a line from one of our modeling functions to take a peek at the regression result for 14508 given the current unix time:

...
    regressions = lrstaging(HistoryData, 'BuyOrders')
    BuyOrderTest = regressions.loc[regressions["type_id"] == 14508, ['BuyOrders']].values[0][0].predict([[OrderbookUnixTime]])[0][0]
    print(BuyOrderTest)
...

Given our set of normalized data where BuyOrders values are all 0 with the exception of a single 1, we should expect our regression to predict a positive value less than 1 that is very close to, but not equal to 0. Executing our script reveals…

...
1.0
...

… So, there is clearly something wrong with how our regression slope is being calculated for this type_id. This is further evident when we replace “OrderbookUnixTime” with any arbitrary number; the value returned is always 1. Our regression slope… is just a flat line. This result immediately explains the overly-optimistic TotalDelta values we’ve been obtaining, as this inflated value influences the calculations performed by our downstream data modeling functions.

So, let’s take a look next at our first regression function, lrstaging:

## stages/groups data before feeding it into lrmodel function.	
def lrstaging(HistoryData, yvalue):
	NewRegression = HistoryData.dropna().groupby('type_id').apply(lrmodel, yvalue, include_groups=False).reset_index()
	NewRegression.columns = ['type_id', yvalue]
	return(NewRegression)

Here, our data is transformed a bit before being fed into the lrmodel function where the regressions are calculated. dropna() removes null values while groupby() ensures that our results are split according to their type_id. Let’s start by dropping dropna() to see if there is any change to our BuyOrderTest:

...
NewRegression = HistoryData.groupby('type_id').apply(lrmodel, yvalue, include_groups=False).reset_index()
...
print(BuyOrderTest)
...
0.00921703126931206

… And therein lies the root of our problem.

Dropna() Context

My original intent when including dropna() was to exclude null values from being used in regression calculations for specific values, namely ‘price’. The reasoning is that null values in this context are effectively meaningless; if a store stocks an item and no units of the item sell on a particular day, price cannot be said to be any value at all because no transaction occurred. Price is not 0, nor is it the listed price, nor is it the price last sold at on an earlier day; it is strictly indefinite. As a result, it should not be included in any timeseries regression calculations when attempting to predict the movement or behavior of price… and thus in this context, dropna() ensures that regressions are calculated correctly. The same logic also extends to ‘issued time’ and ‘volume per order’, where the absence of a value does not imply any other value.

… However, the inclusion of dropna() has created an adverse effect when calculating regressions for our normalized columns such as ‘BuyOrders’, ‘SellOrders’, ‘BuyVolume’, and ‘SellVolume’, despite our prior data normalization step which converted all null values in these columns to 0. As it turns-out, per the dropna() documentation, the default behavior of the command is to drop any row in a dataframe where any value in that row is null… including the aforementioned raw values ‘average’, ‘lowest’, ‘highest’, and our unused ‘volume’.
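
A tiny synthetic reproduction (toy data, not our real table) makes the failure mode obvious:

## dropna()'s default how='any' drops every row containing a null in any column, so the zero-filled BuyOrders rows vanish along with their null price columns
import pandas as pd

df = pd.DataFrame({
	'timestamp': [1, 2, 3, 4],
	'average':   [None, None, 3.1e9, None],   # price is null on days with no trades
	'BuyOrders': [0, 0, 1, 0],                # already normalized to 0
})
print(len(df), len(df.dropna()))   # prints "4 1" -- only the single traded day survives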

Finding and Implementing a Fix

There are a handful of ways in which we could solve this issue. One approach would involve limiting the scope of columns which are fed into the lrstaging function; dropna() can only drop rows with null values if it can see them. This approach would involve making quite a few small tweaks to our code and might be the ideal and possibly more performant path to take if we planned on introducing new data into our script in the future to run regressions against. We will shelve this idea for now and possibly implement it if we decide to perform a full refactor again in the future.

Another simpler approach would involve splitting the lrstaging function into two separate functions. One function can exclude dropna() and be used for our ‘BuyVolume’ and ‘SellVolume’ regressions, while the other function would be identical to the original and be used for all other regressions. Since this approach is overall more expedient, it is the approach we’ll take for now.

Here is the result of our function splitting:

## stages/groups data before feeding it into lrmodel function. This version does not drop null values, which is desired for variables such as 'BuyVolume' and 'SellVolume' where null values are meaningfully 0 and need to be included in our regression in-order to produce an accurate result.
def lrstagingIncludeNull(HistoryData, yvalue):
	NewRegression = HistoryData.groupby('type_id').apply(lrmodel, yvalue, include_groups=False).reset_index()
	NewRegression.columns = ['type_id', yvalue]
	return(NewRegression)
	
## stages/groups data before feeding it into lrmodel function. This version drops null values, which is desired for variables such as 'price', 'issued time', and 'volume per order' where null values are meaningless and omission is needed in-order to produce an accurate result.
def lrstagingOmitNull(HistoryData, yvalue):
	NewRegression = HistoryData.dropna().groupby('type_id').apply(lrmodel, yvalue, include_groups=False).reset_index()
	NewRegression.columns = ['type_id', yvalue]
	return(NewRegression)

Commands which used to call upon lrstaging have also been reconfigured to now call either lrstagingIncludeNull or lrstagingOmitNull depending on their data type.

An audit of our new report results reveals far more realistic TotalDelta values, including the complete omission of type_ids such as 14508 which exceed our MaxTransactionDays value.

Lessons Learned

Null values can be tricky to navigate and manage in any data analytics workflow. Failure to handle them properly can create subtle, yet cascading issues resulting in misleading report results like what we’ve experienced with our TotalDelta value. Issues can also occur in-spite of data normalization procedures. As a result, any data processing routines which involve the handling of null values need to be deliberately prescribed and tested, as making assumptions can and will lead to unintended results.

Our revised station trading script can be downloaded here. If any future changes are made, they’ll be recorded here as part of a new update. Thank you for reading!

Bugfix, Oct 7 2023

After letting the script run over the past month, I encountered maybe one or two faults which caused the script to crash. After some investigation, it appears that this occurred in circumstances where our data modeling functions returned empty dataframes, which can happen when no items available to trade in the current region meet our defined thresholds. The resulting empty dataframe would cause subsequent operations to fail, throwing an exception. While rare, this can present a more annoying problem if MinROI or MinProfitPerDay are set to large values.

I’ve solved this issue by adding a simple if statement after the execution of each data modeling function:

...		
		if OutputData.empty:
			ResultData = nullresult()
			resultexport(ResultData)
			continue	
...

This allows empty dataframes to be detected, which triggers the generation of a generic output defined by the new nullresult() function. I’ve also compartmentalized export parameters in a resultexport() function as part of this update. The published copy of the script now includes these changes.