Sampling-based Feedback Directed Optimization (FDO) methods like AutoFDO and BOLT that employ profiles continuously collected in live production environments, are commonly used in datacenter applications to attain significant performance benefits without the toil of maintaining representative load tests. Sampled profiles rely on hardware facilities like Intel’s Last Branch Record (LBR) which are not currently available even on popular CPUs from ARM or AMD. Since not all architectures include a hardware LBR feature, we present an architecture agnostic approach to collect LBR-like data. We use sampling and limited program tracing to capture LBR like data from optimized and unmodified applications binaries. Since the implementation is in user space, we can collect arbitrarily long LBR buffers, and by varying the sampling rate, we can adjust the runtime overhead to arbitrarily low values. We target runtime overheads of <2% when the profiler is on and zero when it’s off. This amortizes to negligible fleet-wide collection cost given the size of a modern production fleet. We implemented a profiler that uses this method of software branch tracing. We also analyzed its overhead and the similarity of the data it collects to the Intel LBR hardware using the SPEC2006 benchmarks. Results demonstrate profile quality and optimization efficacy at parity with LBR-based AutoFDO and the target profiling overhead being achievable even without implementing any advanced tuning.