Unable to parse an exact result from a webpage using requests Unable to parse an exact result from a webpage using requests json json

Unable to parse an exact result from a webpage using requests


As mentioned in the comments, you need to convert the date to the appropriate timezone (EST), which can be done with datetime and dateutil.

Here is a working example:

import reimport jsonimport requestsfrom datetime import datetime, timezonefrom dateutil import tzurl = 'https://finance.yahoo.com/quote/GTX/financials?p=GTX'res = requests.get(url)data = re.findall(r'root.App.main[^{]+(.*);',res.text)[0]jsoncontent = json.loads(data)container = jsoncontent['context']['dispatcher']['stores']['QuoteSummaryStore']['incomeStatementHistoryQuarterly']['incomeStatementHistory']total_revenue = container[0]['totalRevenue']['raw']EST = tz.gettz('EST')raw_date = datetime.fromtimestamp(container[0]['endDate']['raw'], tz=EST)concerning_date = raw_date.date().strftime('%d-%m-%Y')print(total_revenue, concerning_date)


The updated section of this answer outlines the root cause of the date differences.


ORIGINAL ANSWER


Some of the raw values in your JSON are UNIX timestamps.

Reference from your code with modifications:

concerning_date_fmt = container[0]['endDate']['fmt']concerning_date_raw = container[0]['endDate']['raw']print(f'{concerning_date} -- {concerning_date_raw}')# output 2019-07-28 -- 1564272000 

'endDate': {'fmt': '2019-07-28', 'raw': 1564272000}

1564272000 is the number of elapsed seconds since January 01 1970. This date was the start of the Unix Epoch and the time is in Coordinated Universal Time (UTC). 1564272000 is the equivalent to: 07/28/2019 12:00am (UTC).

You can covert these timestamps to a standard datetime format by using built-in Python functions

from datetime import datetimeunix_timestamp = int('1548547200')converted_timestamp = datetime.utcfromtimestamp(unix_timestamp).strftime('%Y-%m-%dT%H:%M:%SZ')print (converted_timestamp)# output Coordinated Universal Time (or UTC)2019-07-28T00:00:00Zreformatted_timestamp = datetime.strptime(converted_timestamp, '%Y-%m-%dT%H:%M:%SZ').strftime('%d-%m-%Y')print (reformatted_timestamp)# output28-07-2019

This still does not solve your original problem related to JSON dates and column dates being different at times. But here is my current hypothesis related to the date disparities that are occurring.

  1. The json date (fmt and raw) that are being extracted from root.App.main are in Coordinated Universal Time (UTC). This is clear because of the UNIX timestamp in raw.

  2. The dates being displayed in the table columns seem to be in the Eastern Standard Time (EST) timezone. EST is currently UTC-4. Which means that 2019-07-28 22:00 (10pm) EST would be 2019-07-29 02:00 (2am) UTC. The server hosting finance.yahoo.com looks to be in the United States, based on the traceroute results. These values are also in the json file:

    • 'exchangeTimezoneName': 'America/New_York'
    • 'exchangeTimezoneShortName': 'EDT'
  3. There is also the possibility that some of the date differences are linked to the underlying React code, which the site uses. This issue is harder to diagnose, because the code isn't visible.

At this time I believe that the best solution would be to use the UNIX timestamp as your ground truth time reference. This reference could be used to replace the table column's date.

There is definitely some type of conversion happening between the JSON file and the columns.

NVIDIA JSON FILE: 'endDate': {'raw': 1561766400, 'fmt': '2019-06-29'}

NVIDIA Associated Total Revenue column: 6/30/2019

BUT the Total Revenue column date should be 6/28/2019 (EDT), because the UNIX time stamp for 1561766400 is 06/29/2019 12:00am (UTC).

The disparity with DELL is greater than a basic UNIX timestamp and a EDT timestamp conversion.

DELL JSON FILE:{"raw":1564704000,"fmt":"2019-08-02"}

DELL Associated Total Revenue column: 7/31/2019

If we convert the UNIX timestamp to an EDT timestamp, the result would be 8/1/2019, but that is not the case in the DELL example, which is 7/31/2019. Something within the Yahoo code base has to be causing this difference.

I'm starting to believe that React might be the culprit with these date differences, but I cannot be sure without doing more research.

If React is the root cause then the best option would be to use the date elements from the JSON data.


UPDATED ANSWER 10-17-2019


This problem is very interesting, because it seems that these column dates are linked to a company's official end of fiscal quarter and not a date conversation issue.

Here are several examples for

  • Apple Inc. (AAPL)
  • Atlassian Corporation Plc (TEAM)
  • Arrowhead Pharmaceuticals, Inc. (ARWR):

Their column dates are:

  • 6/30/2019
  • 3/31/2019
  • 12/31/2018
  • 9/30/2018

These dates match to these fiscal quarters.

  • Quarter 1 (Q1): January 1 - March 31.
  • Quarter 2 (Q2): April 1 - June 30.
  • Quarter 3 (Q3): July 1 - September 30.
  • Quarter 4 (Q4): October 1 - December 31

These fiscal quarter end dates can vary greatly as this DELL example shows.

DELL (posted in NASDAQ)End of fiscal quarter: July 2019

Yahoo FinanceColumn date: 7/31/2019

JSON date: 2019-08-02

From the company's website:

When does Dell Technologies’ fiscal year end?

  • Our fiscal year is the 52- or 53-week period ending on the Friday nearest January 31. Our 2020 fiscal year will end on January 31, 2020. For prior fiscal years, see list below: Our 2019 fiscal year ended on February 1, 2019 Our 2018 fiscal year ended on February 2, 2018 Our 2017 fiscal year ended on February 3, 2017 Our 2016 fiscal year ended on January 29, 2016 Our 2015 fiscal year ended on January 30, 2015 Our 2014 fiscal year ended on January 31, 2014 Our 2013 fiscal year ended on February 1, 2013

dell's fiscal quarters

NOTE: The 05-03-19 and 08-02-19 dates.

These are from the JSON quarter data for DELL:

  • {'raw': 1564704000, 'fmt': '2019-08-02'}
  • {'raw': 1556841600, 'fmt': '2019-05-03'}

It seems that these column dates are linked to a company's fiscal quarter end dates. So I would recommend that you either use the JSON date as you primary reference element or the corresponding column date.

P.S. There is some type of date voodoo occurring at Yahoo, because they seem to move these column quarter dates based on holidays, weekends and end of month.


Instead of getting the fmt of the concerning_date, It's better to get the timestamp.

concerning_date = container[0]['endDate']['raw']

In the example above you will get the result 1561852800 which you can transfer into a date with a certain timezone. (Hint: use datetime and pytz). This timestamp will yield the following results based on timezone:

Date in Los Angeles*: 29/06/2019, 17:00:00Date in Berlin* :30/06/2019, 02:00:00Date in Beijing*: 30/06/2019, 07:00:00Date in New York* :29/06/2019, 19:00:00