python lxml etree applet information from yahoo (python-3.x)

python lxml etree applet information from yahoo


The page is quite dynamic and involves a lot of javascript executed in a browser. To follow @Padraic's advice about switching to selenium, here is complete, working sample code that produces a month-to-trend dictionary at the end. The values of each bar are calculated as proportions of bar heights:

"""Scrape Yahoo Finance's analyst recommendation-trend chart for CSX.

Drives a real Chrome browser with Selenium (the page is rendered by
JavaScript/React, so plain HTTP + lxml sees no data) and builds a
month -> trend-percentage dictionary from the stacked-bar SVG chart.
"""
from pprint import pprint

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

driver = webdriver.Chrome()
try:
    driver.maximize_window()
    driver.get("https://finance.yahoo.com/quote/CSX/analysts?p=CSX")

    # Wait for the React-rendered chart section to become visible.
    wait = WebDriverWait(driver, 10)
    trends = wait.until(EC.visibility_of_element_located(
        (By.CSS_SELECTOR, "section[data-reactid$=trends]")))
    # NOTE: find_element_by_css_selector() was removed in Selenium 4;
    # use find_element(By.CSS_SELECTOR, ...) instead.
    chart = trends.find_element(By.CSS_SELECTOR, "svg.ratings-chart")

    # Axis tick labels (month names) and legend labels (trend names).
    month_names = [
        month.text
        for month in chart.find_elements(By.CSS_SELECTOR, "g.x-axis g.tick")
    ]
    trend_names = [
        trend.text
        for trend in trends.find_elements(
            By.CSS_SELECTOR, "table tr > td:nth-of-type(2)")
    ]

    # Construct the month-to-trend dictionary.  Each trend's value is its
    # bar-segment height as a percentage of the month's total bar height.
    data = {}
    months = chart.find_elements(By.CSS_SELECTOR, "g[transform]:not([class])")
    for month_name, month_data in zip(month_names, months):
        total = month_data.find_element(By.CSS_SELECTOR, "text.total").text
        data[month_name] = {'total': total}

        bars = month_data.find_elements(By.CSS_SELECTOR, "g.bar rect")
        # Bar segments are stacked bottom-up, i.e. in reverse legend order,
        # hence the reversed trend_names when pairing names with rects.
        heights = {
            trend_name: int(bar.get_attribute("height"))
            for trend_name, bar in zip(trend_names[::-1], bars)
        }
        total_height = sum(heights.values())
        # Bug fix: the original zipped `bars` here but never used the bar;
        # iterating the names alone is sufficient.  Floor division (//)
        # reproduces the integer percentages shown in the sample output
        # (the original code relied on Python 2's integer `/`).
        for trend_name in trend_names:
            data[month_name][trend_name] = heights[trend_name] * 100 // total_height

    pprint(data)
finally:
    # quit() ends the whole WebDriver session; close() would only close the
    # window and leak the chromedriver process, and the original skipped
    # cleanup entirely if any step above raised.
    driver.quit()

Prints:

{u'Aug': {u'Buy': 19,          u'Hold': 45,          u'Sell': 3,          u'Strong Buy': 22,          u'Underperform': 8,          'total': u'26'}, u'Jul': {u'Buy': 18,          u'Hold': 44,          u'Sell': 3,          u'Strong Buy': 25,          u'Underperform': 7,          'total': u'27'}, u'Jun': {u'Buy': 21,          u'Hold': 38,          u'Sell': 3,          u'Strong Buy': 28,          u'Underperform': 7,          'total': u'28'}, u'May': {u'Buy': 21,          u'Hold': 38,          u'Sell': 3,          u'Strong Buy': 28,          u'Underperform': 7,          'total': u'28'}}

The total values are labels that you see on top of each bar.

Hope this would at least be a good start for you. Let me know if you want me to elaborate on any part of the code or require any additional information.


As comments say they have moved to ReactJS, so lxml is no longer suitable because there's no data in the HTML page. Now you need to look around and find the endpoint where they are pulling the data from. In the case of Recommendation Trends, that endpoint is shown below.

#!/usr/bin/env python3
"""Fetch Yahoo Finance's recommendation-trend data for CSX from the JSON API."""
import json
from pprint import pprint
from urllib.parse import urlencode
from urllib.request import urlopen


def parse():
    """Query the quoteSummary endpoint, pretty-print and return the JSON.

    Returns:
        dict: the decoded ``quoteSummary`` response, so callers can use the
        data programmatically instead of only reading the printed output.
    """
    host = 'https://query2.finance.yahoo.com'
    path = '/v10/finance/quoteSummary/CSX'
    params = {
        'formatted': 'true',
        'lang': 'en-US',
        'region': 'US',
        'modules': 'recommendationTrend',
    }
    url = '{}{}?{}'.format(host, path, urlencode(params))
    # The context manager closes the HTTP response; the original leaked it.
    with urlopen(url) as response:
        data = json.loads(response.read().decode())
    pprint(data)
    return data


if __name__ == '__main__':
    parse()

The output looks like this.

{  'quoteSummary': {    'error': None,    'result': [{      'recommendationTrend': {        'maxAge': 86400,        'trend': [{            'buy': 0,            'hold': 0,            'period': '0w',            'sell': 0,            'strongBuy': 0,            'strongSell': 0          },          {            'buy': 0,            'hold': 0,            'period': '-1w',            'sell': 0,            'strongBuy': 0,            'strongSell': 0          },          {            'buy': 5,            'hold': 12,            'period': '0m',            'sell': 2,            'strongBuy': 6,            'strongSell': 1          },          {            'buy': 5,            'hold': 12,            'period': '-1m',            'sell': 2,            'strongBuy': 7,            'strongSell': 1          },          {            'buy': 6,            'hold': 11,            'period': '-2m',            'sell': 2,            'strongBuy': 8,            'strongSell': 1          },          {            'buy': 6,            'hold': 11,            'period': '-3m',            'sell': 2,            'strongBuy': 8,            'strongSell': 1          }]        }    }]  }}

How to look for data

What I did was roughly:

  1. Find some unique token in the target widget (say chart value or Trend string)
  2. Open source of the page (use some formatter for HTML and JS, e.g. this)
  3. Look for the token there (in the page there is a section that starts with /* -- Data -- */)
  4. Search for ".js" to get script tags (or programmatic inclusions, e.g. require.js) and look for token there
  5. Open network tab in Firebug or Chromium Developer Tools and inspect XHR requests
  6. Then use Postman (or curl if you prefer terminal) to strip extra parameters and see how the endpoint reacts