\documentclass{article} \usepackage{graphicx} % Required for inserting images \usepackage[a4paper, total={6in, 9in}]{geometry} \usepackage{amsmath}
\title{AI} \author{Manoj Khatri} \date{July 2023}
\begin{document}
\maketitle
\section{Introduction} Artificial Intelligence: -Artificial Intelligence (AI) can be defined in terms of fidelity to human performance, or in an abstract way through rationality; in general, AI is the study of how to create computer systems that can perform tasks that would normally require human intelligence to accomplish -In other words, AI can be defined as the study of how to make computers do things at which, at the moment, people are better -This involves a combination of techniques, including knowledge representation, automated reasoning, machine learning, and natural language processing
Approaches to AI: -An approach to AI is a specific method or philosophy that informs how AI is conceived, built, and evaluated -Different ways or perspectives that can be taken when designing and developing AI systems:
1. Acting Humanly:
-This approach focuses on creating AI systems that can mimic human behavior, such as speech recognition, language translation, and decision-making. The Turing Test is often used to evaluate the success of this approach, which involves testing whether a human evaluator can distinguish between responses given by a computer and those given by a human
-Examples include speech recognition systems, such as Siri or Alexa, which can understand and respond to human voice commands
2. Thinking Humanly:
-This approach focuses on creating AI systems that can model and replicate human thought processes. This approach is based on the idea that human thinking is fundamentally different from other forms of intelligence and that, by understanding how humans think, we can create more advanced and effective AI systems.
-Examples include cognitive architectures such as Soar, ACT-R, MicroPsi
3. Thinking Rationally:
-This approach focuses on creating AI systems that can reason logically and make decisions based on formal rules and representations of knowledge. This involves using techniques such as logic, inference, and knowledge representation to model reasoning.
-Examples include an expert system that provides advice to medical professionals based on a database of medical knowledge and rules
4. Acting Rationally:
-This approach focuses on creating AI systems that can make rational decisions based on the available information and their goals. This involves using techniques such as decision theory, game theory, and optimization to maximize expected utility.
-Examples include autonomous robots that can navigate their environments and perform tasks without human intervention
Overall, these approaches to AI are complementary and can be combined to create more robust and intelligent systems. The choice of approach depends on the specific task and domain in which the AI system will be applied.
Turing Test: -The Turing test is a test proposed by Alan Turing in 1950 to determine if a machine can exhibit intelligent behavior that is indistinguishable from that of a human -In this test, a human evaluator engages in a natural language conversation with both a human and a machine, without knowing which is which; if the evaluator cannot reliably distinguish between the human and the machine in their conversation, then the machine is said to have passed the test -The computer would need the following capabilities: 1. Natural language processing to communicate successfully in a human language 2. Knowledge representation to store what it knows or hears 3. Automated reasoning to answer questions and to draw new conclusions 4. Machine learning to adapt to new circumstances and to detect and extrapolate patterns -Other researchers have proposed a total Turing test, which requires interaction with objects and people in the real world; to pass the total Turing test, a robot will also need: 5. Computer vision and speech recognition to perceive the world 6. Robotics to manipulate objects and move about
Importance in AI:
1. The Turing test provides a concrete and objective way to evaluate the intelligence of a machine, by testing its ability to communicate with humans in natural language, rather than relying on subjective or abstract criteria
2. The Turing test has driven research into the nature of intelligence and the ways in which machines might be able to exhibit intelligent behavior that is comparable to humans.
3. The Turing test has led to the development of conversational agents and chatbots that are able to interact with humans in a way that is increasingly indistinguishable from human conversation.
4. The Turing test has also inspired research into the ethical implications of creating intelligent machines, and the potential risks and benefits of creating machines that can think and communicate like humans.
Ultimately, the Turing test provides a benchmark for evaluating the progress of AI research, by providing a concrete and practical goal that researchers can work towards.
Limitations:
1. The Turing test does not necessarily measure intelligence in a way that is consistent with human intelligence, as it relies heavily on the ability to mimic human behavior and language, rather than exhibiting truly intelligent behavior.
2. The Turing test assumes that intelligence is a black and white, pass or fail phenomenon, but in reality, intelligence is a complex and multifaceted concept that cannot be reduced to a simple yes or no answer.
3. The Turing test is focused on communication and language, and may not be able to assess other important aspects of intelligence, such as creativity, problem-solving, or emotional intelligence.
4. The Turing test is limited by the fact that it relies on human judges to evaluate the intelligence of machines, and different judges may have different criteria or biases that can affect the results of the test.
While the test can help to evaluate the behavior of machines in a human-like context, it may not be able to fully capture the potential consequences of creating machines that are capable of exhibiting truly intelligent behavior.
Value-alignment Problem: -The value-alignment problem refers to the challenge of ensuring that a machine’s goals and values align with those of human society -As we move into the real world, it becomes more and more difficult to specify the objective completely and correctly -For example, in designing a self-driving car, one might think that the objective is to reach the destination safely, but driving along any road incurs a risk of injury due to other errant drivers, equipment failure, and so on; thus, a strict goal of safety requires staying in the garage, so there is a trade-off between making progress towards the destination and incurring a risk of injury -If we are developing an AI system in the lab or in a simulator—as has been the case for most of the field’s history—there is an easy fix for an incorrectly specified objective: reset the system, fix the objective, and try again; as the field progresses towards increasingly capable intelligent systems that are deployed in the real world, a system deployed with an incorrect objective will have negative consequences
History:
-
The inception of artificial intelligence (1943–1956): -Warren McCulloch and Walter Pitts proposed a computational model of the brain in 1943 -Alan Turing proposed the Turing Test as a way to determine whether a machine could think like a human. -The Dartmouth Conference marked the beginning of AI as a field of study.
-
Early enthusiasm, great expectations (1952–1969): -John McCarthy coined the term “artificial intelligence” in 1955 -The development of search algorithms, game playing programs like the first chess program developed by Claude Shannon, and the first AI programs that could prove mathematical theorems, such as the Logic Theorist developed by Allen Newell and Herbert Simon.
-
A dose of reality (1966–1973): -AI researchers began to realize that many of the early promises of AI were not going to be fulfilled in the near future. -The development of the Shakey robot by SRI International, which was capable of planning actions and carrying out tasks in the physical world, but was limited in its abilities.
-
Expert systems (1969–1986): -The rise of expert systems, such as MYCIN and DENDRAL, which were used in medical diagnosis and chemical analysis, respectively -The emergence of the first AI startups, including Intellicorp and Symbolics
-
The return of neural networks (1986–present): -Neural networks were rediscovered as a way to develop machine learning algorithms, leading to the development of the backpropagation algorithm. -Development of reinforcement learning techniques for training agents to make decisions in complex environments
-
Probabilistic reasoning and machine learning (1987–present): -The rise of probabilistic reasoning and graphical models, such as Bayesian networks and Markov random fields -The development of machine learning techniques, such as support vector machines, decision trees, and random forests
-
Big data (2001–present): -The explosion of digital data from sources such as social media, sensors, and scientific instruments -The development of tools and techniques for storing, processing, and analyzing large datasets, such as Hadoop and Spark
-
Deep learning (2011–present): -The breakthroughs in deep learning research, including the development of convolutional neural networks and recurrent neural networks -The use of deep learning in a wide range of applications, such as image and speech recognition and natural language processing
Applications of AI:
-
Expert systems: -Expert systems capture the knowledge of human specialists as rules and use it to give advice in a narrow domain; examples include MYCIN for medical diagnosis and DENDRAL for chemical analysis
-
Natural Language Processing (NLP): -NLP involves teaching machines to understand human language. It has been used in applications such as speech recognition, machine translation, and sentiment analysis. Examples include IBM’s Watson, which was used to compete on Jeopardy! in 2011, and Google Translate, which uses machine learning to translate text between languages.
-
Robotics: -Robotics involves teaching machines to interact with the physical world. Applications include manufacturing, healthcare, and space exploration. Examples include the da Vinci Surgical System, which is used to perform minimally invasive surgery, and NASA’s Mars Rover, which is used to explore the surface of Mars.
-
Gaming: -AI has been used to create intelligent agents that can play games such as chess, poker, and video games. Examples include IBM’s Deep Blue, which defeated world chess champion Garry Kasparov in 1997, and OpenAI’s Dota 2 bot, which defeated professional players in 2017
-
Recommender systems: -Recommender systems use AI to provide personalized recommendations to users. Applications include e-commerce, social media, and online advertising. Examples include Netflix’s recommendation engine, which uses machine learning to suggest movies and TV shows to users, and Amazon’s product recommendation system, which suggests products based on users’ browsing and purchase history.
-
Computer vision: -Computer vision involves teaching machines to understand visual information. Applications include image and video analysis, facial recognition, and self-driving cars. Examples include facial recognition systems used to identify people in images, and Tesla’s Autopilot, which uses computer vision to drive cars autonomously.
-
Finance prediction
Agent: -An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators -An agent’s choice of action at any given instant can depend on its built-in knowledge and on the entire percept sequence observed to date, but not on anything it hasn’t perceived -A rational agent is one that does the right thing; when an agent is plunked down in an environment, it generates a sequence of actions according to the percepts it receives, causing the environment to go through a sequence of states; if the sequence is desirable, then the agent has performed well -For each possible percept sequence, a rational agent should select an action that is expected to maximize its performance measure, given the evidence provided by the percept sequence and whatever built-in knowledge the agent has -The performance measure is, initially at least, in the mind of the designer of the machine, or in the mind of the users the machine is designed for; some agents have an explicit representation of the performance measure, while in other designs the performance measure is entirely implicit - the agent may do the right thing, but it does not know why -As a general rule, it is better to design performance measures according to what one actually wants to be achieved in the environment, rather than according to how one thinks the agent should behave -Rationality maximizes expected performance while omniscience maximizes actual performance -In designing an agent, the first step must always be to specify the PEAS (Performance measure, Environment, Actuators, Sensors) as fully as possible
Types of agents: (Each figure)
-
Table-driven agents: -These agents are based on a pre-defined table of conditions and actions, and they simply match the current situation to the appropriate action in the table. These agents do not have the ability to learn or reason beyond their pre-defined rules. -An example of a table-driven agent is a thermostat that turns on or off based on the temperature readings.
\begin{verbatim}
function TABLE-DRIVEN-AGENT(percept) returns an action
  persistent: percepts, a sequence, initially empty
              table, a table of actions, indexed by percept
                     sequences, initially fully specified

  append percept to the end of percepts
  action <- LOOKUP(percepts, table)
  return action
\end{verbatim}
-
Simple reflex agents: -These agents operate based on a set of if-then rules, mapping states to actions. However, they do not have the ability to consider the future consequences of their actions. -An example of a simple reflex agent is a traffic light that changes colors based on the presence of vehicles.
\begin{verbatim}
function SIMPLE-REFLEX-AGENT(percept) returns an action
  persistent: rules, a set of condition-action rules

  state <- INTERPRET-INPUT(percept)
  rule <- RULE-MATCH(state, rules)
  action <- rule.ACTION
  return action
\end{verbatim}
-
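A minimal runnable sketch of a simple reflex agent in Python, using the two-cell vacuum world mentioned later in these notes; the percept format and the condition-action rules are illustrative assumptions:

```python
# Sketch of a simple reflex agent for the two-cell vacuum world.
# The percept format ("location", "status") and the rules below are
# illustrative assumptions, not a standard API.

def simple_reflex_vacuum_agent(percept):
    """Map the current percept directly to an action via condition-action rules."""
    location, status = percept          # e.g. ("A", "Dirty")
    if status == "Dirty":
        return "Suck"
    elif location == "A":
        return "Right"
    elif location == "B":
        return "Left"

print(simple_reflex_vacuum_agent(("A", "Dirty")))  # Suck
print(simple_reflex_vacuum_agent(("A", "Clean")))  # Right
```

Note that the agent ignores the percept history entirely; that is exactly what makes it a simple reflex agent.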
Model-based reflex agents: -These agents maintain an internal model of the world and use it to make decisions based on the current state and future consequences. -An example of a model-based reflex agent is a chess-playing computer program that uses a model of the board and possible moves to choose its next move.
\begin{verbatim}
function MODEL-BASED-REFLEX-AGENT(percept) returns an action
  persistent: state, the agent's current conception of the world state
              transition_model, a description of how the next state
                     depends on the current state and action
              sensor_model, a description of how the current world
                     state is reflected in the agent's percepts
              rules, a set of condition-action rules
              action, the most recent action, initially none

  state <- UPDATE-STATE(state, action, percept,
                        transition_model, sensor_model)
  rule <- RULE-MATCH(state, rules)
  action <- rule.ACTION
  return action
\end{verbatim}
(Here the transition model can be described as capturing both “what my actions do” and “how the world evolves”, right?)
-
Goal-based (model-based) agents: -These agents have a set of goals they are trying to achieve, and they make decisions based on those goals. They can reason about future actions and the consequences of those actions. -An example of a goal-based agent is a personal assistant that schedules your meetings based on your preferences and availability.
-
Utility-based (model-based) agents: -These agents are similar to goal-based agents, but they also consider the relative value or utility of achieving different goals. They can make trade-offs between competing goals based on their importance. -An example of a utility-based agent is a self-driving car that chooses between different routes based on factors such as travel time, safety, and fuel efficiency.
(The last two are similar; the difference is that a model-based agent acts on condition-action rules, a goal-based agent tries to move closer to its goal, and a utility-based agent tries to optimize utility.)
(Now the model has been built, but who specifies the rules for these agents?)
Learning Agents: [FIG] -Alan Turing proposed building learning machines and then teaching them, which is the preferred method for creating state-of-the-art systems -Any type of agent (model-based, goal-based, utility-based etc.) can be built as a learning agent [FIG] -The learning element is responsible for making improvements; it uses feedback from the critic on how the agent is doing and determines how the performance element should be modified to do better in the future -The design of the learning element depends very much on the design of the performance element, which is responsible for selecting external actions; when trying to design an agent that learns a certain capability, the first question is not “How am I going to get it to learn this?” but “What kind of performance element will my agent use to do this once it has learned how?” -The critic is necessary because the percepts themselves provide no indication of the agent’s success; for example, a chess program could receive a percept indicating that it has checkmated its opponent, but it needs a performance standard to know that this is a good thing; the percept itself does not say so -The problem generator is responsible for suggesting actions that will lead to new and informative experiences; if the performance element had its way, it would keep doing the actions that are best given what it knows, but if the agent is willing to explore a little and do some perhaps suboptimal actions in the short run, it might discover much better actions for the long run -The learning element can make changes to any of the components; observation of pairs of successive states of the environment can allow the agent to learn “What my actions do” and “How the world evolves” in response to its actions -For example: (If the taxi driver presses the brake this hard, this much deceleration results; the problem generator says to keep exploring; also, in another example, the agent needs to understand that receiving no tips is a negative reward.)
Problem-solving agents: -When the correct action to take is not immediately obvious, an agent may need to plan ahead: to consider a sequence of actions that form a path to a goal state; such an agent is called a problem-solving agent, and the computational process it undertakes is called search -With access to information about the world, an AI agent can follow a four-phase problem-solving process: 1. Goal Formulation: -Goals organize behavior by limiting the objectives and hence the actions to be considered 2. Problem Formulation: -The agent devises a description of the states and actions necessary to reach the goal—an abstract model of the relevant part of the world 3. Search: -Before taking any action in the real world, the agent simulates sequences of actions in its model, searching until it finds a sequence of actions that reaches the goal; such a sequence is called a solution 4. Execution: -The agent can now execute the actions in the solution, one at a time
-The problem can be formally defined as:
1. A set of possible states that the environment can be in, call this the state space
2. The initial state that the agent starts in
3. A set of one or more goal states, sometimes defined by a property that applies to many states
4. The actions available to the agent, given a state s, ACTIONS(s) returns a finite set of actions that can be executed in s
5. A transition model, which describes what each action does, RESULT(s,a) returns the state that results from doing action a in state s
6. An action cost function, denoted by ACTION-COST(s,a,s') that gives the numeric cost of applying action a in state s to reach state s', assume all action costs will be positive, to avoid certain complications
-A sequence of actions forms a path, and a solution is a path from the initial state to a goal state, an optimal solution has the lowest path cost among all solutions
-The state space can be represented as a graph in which the vertices are states and the directed edges between them are actions
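The six-part formal definition above can be sketched as a small Python class; the toy road map, state names, and method names here are illustrative assumptions, not a standard API:

```python
# Sketch of the formal problem definition: initial state, goal states,
# ACTIONS(s), RESULT(s, a), and ACTION-COST(s, a, s').
# The road map below is a made-up example.

ROADS = {  # state -> {neighbor reachable by one action: cost}
    "A": {"B": 1, "C": 4},
    "B": {"C": 2, "D": 5},
    "C": {"D": 1},
    "D": {},
}

class Problem:
    def __init__(self, initial, goals):
        self.initial = initial
        self.goals = goals

    def is_goal(self, s):
        return s in self.goals

    def actions(self, s):              # ACTIONS(s): moves available in state s
        return list(ROADS[s])

    def result(self, s, a):            # RESULT(s, a): transition model
        return a                       # here the action is "go to neighbor a"

    def action_cost(self, s, a, s2):   # ACTION-COST(s, a, s'), always positive
        return ROADS[s][a]

p = Problem("A", {"D"})
print(p.actions("A"))                  # ['B', 'C']
print(p.action_cost("A", "B", "B"))    # 1
```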
-Standard problems:
>Two-cell vacuum world (8 states)
>Sliding-tile puzzle (the 8-puzzle has 9!/2 = 181,440 reachable states)
Production systems (from another book): -
Search solutions: -The state space describes the (possibly infinite) set of states in the world, and the actions that allow transitions from one state to another, while the search tree describes paths between these states, reaching towards the goal -The search tree may have multiple paths to a given state, but each node in the tree has a unique path back to the root (as in all trees) [so the same state can appear repeatedly in the tree] -The root node of the search tree is at the initial state; we can expand a node by considering the available ACTIONS for that state, using the RESULT function to see where those actions lead, and generating a new node for each of the resulting states; choosing which of the children to consider next is the essence of search
-As the tree grows, the set of unexpanded nodes forms the frontier of the search tree, which separates two regions of the state-space graph: an interior region where every state has been expanded, and an exterior region of states that have not yet been reached
[States are either reached or unreached, where reached states can be further divided into expanded and unexpanded]
1. Best-first search:
-On each iteration we choose a node on the frontier with the minimum value of some evaluation function f(n), return it if its state is a goal state, and otherwise apply EXPAND to generate child nodes; each child node is added to the frontier if it has not been reached before, or is re-added if it is now being reached with a path that has a lower path cost than any previous path
-The algorithm returns either an indication of failure, or a node that represents a path to a goal.
Data structures
-A node in the tree is represented by a data structure with four components:
1. node.STATE:
-the state to which the node corresponds
2. node.PARENT:
-the node in the tree that generated this node
3. node.ACTION:
-the action that was applied to the parent's state to generate this node
4. node.PATH-COST:
-the total cost of the path from the initial state to this node
-Need an appropriate queue of some kind to store the frontier:
1. IS-EMPTY(frontier)
-returns true only if there are no nodes in the frontier
2. POP(frontier):
-removes the top node from the frontier and returns it
3. TOP(frontier):
-returns but does not remove the top node of the frontier
4. ADD(node, frontier)
-inserts node into its proper place in the queue
-Reached states can be stored as a lookup table where each key is a state and each value is the node for that state
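Putting the frontier queue and the reached lookup table together, a minimal best-first search might be sketched as follows; here f(n) is taken to be the path cost so far (which makes this uniform-cost search), and the toy graph is a made-up example:

```python
import heapq

# Sketch of best-first search over an explicit graph, with the frontier
# as a priority queue ordered by f(n) and `reached` as a lookup table
# keyed by state. Here f(n) = path cost (uniform-cost search).
# The graph is a made-up example.

GRAPH = {"A": {"B": 1, "C": 4}, "B": {"C": 2, "D": 5}, "C": {"D": 1}, "D": {}}

def best_first_search(start, goal):
    frontier = [(0, start)]          # min-heap of (f, state)
    reached = {start: 0}             # state -> cheapest path cost found so far
    while frontier:
        cost, state = heapq.heappop(frontier)
        if state == goal:            # goal test on expansion (late goal test)
            return cost
        for nbr, step in GRAPH[state].items():   # EXPAND the node
            new_cost = cost + step
            if nbr not in reached or new_cost < reached[nbr]:
                reached[nbr] = new_cost          # keep only the best path
                heapq.heappush(frontier, (new_cost, nbr))
    return None                      # indication of failure

print(best_first_search("A", "D"))  # 4  (A -> B -> C -> D)
```

Testing the goal only when a node is expanded, rather than when it is generated, is what preserves cost-optimality here.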
(Going to a place and then coming back to it is itself just a sequence of actions.)
Redundant paths:
-A redundant path in a search algorithm is a path that has already been explored or considered during the search but is revisited, leading to unnecessary computation and slowing down the search process
-A cycle is a special case of a redundant path: it's just a worse way to get to the same state, and need not be considered in our quest for optimal paths; it can also lead to an infinite loop or an endless cycle of exploration
-Three approaches:
1. Remember all previously reached states (as best-first does), allowing us to detect all redundant paths, and keep only the best path to each state, appropriate for state spaces where there are many redundant paths, and is the preferred choice when the table of reached states will fit in memory
2. Not worry about repeating the past, for some problem formulations it is rare or impossible for two paths to reach the same state, call a search algorithm a graph search if it checks for redundant paths and a tree-like search if it does not
3. Can compromise and check for cycles, but not for redundant paths in general; since each node has a chain of parent pointers, we can check for cycles with no need for additional memory by following up the chain of parents to see if the state at the end of the path has appeared earlier in the path; some implementations follow this chain all the way up, and thus eliminate all cycles; other implementations follow only a few links
(Redundant paths are problematic in the sense that they slow the search down, so removing them is good; cycles can actually get the search stuck.)
[With graph search there is no worry about cycles, but with tree-like search cycles must be removed, because if the algorithm gets stuck in a cycle it will never return a solution.]
[In tree-like search, redundant paths merely waste time, which is tolerable.]
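The parent-chain cycle check from approach 3 can be sketched like this; representing a node as a (state, parent) pair is an illustrative assumption:

```python
# Sketch of the cycle check described above: walk up the chain of parent
# pointers and see whether the node's state already appears on the path.
# Nodes are represented as (state, parent) pairs for illustration.

def is_cycle(node):
    state, parent = node
    while parent is not None:        # follow the chain all the way up
        if parent[0] == state:       # the same state appeared earlier
            return True
        parent = parent[1]
    return False

root = ("A", None)
child = ("B", root)
loop = ("A", child)        # path A -> B -> A revisits A
print(is_cycle(loop))      # True
print(is_cycle(child))     # False
```

Stopping the loop after a fixed number of links would give the cheaper "follow only a few links" variant.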
Measuring problem-solving performance: -Criteria used to evaluate search algorithms:
1. Completeness:
-Is the algorithm guaranteed to find a solution when there is one, and to correctly report failure when there is not?
-In a search problem with a single goal that could be anywhere in the state space, a complete algorithm must be capable of systematically exploring every state that is reachable from the initial state; in a finite space that is straightforward to achieve: as long as we keep track of paths and cut off ones that are cycles, we will eventually reach every reachable state
-In infinite state spaces, the algorithm must be systematic in the way it explores the space, making sure it can eventually reach any state that is connected to the initial state
[If an infinite space has no solutions, completeness loses its meaning.]
2. Cost optimality:
-Does it find a solution with the lowest path cost of all solutions?
3. Time complexity:
-How long does it take to find a solution? This can be measured in seconds, or more abstractly by the number of states and actions
4. Space complexity:
-How much memory is needed to perform the search?
-In many AI problems, the graph is represented only implicitly by the initial state, actions, and transition model, so complexity can be measured in terms of d, the depth or number of actions in an optimal solution; m, the maximum number of actions in any path; and b, the branching factor or number of successors of a node that need to be considered
Uninformed search strategies: -An uninformed search algorithm is given no clue about how close a state is to the goal(s)
2. Breadth-first search:
-When all actions have the same cost, an appropriate strategy is breadth-first search, in which the root node is expanded first, then all the successors of the root node are expanded next, then their successors, and so on
-Breadth-first search could be implemented as a call to BEST-FIRST-SEARCH where the evaluation function f(n) is the depth of the node - that is, the number of actions it takes to reach the node - but we can get additional efficiency with a couple of tricks [...]
-[FIG of BFS in BT]
(All action costs are equal here, so we no longer need to consider them.)
Performances:
1. A systematic search strategy, and therefore complete even on infinite state spaces
2. Finds a solution with a minimal number of actions, so it is cost-optimal for problems where all actions have the same cost, but not for problems without that property; it is complete in either case
3. Suppose the root of the tree generates b nodes, each of which generates b more nodes, for a total of b^2 at the second level; each of these generates b more nodes, yielding b^3 nodes at the third level, and so on; if the solution is at depth d, then the total number of nodes generated is O(b^d)
4. All the nodes remain in memory, so both time and space complexity are O(b^d); the memory requirements are a bigger problem than the execution time
(Breadth-first search maintains the reached list in an optimized way, so it is a graph search.)
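A minimal sketch of breadth-first search with a FIFO frontier and a reached set, applying the goal test when a child is generated (an efficiency trick that is safe because all actions have equal cost); the graph is a made-up example:

```python
from collections import deque

# Sketch of breadth-first search: FIFO frontier, `reached` set, and an
# early goal test applied when a node is generated rather than expanded.
# The graph is a made-up example.

GRAPH = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

def breadth_first_search(start, goal):
    if start == goal:
        return 0
    frontier = deque([(start, 0)])   # FIFO queue of (state, depth)
    reached = {start}
    while frontier:
        state, depth = frontier.popleft()
        for child in GRAPH[state]:
            if child == goal:        # early goal test on generation
                return depth + 1
            if child not in reached:
                reached.add(child)
                frontier.append((child, depth + 1))
    return None                      # failure

print(breadth_first_search("A", "D"))  # 2
```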
3. Dijkstra uniform-cost:
-...
4. Depth-first search:
-Depth-first search always expands the deepest node in the frontier first
-Could be implemented as a call to BEST-FIRST-SEARCH where the evaluation function f is the negative of the depth, but usually implemented not as a graph search but as a tree-like search
-[FIG of DFS in BT]
Performances:
1. In infinite state spaces, can get stuck going down an infinite path, even if there are no cycles
[Since it is a tree-like search, it can also get stuck in cycles.]
2. Returns the first solution it finds, even if it is not cheapest
3. A depth-first tree-like search takes time proportional to the number of states: O(b^m)
4. Memory complexity of only O(bm), where b is the branching factor and m is the maximum depth of the tree
[think of the frontier in breadth-first search as the surface of an ever-expanding sphere, while the frontier in depth-first search is just a radius of the sphere]
[Depth-first search does not care whether action costs are uniform or not.]
5. Depth-limited:
-To keep depth-first search from wandering down an infinite path, we can use depth-limited search, a version of depth-first search in which we supply a depth limit, l, and treat all nodes at depth l as if they had no successors; time complexity is O(b^l) and space complexity is O(bl); unfortunately, if we make a poor choice for l the algorithm will fail to reach the solution, making it incomplete again
-Since depth-first search is a tree-like search, we can't keep it from wasting time on redundant paths in general, but can eliminate cycles at the cost of some computation time, if we look only a few links up in the parent chain we can catch most cycles; longer cycles are handled by the depth limit
(With a good choice of l, depth-limited search avoids infinite paths; combined with cycle removal, it can be complete, solving one issue.)
6. Iterative deepening:
-Solves the problem of picking a good value for l by trying all values: 0 first, then 1, then 2, and so on - until either a solution is found, or the depth-limited search returns failure
-Combines many of the benefits of depth-first and breadth-first search, is optimal for problems where all actions have the same cost
Performances:
1. Complete on finite acyclic state spaces, or on any finite space when we check nodes for cycles all the way up the path
2. Only if costs are identical
3. States near the top of the search tree are re-generated multiple times, which in the worst case gives a time complexity of O(b^d) - asymptotically the same as breadth-first search
4. Memory requirements are modest: O(bd) when there is a solution, or O(bm) on finite state spaces with no solution
(Iterating over values of l effectively makes a good choice of l; together with cycle removal this solves completeness, except for infinite spaces; if all costs are equal it also solves optimality.)
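Iterative deepening can be sketched as a loop over depth-limited searches; the graph (which deliberately contains the cycle A-B-A) and the max_depth cap are illustrative assumptions:

```python
# Sketch of iterative deepening: run depth-limited DFS with limits
# l = 0, 1, 2, ... until a solution is found. The tree-like search checks
# for cycles along the current path. The graph is a made-up example that
# deliberately contains the cycle A <-> B.

GRAPH = {"A": ["B", "C"], "B": ["A", "D"], "C": ["D"], "D": []}

def depth_limited(state, goal, limit, path):
    if state == goal:
        return list(path)                # solution: the path so far
    if limit == 0:
        return None                      # cutoff: treat node as having no successors
    for child in GRAPH[state]:
        if child in path:                # cycle check up the current path
            continue
        result = depth_limited(child, goal, limit - 1, path + [child])
        if result is not None:
            return result
    return None

def iterative_deepening(start, goal, max_depth=10):
    for limit in range(max_depth + 1):   # try l = 0, 1, 2, ...
        result = depth_limited(start, goal, limit, [start])
        if result is not None:
            return result
    return None

print(iterative_deepening("A", "D"))  # ['A', 'B', 'D']
```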
7. Bidirectional:
[...]
Informed (Heuristic) search: -Uses domain-specific hints about the location of goals and can find solutions more efficiently than an uninformed strategy; the hints come in the form of a heuristic function, denoted by h(n) (why n, not s?):
h(n) = estimated cost of the cheapest path from the state at node n to a goal state
-For example, in route-finding problems, we can estimate the distance from the current state to a goal by computing the straight-line distance on the map between the two points
8. Greedy best-first search:
-Expands the node with the lowest h(n) value - the node that appears to be closest to the goal - on the grounds that this is likely to lead to a solution quickly
-Evaluation function f(n) = h(n)
Performances:
1. Complete in finite state spaces, but not in infinite ones
2. Not necessarily
3. Worst-case time and space complexity is O(|V|), with a good heuristic function on certain problems can be O(bm)
9. A* search:
-Uses the evaluation function:
f(n) = g(n) + h(n)
where,
g(n) is the path cost from the initial state to node n,
h(n) is the estimated cost of the cheapest path from n to a goal state
thus,
f(n) = estimated cost of the best path that continues from n to goal
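A minimal sketch of A* on a small grid, with g(n) the path cost so far and h(n) the Manhattan distance to the goal (an admissible heuristic for unit-cost 4-way moves); the grid world itself is a made-up example:

```python
import heapq

# Sketch of A* search on a small grid: f(n) = g(n) + h(n), where g is the
# path cost so far and h is the Manhattan distance to the goal (admissible
# for unit-cost 4-way moves). The grid world is a made-up example.

def a_star(start, goal, walls, size=5):
    def h(cell):                       # admissible heuristic: Manhattan distance
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start)]  # min-heap of (f, g, state)
    reached = {start: 0}               # state -> best g found so far
    while frontier:
        f, g, cell = heapq.heappop(frontier)
        if cell == goal:               # goal test on expansion
            return g
        x, y = cell
        for nbr in [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]:
            if not (0 <= nbr[0] < size and 0 <= nbr[1] < size) or nbr in walls:
                continue
            g2 = g + 1                 # every move costs 1
            if nbr not in reached or g2 < reached[nbr]:
                reached[nbr] = g2
                heapq.heappush(frontier, (g2 + h(nbr), g2, nbr))
    return None                        # failure

print(a_star((0, 0), (4, 4), walls=set()))  # 8
```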
Performances:
1. Complete (provided the state space either has a solution or is finite)
2. Cost-optimalilty depends on certain properties of the heuristic
>if heuristic is admissible that never overestimates the cost to reach a goal (therefore optimistic), if h(n) is consistent if given h*(n) is the optimal cost to reach the nearest goal state from state n then
h(n) <= h*(n)
>a slightly stronger property is consistency: h(n) is consistent if, for every node n and every successor n' of n generated by an action a, we have
h(n) <= c(n,a,n') + h(n')
which is a form of the triangle inequality, stipulating that a side of a triangle cannot be longer than the sum of the other two sides; an example is straight-line distance in route-finding problems
Proof (cost-optimality with an admissible heuristic):
>Assume first that the algorithm returns a non-optimal solution: the optimal path has cost C*, but the algorithm returns a path with cost C > C*
>Then there must be some node n which is on the optimal path and is unexpanded, because if all the nodes on the optimal path had been expanded, we would have returned that optimal solution
>Using the notation g*(n) to mean the cost of the optimal path from the start to n, and h*(n) to mean the cost of the optimal path from n to the nearest goal,
| f(n) > C*; otherwise n would have been expanded before the search terminated
But also,
| f(n) = g(n) + h(n); by definition
| f(n) = g*(n) + h(n); because n is on an optimal path
| f(n) <= g*(n) + h*(n); by admissibility, h(n) <= h*(n)
| f(n) <= C*; because g*(n) + h*(n) = C* for a node on the optimal path
This contradicts f(n) > C*, so the returned solution must in fact be optimal
3. Worst-case time complexity is exponential, though a good heuristic can greatly reduce the effective branching factor
4. Space is the main drawback: A* keeps all generated nodes in memory, so it can run out of space long before it runs out of time
X: Visualize searches:
-Is to draw contours in the state space of f(n) = g(n) + h(n); as A* expands the frontier node of lowest f-cost, we see that A* search fans out from the start node, adding nodes in concentric bands of increasing f-cost
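As a sketch of the expansion just described (the graph, step costs, and heuristic table below are made up for illustration), A* on a small route-finding problem in Python:

```python
import heapq

def a_star(start, goal, neighbors, h):
    """Best-first search on f(n) = g(n) + h(n): always expand the frontier
    node with the lowest f-cost."""
    frontier = [(h(start), 0, start, [start])]       # (f, g, node, path)
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        for succ, step_cost in neighbors(node):
            g2 = g + step_cost
            if g2 < best_g.get(succ, float('inf')):  # found a cheaper path
                best_g[succ] = g2
                heapq.heappush(frontier, (g2 + h(succ), g2, succ, path + [succ]))
    return None, float('inf')

# Made-up route-finding graph; h never overestimates, so it is admissible.
graph = {'A': [('B', 1), ('C', 4)],
         'B': [('C', 2), ('D', 5)],
         'C': [('D', 1)],
         'D': []}
h_table = {'A': 3, 'B': 2, 'C': 1, 'D': 0}
path, cost = a_star('A', 'D', lambda n: graph[n], lambda n: h_table[n])
print(path, cost)  # → ['A', 'B', 'C', 'D'] 4
```

The heap orders entries by f, so ties are broken by lower g; the `best_g` table plays the role of the reached set.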
...
Local Search and Optimization:
-Operate by searching from a start state to neighboring states, without keeping track of the paths, nor of the set of states that have been reached; this means they are not systematic - they might never explore a portion of the search space where a solution actually resides
-Two key advantages: they use very little memory, and they can often find reasonable solutions in large or infinite state spaces for which systematic algorithms are unsuitable
-Can solve optimization problems, in which the aim is to find the best state according to an objective function
-In a state-space landscape, if elevation corresponds to an objective function, then the aim is to find the highest peak - a global maximum - and we call the process hill climbing; if elevation corresponds to cost, then the aim is to find the lowest valley - a global minimum - and we call it gradient descent
1. Hill climbing:
-Hill-climbing keeps track of one current state and on each iteration moves to the neighboring state with the highest value (of the evaluation function) - that is, it heads in the direction that provides the steepest ascent, terminating when it reaches a peak where no neighbor has a higher value
-Does not look ahead beyond the immediate neighbors of the current state, resembles trying to find the top of Mount Everest in a thick fog while suffering from amnesia
-One way to use hill-climbing search is to use the negative of a heuristic cost function as the objective function; that will climb locally to the state with the smallest heuristic distance to the goal
-Example:
>Every state has 8 queens on the board, one per column, initial state is chosen at random, and the successors of a state are all possible states generated by moving a single queen to another square in the same column (56 successors)
>Heuristic cost function h is the number of pairs of queens that are attacking each other; this will be zero only for solutions
>[FIG]
>Takes five steps to reach the state in which h = 1 which is nearly a solution
-Can get stuck for any of the following reasons:
A. Local maxima:
-A local maximum is a peak that is higher than each of its neighboring states but lower than the global maximum; if the search reaches the vicinity of a local maximum it will be drawn upward toward the peak but will then be stuck with nowhere else to go
B. Ridges:
-Ridges result in a sequence of local maxima that is very difficult for greedy algorithms to navigate
C. Plateaus:
-A flat area of the state-space landscape; can be a flat local maximum, from which no uphill exit exists, or a shoulder, from which progress is possible
-Starting from a randomly generated 8-queens state, steepest-ascent hill climbing gets stuck 86% of the time, solving only 14% of problem instances, taking just 4 steps on average when it succeeds and 3 when it gets stuck
-Allowing sideways moves (up to some limit) when we reach a plateau, in the hope that it is really a shoulder, raises the percentage of problem instances solved from 14% to 94%; it averages roughly 21 steps for each successful instance and 64 for each failure
-Success of hill climbing depends very much on the shape of the state-space landscape
-Another variant is stochastic hill climbing, which chooses at random from among the uphill moves; the probability of selection can vary with the steepness of the uphill move; it usually converges more slowly than steepest ascent, but in some state landscapes it finds better solutions
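The 8-queens formulation above can be sketched as follows; the encoding (state[col] = row, one queen per column) matches the example, while the 100-instance experiment and the random seed are assumptions for illustration:

```python
import random

def conflicts(state):
    """h = number of pairs of queens attacking each other; state[col] = row,
    one queen per column, so only same-row and diagonal attacks can occur."""
    n = len(state)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if state[i] == state[j] or abs(state[i] - state[j]) == j - i)

def hill_climb(n=8, rng=random):
    """Steepest-ascent hill climbing: jump to the best successor (moving one
    queen within its column) until no successor is strictly better."""
    state = [rng.randrange(n) for _ in range(n)]
    while True:
        h = conflicts(state)
        if h == 0:
            return state, h                # solved
        best, best_h = state, h
        for col in range(n):
            for row in range(n):
                if row != state[col]:
                    succ = state[:col] + [row] + state[col + 1:]
                    if conflicts(succ) < best_h:
                        best, best_h = succ, conflicts(succ)
        if best_h == h:
            return state, h                # local maximum or plateau: stuck
        state = best

random.seed(0)
solved = sum(1 for _ in range(100) if hill_climb()[1] == 0)
print(f"solved {solved}/100")  # roughly the ~14% success rate quoted above
```

Because only strictly better successors are accepted, the run gets stuck on plateaus and local maxima, which is exactly what the 14% figure measures.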
2. Simulated annealing:
-A hill-climbing algorithm that never makes downhill moves toward states with lower value is always vulnerable to getting stuck in a local maximum; in contrast, a purely random walk that moves to a successor state without concern for the value will eventually stumble upon the global maximum, but is extremely inefficient; it therefore seems reasonable to combine hill climbing with a random walk in a way that yields both efficiency and completeness
-Switch our point of view from hill climbing to gradient descent, with the task of getting a ping-pong ball into the deepest crevice in a very bumpy surface
-If we just let the ball roll, it will come to rest at a local minimum; the trick is to shake just hard enough to bounce the ball out of local minima but not hard enough to dislodge it from the global minimum; the simulated-annealing solution is to start shaking hard and then gradually reduce the intensity of the shaking
-Structure is similar to hill climbing, but instead of picking the best move it picks a random move; if the move improves the situation, it is always accepted, otherwise the algorithm accepts the move with some probability less than 1
-The probability decreases exponentially with the badness of the move - the amount dE by which the evaluation is worsened - and also decreases as the temperature T (set by the schedule) goes down: bad moves are likely to be allowed at the start when T is high, and they become more unlikely as T decreases
-If the schedule lowers T to 0 slowly enough, then by a property of the Boltzmann distribution, e^{-dE/T}, all the probability is concentrated on the global maximum, which the algorithm finds with probability approaching 1
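A minimal sketch of the schedule-and-accept loop above; the 1-D objective (local maximum near x = -1, global maximum near x = +1), step size, cooling schedule, and restart count are all illustrative assumptions, not from the notes:

```python
import math, random

def simulated_annealing(start, neighbor, value, schedule, rng):
    """Pick a random successor; always accept an improving move, and accept a
    worsening move with probability e^(dE/T), which shrinks as T cools."""
    current, t = start, 0
    while True:
        T = schedule(t)
        if T < 1e-6:                       # schedule exhausted: stop
            return current
        nxt = neighbor(current, rng)
        dE = value(nxt) - value(current)   # dE < 0 means a downhill (bad) move
        if dE > 0 or rng.random() < math.exp(dE / T):
            current = nxt
        t += 1

# Assumed 1-D objective: local maximum near x = -1, global maximum near x = +1.
value    = lambda x: -(x * x - 1) ** 2 + 0.5 * x
neighbor = lambda x, rng: x + rng.uniform(-0.5, 0.5)
schedule = lambda t: 2.0 * (0.99 ** t)     # exponential cooling

rng = random.Random(0)
best = max((simulated_annealing(rng.uniform(-2, 2), neighbor, value, schedule, rng)
            for _ in range(20)), key=value)
print(round(best, 2))  # typically ends near the global maximum around x ≈ 1
```

The random restarts are not part of simulated annealing itself; they just make the demonstration robust to an unlucky starting point.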
[...a myriad of algorithms...]
Adversarial search and games:
-Cover competitive environments, in which two or more agents have conflicting goals, giving rise to adversarial search problems
-Three stances towards multi-agent environments:
1. When there are a very large number of agents, it is appropriate to consider them in aggregate as an economy, allowing us to do things like predict that increasing demand will cause prices to rise, without having to predict the action of any individual agent
2. Consider adversarial agents as just a part of the environment - a part that makes the environment nondeterministic; but if we model the adversaries in the same way that, say, rain sometimes falls and sometimes doesn't, we miss the idea that our adversaries are actively trying to defeat us, whereas the rain supposedly has no such intention
3. Explicitly model the adversarial agents with the techniques of game-tree search
Two player zero-sum games:
-Most commonly studied within AI are games that are deterministic, two-player, turn-taking, perfect information, zero-sum games
-
...
A game can be formally defined with the following elements:
1. S0: The initial state, which specifies how the game is set up at the start
2. TO-MOVE(s): The player whose turn it is to make a move in state s
3. ACTIONS(s): The set of legal moves in state s
4. RESULT(s,a): The transition model, which defines the state resulting from taking action a in state s
5. IS-TERMINAL(s): A terminal test, which is true when the game is over and false otherwise; states where the game has ended are called terminal states
6. UTILITY(s,p): A utility function, which defines the final numeric value to player p when the game ends in terminal state s
(the two players are conventionally named MAX and MIN)
-Given a game tree, the optimal strategy can be determined by working out the minimax value of each state in the tree, which we write as MINIMAX(s); the minimax value is the utility for MAX of being in that state, assuming that both players play optimally from there to the end of the game
-In a non-terminal state MAX prefers to move to a state of maximum value when it is MAX's turn to move, and MIN prefers a state of minimum value (that is, minimum value for MAX and thus maximum value for MIN)
-Minimax is a recursive algorithm that proceeds all the way down to the leaves of the tree and then backs up the minimax values through the tree as the recursion unwinds
-(insert a worked example here)
-Minimax is a depth-first search, so it has depth-first complexity: exponential in the depth of the tree, which is terrible
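The recursion just described, worked on the classic two-ply tree (MAX to move, MIN replies, leaves are utilities for MAX); the `TreeGame` wrapper is an assumed stand-in for the formal elements ACTIONS, RESULT, IS-TERMINAL and UTILITY:

```python
def minimax(state, game, maximizing):
    """Depth-first recursion to the leaves; minimax values are backed up
    as the recursion unwinds."""
    if game.is_terminal(state):
        return game.utility(state), None
    best_val, best_move = None, None
    for move in game.actions(state):
        val, _ = minimax(game.result(state, move), game, not maximizing)
        if (best_val is None or (maximizing and val > best_val)
                or (not maximizing and val < best_val)):
            best_val, best_move = val, move
    return best_val, best_move

class TreeGame:
    """Explicit game tree: internal nodes are dicts mapping moves to subtrees,
    leaves are utilities for MAX."""
    def is_terminal(self, s): return not isinstance(s, dict)
    def utility(self, s): return s
    def actions(self, s): return list(s)
    def result(self, s, a): return s[a]

tree = {'a': {'a1': 3, 'a2': 12, 'a3': 8},
        'b': {'b1': 2, 'b2': 4, 'b3': 6},
        'c': {'c1': 14, 'c2': 5, 'c3': 2}}
val, move = minimax(tree, TreeGame(), maximizing=True)
print(val, move)  # → 3 a  (MIN holds each branch to its minimum; MAX picks 'a')
```

MIN drives branches a, b, c down to 3, 2, 2 respectively, so MAX's best opening move is 'a' with minimax value 3.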
Alpha beta pruning:
-Cannot eliminate the exponent, but can sometimes cut the effective branching factor in half, computing the correct minimax decision without examining every state, by pruning large parts of the tree that make no difference to the outcome
-General principle is this: consider a node n somewhere in the tree, such that Player has a choice of moving to n; if Player has a better choice either at the same level or at any point higher up in the tree, then Player will never move to n, so once we have found out enough about n (by examining some of its descendants) to reach this conclusion, we can prune it
-Alpha beta pruning gets its name from the two extra parameters in MAX-VALUE(state, alpha, beta)
where,
alpha = the value of the best (ie. highest-value) choice we have found so far at any choice point along the path for MAX, alpha = at least
beta = the value of the best (lowest value) choice we have found so far at any choice point along the path for MIN, think beta = at most
-Alpha-beta updates the values of alpha and beta as it goes along and prunes the remaining branches at a node (i.e., terminates the recursive call) as soon as the value of the current node is known to be worse than the current alpha or beta value for MAX or MIN respectively
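The same two-ply tree run through alpha-beta; counting leaf evaluations shows the pruning at work (7 leaves examined instead of all 9). The list-of-lists tree encoding is an illustrative simplification:

```python
visited = []  # leaf evaluations, recorded to show how much gets pruned

def alphabeta(tree, maximizing=True, alpha=float('-inf'), beta=float('inf')):
    """Minimax with alpha-beta pruning over an explicit tree: leaves are MAX
    utilities, internal nodes are lists of child subtrees."""
    if not isinstance(tree, list):
        visited.append(tree)
        return tree
    if maximizing:
        v = float('-inf')
        for child in tree:
            v = max(v, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, v)      # alpha: best-so-far for MAX on this path
            if alpha >= beta:
                break                  # MIN above will never let play reach here
    else:
        v = float('inf')
        for child in tree:
            v = min(v, alphabeta(child, True, alpha, beta))
            beta = min(beta, v)        # beta: best-so-far for MIN on this path
            if alpha >= beta:
                break                  # MAX above already has a better option
    return v

tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]   # same two-ply example as minimax
print(alphabeta(tree), len(visited))  # → 3 7  (the leaves 4 and 6 are pruned)
```

In the middle branch, as soon as MIN finds the leaf 2, beta = 2 falls below MAX's alpha = 3, so the siblings 4 and 6 are never examined.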
Constraint satisfaction problems:
-From the point of view of the search algorithm, each state is atomic, or indivisible - a black box with no internal structure
-Open the black box by using a factored representation for each state: a set of variables, each of which has a value; a problem is solved when each variable has a value that satisfies all the constraints on the variable, and such problems are called constraint satisfaction problems (CSPs)
-CSP search algorithms take advantage of the structure of states and use general rather than domain-specific heuristics to enable the solution of complex problems; the main idea is to eliminate large portions of the search space all at once by identifying variable/value combinations that violate the constraints; an additional advantage is that the actions and transition model can be deduced from the problem description
-A CSP consists of three components, X, D and C:
1. X is a set of variables, {X1, ..., Xn}
2. D is a set of domains, {D1, ..., Dn}, one for each variable
3. C is a set of constraints that specify allowable combinations of values
-A domain, Di, consists of a set of allowable values, {v1, ..., vk} for variable Xi, for example a Boolean variable would have the domain {true, false}
-Each constraint Cj consists of a pair <scope,rel>, where scope is a tuple of variables that participate in the constraint and rel is a relation that defines the values those variables can take on; if X1 and X2 both have the domain {1,2,3}, then the constraint saying that X1 must be greater than X2 can be written as:
<(X1,X2), {(3,1),(3,2),(2,1)}> or as <(X1,X2), X1>X2>
-CSPs deal with assignments of values to variables; a solution to a CSP is a consistent, complete assignment
-Helpful to visualize a CSP as a constraint graph: the nodes of the graph correspond to variables of the problem, and an edge connects any two variables that participate in a constraint
4. CSPs yield a natural representation for a wide variety of problems; it is often easy to formulate a problem as a CSP
5. Years of development work have gone into making CSP solvers fast and efficient
6. A CSP solver can quickly prune large swathes of the search space that an atomic state-space searcher cannot
-In atomic state-space search we can only ask: is this specific state a goal? No? What about this one? With CSPs, once we find out that a partial assignment violates a constraint, we can immediately discard further refinements of the partial assignment
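A minimal backtracking solver for the X/D/C formulation above, on an assumed toy graph-coloring instance (variables are nodes, domains are colors, and the single constraint says neighboring nodes differ); note how a partial assignment that violates a constraint is discarded before any of its refinements are generated:

```python
def backtrack(assignment, variables, domains, constraints):
    """Minimal backtracking search: pick the first unassigned variable, try
    each value that violates no constraint, recurse, and undo on failure."""
    if len(assignment) == len(variables):
        return assignment
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        if all(ok(var, value, assignment) for ok in constraints):
            assignment[var] = value
            result = backtrack(assignment, variables, domains, constraints)
            if result is not None:
                return result
            del assignment[var]
    return None

# Assumed toy instance: color a triangle plus one pendant node so that
# neighboring nodes get different colors (a miniature map-coloring CSP).
edges = [('A', 'B'), ('B', 'C'), ('A', 'C'), ('C', 'D')]
variables = ['A', 'B', 'C', 'D']
domains = {v: ['red', 'green', 'blue'] for v in variables}

def different_neighbors(var, value, assignment):
    """The constraint set C: every edge's endpoints must take different values."""
    neighbors = [b if a == var else a for a, b in edges if var in (a, b)]
    return all(assignment.get(u) != value for u in neighbors)

solution = backtrack({}, variables, domains, [different_neighbors])
print(solution)  # → {'A': 'red', 'B': 'green', 'C': 'blue', 'D': 'red'}
```

A production solver would add variable/value ordering heuristics and constraint propagation; this sketch only shows the factored representation paying off.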
Knowledge based agents:
-Form representations of a complex world, use a process of inference to derive new representations about the world, and use these new representations to deduce what to do
-Humans seem to know things, and what they know helps them do things; in AI, knowledge-based agents use a process of reasoning over an internal representation of knowledge to decide what actions to take
-Problem-solving agents know things, but only in a very limited, inflexible sense: they know what actions are available and what the result of performing a specific action from a specific state will be, but they don't know general facts
-A route-finding agent does not know that it is impossible for a road to be a negative number of kilometers long; the knowledge such agents have is very useful for finding a path from the start to a goal, but not for anything else
-Knowledge-based agents can accept new tasks in the form of explicitly described goals; they can achieve competence quickly by being told or learning new knowledge about the environment; and they can adapt to changes in the environment by updating the relevant knowledge
-Logic is a general class of representations to support knowledge-based agents, who can combine and recombine information to suit myriad purposes
-The central component of a knowledge-based agent is its knowledge base, or KB: a set of sentences (not identical to the sentences of English and other natural languages), each expressed in a language called a knowledge representation language
-There must be a way to add new sentences to the KB and a way to query what is known; the standard operations are TELL and ASK, and both may involve inference - that is, deriving new sentences from old
function KB-AGENT(percept) returns an action
persistent: KB, a knowledge base
t, a counter, initially 0, indicating time
TELL(KB, MAKE-PERCEPT-SENTENCE(percept, t))
action <- ASK(KB, MAKE-ACTION-QUERY(t))
TELL(KB, MAKE-ACTION-SENTENCE(action, t))
t <- t+1
return action
-Each time the agent program is called, it does three things:
1. It TELLs the knowledge base what it perceives
2. It ASKS the knowledge base what action it should perform, in the process of answering the query, extensive reasoning may be done about the current state of the world, about the outcomes of possible action sequences, and so on
3. It TELLS the knowledge base which action was chosen, and returns the action so that it can be executed
-MAKE-PERCEPT-SENTENCE constructs a sentence asserting that the agent perceived the given percept at the given time
-MAKE-ACTION-QUERY constructs a sentence that asks what action should be done at the current time
>A knowledge-based agent can be built simply by TELLing it what it needs to know: starting with an empty knowledge base, the agent designer can TELL sentences one by one until the agent knows how to operate in its environment; this is known as the declarative approach to system building
>In contrast, the procedural approach encodes desired behaviors directly as program code; a successful agent often combines both declarative and procedural elements in its design
Logic:
-Knowledge bases consist of sentences, wherein a logic defines the semantics, or meaning, of the sentences; in standard logics, every sentence must be either true or false in each possible world - there is no in between
(i.e., every sentence has to come out either true or false)
-When we need to be precise, the term model is used in place of possible world; whereas possible worlds might be thought of as real environments that the agent might or might not be in, models are mathematical abstractions, each of which has a fixed truth value for every relevant sentence
-If a sentence a is true in model m, we say that m satisfies a, or sometimes that m is a model of a, and write M(a) to mean the set of all models of a
Logic reasoning:
-Logical entailment is the idea that a sentence follows logically from another sentence; we write a |= b to mean the sentence a entails the sentence b; formally, a |= b if and only if, in every model in which a is true, b is also true; we can write
a |= b if and only if M(a) subset M(b)
-The sentence x = 0 entails the sentence xy = 0: in any model where x is zero, it is the case that xy is zero, regardless of the value of y
-Think of the set of all consequences of KB as a haystack and of a as a needle; entailment is like the needle being in the haystack, and inference is like finding it; we write KB |-i a if an inference algorithm i can derive a from KB
4. An inference algorithm that derives only entailed sentences is called sound or truth preserving, an unsound procedure essentially makes things up as it goes along - announces the discovery of nonexistent needles
5. The property of completeness is also desirable: an inference algorithm is complete if it can derive any sentence that is entailed; for a finite KB, a systematic examination can always decide whether the needle is in the haystack, but for an infinite haystack of consequences, completeness becomes an issue
[so we need a representation language with a semantics, and an inference algorithm to go with it]
Propositional Logic:
-The syntax of propositional logic defines the allowable sentences; atomic sentences consist of a single proposition symbol, and each such symbol stands for a proposition that can be true or false
-Complex sentences are constructed from simpler sentences, using parentheses and operators called logical connectives
-[not, and, or, implies, if and only if]
-Think of P->Q as 'if P is true, then I am claiming that Q is true; otherwise I am making no claim', the only way for this sentence to be false is if P is true but Q is false
Simple inference procedure:
-An inference procedure decides whether KB |= a for some sentence a; simple model checking carries out logical inference by enumerating all possible models to check that a is true in every model in which KB is true, that is, M(KB) subset M(a); this works when the space of models is finite, and is a direct implementation of the definition of entailment: enumerate the models
-[the KB contains sentences over n symbols; assign true or false to each symbol, giving 2^n combinations, and check in each whether the KB is true - the KB is true in a model when every sentence in it is true]
-[then, in every model where the KB holds, check whether the query sentence over those same symbols is also true]
-Performs a recursive enumeration of finite space of assignments to symbols, is sound because it implements directly the definition of entailment, and complete because it works for any KB and a and always terminates - there are only finitely many models to examine
-If KB and a contain n symbols in all, then there are 2^n models; the time complexity of the algorithm is O(2^n) and the space is O(n), as the enumeration is depth-first; propositional entailment is co-NP-complete
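The enumeration just described can be sketched as follows; representing sentences as Python predicates over a model is an assumed encoding, chosen for brevity:

```python
from itertools import product

def tt_entails(kb, query, symbols):
    """Model-checking entailment: KB |= query iff query is true in every
    model (truth assignment to the symbols) in which KB is true; O(2^n)."""
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if kb(model) and not query(model):
            return False          # a model of KB in which the query fails
    return True

# KB: (P -> Q) and P, with P -> Q written as (not P) or Q.  Query: Q.
kb    = lambda m: ((not m['P']) or m['Q']) and m['P']
query = lambda m: m['Q']
print(tt_entails(kb, query, ['P', 'Q']))                   # → True
print(tt_entails(kb, lambda m: not m['Q'], ['P', 'Q']))    # → False
```

The loop is the haystack search made literal: it is sound and complete because it checks every one of the finitely many models.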
-Entailment can also be done by theorem proving - applying rules of inference directly to the sentences in our knowledge base to construct a proof of the desired sentence without consulting models
-Principles (concepts)
6. Logical equivalence:
-Two sentences a and b are logically equivalent if they are true in the same set of models, written a ≡ b; ≡ is used to make claims about sentences, while <-> is used as part of a sentence
-Any two sentences a and b are equivalent if and only if each of them entails the other:
a ≡ b if and only if a |= b and b |= a
7. Validity:
-A sentence is valid if it is true in all models, also known as tautologies - they are necessarily true
-From the definition of entailment:
| For any sentences a and b, (a |= b) if and only if the sentence a -> b is valid
8. Satisfiability:
-A sentence is satisfiable if it is true in, or satisfied by, some model; this can be checked by enumerating the possible models until one is found that satisfies the sentence (satisfiability is NP-complete)
-Validity and satisfiability are connected: a is valid iff ~a is unsatisfiable; contrapositively, a is satisfiable iff ~a is not valid; also
a |= b if and only if the sentence (a and ~b) is unsatisfiable
-Proving b from a by checking the unsatisfiability of (a and ~b) corresponds exactly to the standard mathematical proof technique of reductio ad absurdum, also called proof by contradiction
[relations among entailment, validity and satisfiability]
[next up: inference procedures proper, alongside model checking]
Inference and proofs:
-Inference rules that can be applied to derive a proof - a chain of conclusions that leads to the desired goal
9. Modus ponens:
10. Modus tollens:
11. Hypothetical syllogism:
12. Disjunctive syllogism:
13. Addition:
14. Simplification:
15. Conjunction:
[essentially all of the]
x. Logical Equivalences
[sense, distributive, implication elimination, contrapositive, De Morgan's]
[what about completeness?]
Resolution:
-A single inference rule, resolution, that yields a complete inference algorithm when coupled with any complete search algorithm
[the usual way of writing it:
p or q
not p or r
Hence, q or r
]
-Unit resolution:
(l1 or ... or lk), m
l1 or ... or li-1 or li+1 or ... or lk
where each l is a literal (positive or negative) and li and m are complementary literals (i.e., one is the negation of the other); thus the unit resolution rule takes a clause - a disjunction of literals - and a literal, and produces a new clause
-Can be generalized to the full resolution rule:
l1 or ... or lk, m1 or ... or mn
l1 or ... or li-1 or li+1 or ... or lk or m1 or ... or mj-1 or mj+1 or ... or mn
where li and mj are complementary literals; it says that resolution takes two clauses and produces a new clause containing all the literals of the two original clauses except the two complementary literals
-The resulting clause should contain only one copy of each literal (sometimes two pairs could be cancelled, but resolution cancels only one pair at a time); the removal of multiple copies of literals is called factoring
-A resolution-based theorem prover can, for any sentences a and b in propositional logic, decide whether a |= b
CNF:
-The resolution rule applies only to clauses (i.e., disjunctions of literals), so it is relevant only to KBs consisting of clauses
-This nonetheless leads to complete inference for all of propositional logic, as every sentence of propositional logic is logically equivalent to a conjunction of clauses
-A sentence expressed as a conjunction of clauses is said to be in conjunctive normal form, or CNF
-Procedure for CNF:
1. Eliminate <->, replacing a <-> b with (a -> b) and (b -> a)
2. Eliminate ->, replacing a -> b with not a or b
3. Move not inwards by repeated application of DeMorgans
4. Distribute or over and wherever possible
Resolution algorithm:
-Inference procedures based on resolution work by using the principle of proof by contradiction, i.e., to show KB |= a, we show that (KB and ~a) is unsatisfiable
-First (KB and ~a) is converted into CNF, then the resolution rule is applied to the resulting clauses: each pair that contains complementary literals is resolved to produce a new clause, which is added to the set if it is not already present, and the process continues until one of two things happens:
16. There are no new clauses that can be added, in which case KB does not entail a; or,
17. Two clauses resolve to yield the empty clause, in which case KB entails a
-The empty clause - a disjunction of no disjuncts - is equivalent to False, because a disjunction is true only if at least one of its disjuncts is true
[proof of completeness?]
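A sketch of the loop above, with clauses as frozensets of string literals ('~' marking negation is an assumed encoding); it answers KB |= a by refuting KB and ~a:

```python
from itertools import combinations

def negate(lit):
    return lit[1:] if lit.startswith('~') else '~' + lit

def resolve(c1, c2):
    """All resolvents of two clauses (frozensets of literals): cancel one
    complementary pair and keep every other literal; factoring is implicit
    because clauses are sets."""
    return [(c1 - {lit}) | (c2 - {negate(lit)})
            for lit in c1 if negate(lit) in c2]

def pl_resolution(kb_clauses, negated_query_clauses):
    """KB |= a iff (KB and ~a) is unsatisfiable: resolve pairs until the
    empty clause appears (entailed) or no new clause can be added (not)."""
    clauses = set(kb_clauses) | set(negated_query_clauses)
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for resolvent in resolve(c1, c2):
                if not resolvent:
                    return True        # empty clause: contradiction found
                new.add(resolvent)
        if new <= clauses:
            return False               # fixed point with no contradiction
        clauses |= new

c = lambda *lits: frozenset(lits)
kb = [c('P', 'Q'), c('~P', 'R'), c('~Q', 'R')]      # (P∨Q) ∧ (¬P∨R) ∧ (¬Q∨R)
print(pl_resolution(kb, [c('~R')]))   # → True: the KB entails R
print(pl_resolution(kb, [c('R')]))    # → False: the KB does not entail ~R
```

Termination is guaranteed because only finitely many clauses can be built from a finite set of symbols, so the clause set must eventually reach a fixed point.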
Horn clauses:
-Some real-world knowledge bases satisfy certain restrictions on the form of sentences they contain, which enables them to use a more restricted and efficient inference algorithm
18. Definite clause:
-Which is a disjunction of literals of which exactly one is positive
19. Horn clause:
-Slightly more general, which is a disjunction of literals of which at most one is positive,
-All definite clauses are Horn clauses
-If you resolve two Horn clauses, you get back a Horn clause
>Every definite clause can be written as an implication whose premise is a conjunction of positive literals and whose conclusion is a single positive literal, which is easier to understand; a sentence consisting of a single positive literal L can also be written in implication form as True -> L
>Inference with Horn clauses can be done through forward-chaining and backward-chaining algorithms, both of which are natural in that inference steps are obvious and easy for humans to follow, is basis for logic programming
>Deciding entailment with Horn clauses can be done in time that is linear in the size of the knowledge base - a pleasant surprise
Forward and backward chaining:
Forward:
-Determines if a single proposition symbol q - the query - is entailed by a knowledge base of definite clauses
-Begins from known facts (positive literals) in the knowledge base, if all premises of an implication are known, then its conclusion is added to the set of known facts
(this relies on the earlier point that definite clauses can be rewritten in implication form)
-The process continues until the query q is added or until no further inferences can be made
-[one good worked example is worth memorizing]
-The knowledge base can be drawn as an AND-OR graph, where multiple edges joined by an arc indicate a conjunction while multiple edges without an arc indicate a disjunction
-The known leaves are set, and inference propagates up the graph as far as possible; wherever a conjunction appears, the propagation waits until all the conjuncts are known before proceeding
Performances:
1. Is sound: every inference is essentially an application of Modus Ponens
2. Is complete: every entailed atomic sentence will be derived
-Consider the final state of the inferred table after the algorithm reaches a fixed point where no new inferences are possible: the table contains true for each symbol inferred during the process, and false for all other symbols
-Forward chaining is an example of the general concept of data-driven reasoning—that is, reasoning in which the focus of attention starts with the known data
-It can be used within an agent to derive conclusions from incoming percepts, often without a specific query in mind
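The count-down propagation described above can be sketched as follows, using a standard small definite-clause KB (P from L and M, and so on) with known facts A and B:

```python
from collections import deque

def forward_chain(clauses, facts, query):
    """Forward chaining over definite clauses given as (premises, conclusion):
    count down a clause's unknown premises as facts arrive, and fire its
    conclusion when the count reaches zero."""
    count = {i: len(premises) for i, (premises, _) in enumerate(clauses)}
    inferred = set()
    agenda = deque(facts)              # known facts (positive literals)
    while agenda:
        p = agenda.popleft()
        if p == query:
            return True
        if p in inferred:
            continue
        inferred.add(p)
        for i, (premises, conclusion) in enumerate(clauses):
            if p in premises:
                count[i] -= 1
                if count[i] == 0:      # all premises known: fire the rule
                    agenda.append(conclusion)
    return False

# A standard small example: P=>Q, (L and M)=>P, (B and L)=>M, (A and P)=>L,
# (A and B)=>L, with known facts A and B.
clauses = [({'P'}, 'Q'), ({'L', 'M'}, 'P'), ({'B', 'L'}, 'M'),
           ({'A', 'P'}, 'L'), ({'A', 'B'}, 'L')]
print(forward_chain(clauses, ['A', 'B'], 'Q'))   # → True
print(forward_chain(clauses, ['A'], 'Q'))        # → False
```

Each symbol is processed at most once and each premise decremented at most once, which is where the linear-time bound comes from.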
Backward:
-The backward-chaining algorithm works backward from the query: if the query q is known to be true, then no work is needed
-Otherwise, the algorithm finds those implications in the KB whose conclusion is q, if all the premises of one of those implications can be proved by backward chaining recursively, then q is true
-When applied to query q, it works back down the graph until it reaches a set of known facts A and B, that forms the basis for a proof
-Is a form of goal-directed reasoning, useful for answering specific questions
-Often, the cost of backward chaining is much less than linear in the size of the knowledge base, because the process touches only relevant facts.
......
First-Order Logic:
-...
-Propositional logic assumes that there are facts that either hold or do not hold in the world; each fact can be in one of two states - true or false - and each model assigns true or false to each proposition symbol
-FOPL assumes more; namely that the world consists of objects with certain relations among them that do or do not hold
-A relation is just the set of tuples of objects that are related
-The crown is on King John's head so the 'on head' relation contains just one tuple, <the crown, King John>
-Objects can also have properties, such as the property of being a person
-Certain kinds of relationships are best considered as functions, in that a given object must be related to exactly one object in this way; for example, each person has one left leg, so the model has a unary 'left leg' function - a mapping from a one-element tuple to an object - that includes the following mappings:
<Richard> -> Richard's left leg
<King John> -> John's left leg
-Straight to inference:
>To make inferences we could convert the FOPL sentences to propositional logic, where the first step is to eliminate universal quantifiers by instantiating the quantified sentence for each object in the KB
>Then eliminate existential quantifiers by substituting a constant that does not appear elsewhere in the KB; the more general process is called Skolemization
>The problem here is that universal quantification can generate infinitely many sentences
>But a result exists that mitigates this - [...]
---Okay, now for another rule:
>Instead of all that trouble: if there is some substitution theta that makes each of the conjuncts in the premise of an implication identical to sentences already in the KB, then we can assert the conclusion of the implication after applying theta (this is the lifted rule Generalized Modus Ponens)
>Unification is the process of making logical expressions look identical
-Lifted inference rules require finding substitutions that make different logical expressions look identical; the process is called unification and is a key component
-The UNIFY algorithm takes two sentences and returns a unifier for them (a substitution) if one exists
-UNIFY(p,q) = theta where SUBST(theta, p) = SUBST(theta, q)
-For example: we want to know all the objects whom John knows, i.e.
find x such that Knows(John, x)
-It could be done by unifying:
UNIFY(Knows(John, x), Knows(John, Jane))
UNIFY(Knows(John, x), Knows(y, Bill)) and so on
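A minimal UNIFY in Python for the Knows examples above; the term encoding (lowercase strings as variables, capitalized strings as constants, tuples as compound terms) is an assumption, and the occurs check is omitted for brevity:

```python
def is_var(t):
    """Assumed encoding: lowercase strings are variables, capitalized strings
    are constants, tuples are compound terms like ('Knows', 'John', 'x')."""
    return isinstance(t, str) and t[:1].islower()

def unify(x, y, theta=None):
    """Return a substitution (dict) making x and y identical, or None if no
    unifier exists."""
    if theta is None:
        theta = {}
    if x == y:
        return theta
    if is_var(x):
        return unify_var(x, y, theta)
    if is_var(y):
        return unify_var(y, x, theta)
    if isinstance(x, tuple) and isinstance(y, tuple) and len(x) == len(y):
        for xi, yi in zip(x, y):
            theta = unify(xi, yi, theta)
            if theta is None:
                return None
        return theta
    return None

def unify_var(var, t, theta):
    if var in theta:
        return unify(theta[var], t, theta)
    if is_var(t) and t in theta:
        return unify(var, theta[t], theta)
    return {**theta, var: t}           # extend the substitution

print(unify(('Knows', 'John', 'x'), ('Knows', 'John', 'Jane')))
print(unify(('Knows', 'John', 'x'), ('Knows', 'y', 'Bill')))
```

The first call yields {'x': 'Jane'} and the second {'y': 'John', 'x': 'Bill'}, matching the two UNIFY examples above.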
20. First step is to convert sentences to CNF, key is that every sentence of first order logic can be converted into an inferentially equivalent CNF sentence
21. Procedure is same but the difference arises from the need to eliminate the existential quantifiers
Example:
>Everyone who loves all animals is loved by someone:
>For all x [for all y Animal(y) -> Loves(x,y)] -> [there exists y Loves(y,x)]
A. Elimiate implications: same way
B. Move negation inwards
becomes
>Either there is some animal that x does not love, or (if that is not the case) someone loves x
>Standardize variables: quantified variables are dummies, so the same name may be repeated across sentences; rename them so that each quantifier uses a distinct variable
>SKOLEMIZE:
Skolemization is the process of removing existential quantifiers by elimination
-that is, an existential quantifier inside the scope of a universal quantifier cannot simply be removed
-if we have: for all x [there exists y Loves(x,y)], we cannot just remove the y part
-instead we must write: for all x Loves(x, F(x))
-Here F is a Skolem function; the general rule is that the arguments of the Skolem function are all the universally quantified variables in whose scope the existential quantifier appears
>Drop the universal quantifiers
>Distribute or over and:
Fuzzy logic:
-Fuzzy logic is a mathematical framework that allows for reasoning with uncertain or imprecise information; it is a form of multi-valued logic that allows for values between true and false, which are represented as degrees of membership in a fuzzy set
-Fuzzy sets introduce a certain amount of vagueness to reduce the complexity of comprehension; a set consists of elements that signify the degree or grade of membership to a fuzzy aspect
-Three example sets: child, young, old; each is now a fuzzy set, and people of different ages have some degree or grade of membership in each
-Table: |AGE | CHILD | YOUNG | OLD|
-Operations:
1. Union: take the maximum of the two memberships, u_{A union B}(x) = max(u_A(x), u_B(x)); similarly, intersection takes the minimum, u_{A intersect B}(x) = min(u_A(x), u_B(x))
Advantages:
22. Ability to handle uncertainty: can handle uncertain, vague, or incomplete information, which is common in many real-life settings
23. Flexibility: provides a flexible framework for decision making by allowing degrees of truth or membership in a fuzzy set
24. Robustness: robust in the face of variability in input data, where crisp logic is sensitive to it
25. Natural language processing: can interpret natural-language terms, which are usually vague
26. Intuitive reasoning: supports human-like reasoning
27. Cost-effective solutions: reduces complexity
-Steps, with the example of a water cooler where you need to control the temperature of the room: the temperature and the fan speed are given, and we need to adjust the water flow rate
1. Fuzzification:
-When real data is available, we define profiles for each of these parameters by assigning memberships to their respective values
-For example, we may say that when the temperature is 25 degrees its membership in the fuzzy set MODERATE is 1 (100%), but as we drift away from 25 degrees its membership in this set decreases while its membership in the set WARM starts to increase
-When the sensors report the values of temperature and fan speed, they are mapped, based on their memberships, to the respective fuzzy regions they belong to
2. Fuzzy rules and engine:
-The fuzzy rules form the triggers of the fuzzy engine; after a study of the system we can write linguistic rules, akin to natural language, such as:
1. If temperature is HOT and fan speed is LOW then flow rate is POSITIVE
2. If temperature is COOL and fan speed is LOW then flow rate is NEGATIVE
3. Defuzzification:
-The fuzzy outputs LOW-POSITIVE, POSITIVE, and HIGH-POSITIVE are to be converted to a single crisp value, which can then be delivered to the final actuator of the pump
-(Show the MIN-based defuzzification example.)
-Fuzzy logic is widely applied to control applications
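A minimal Python sketch of the fuzzify / rules / defuzzify pipeline for the water-cooler example; the triangular membership profiles and the crisp rule outputs (+1 for POSITIVE, -1 for NEGATIVE) are assumptions, not values from the notes:

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def control_flow_rate(temp, fan_speed):
    # 1. Fuzzification: map crisp sensor readings to membership degrees
    hot     = tri(temp, 25, 35, 45)       # assumed HOT profile
    cool    = tri(temp, 5, 15, 25)        # assumed COOL profile
    fan_low = tri(fan_speed, 0, 0, 50)    # assumed LOW fan-speed profile
    # 2. Rule evaluation (AND = min), each rule fires to a degree
    positive = min(hot, fan_low)          # IF temp HOT and fan LOW THEN flow POSITIVE
    negative = min(cool, fan_low)         # IF temp COOL and fan LOW THEN flow NEGATIVE
    # 3. Defuzzification: weighted average of crisp rule outputs (+1 / -1)
    total = positive + negative
    return 0.0 if total == 0 else (positive * 1.0 + negative * -1.0) / total

print(control_flow_rate(30, 10))   # fires the HOT/LOW rule -> positive flow
print(control_flow_rate(15, 10))   # fires the COOL/LOW rule -> negative flow
```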
Knowledge representation: -(Another book matches the syllabus, though)
-Knowledge is a collection of beliefs that an AI agent holds about the world; in other words, it refers to the information or understanding that is acquired through learning or experience
-It is a complex and multifaceted concept that encompasses many different types of knowledge, including factual knowledge (e.g., Paris is the capital of France), procedural knowledge (e.g., how to ride a bike), and conceptual knowledge (e.g., the concept of justice)
-(The point of focusing on knowledge is that humans constantly use knowledge, not just algorithmic procedures)
-Knowledge representation is a scheme to represent diverse facts about the real world in a form that can be used to reason and solve problems
-By representing knowledge in a structured and formal way, AI systems can reason, infer, and make predictions based on that knowledge, which leads to more efficiency as well as the ability to learn and adapt to new situations
-Particular knowledge representation models allow for more specific, more powerful problem-solving mechanisms that operate on them
-(A figure on semantics is needed here)
-A good system for the representation of knowledge in a particular domain should possess the following properties:
1. Representational adequacy: the ability to represent all of the kinds of knowledge that are needed in that domain
2. Inferential adequacy: the ability to manipulate the representational structures in such a way as to derive new structures corresponding to new knowledge inferred from the world
3. Inferential efficiency: the ability to incorporate into the knowledge structure additional information that can be used to focus the attention of the inference mechanisms in the most promising directions
4. Acquisitional efficiency: the ability to acquire new information easily; the simplest case involves direct insertion, by a person, of new knowledge into the database, but ideally the program itself would be able to control knowledge acquisition
-Approaches:
1. Simple relational knowledge
-The simplest way to represent declarative facts is as a set of relations of the same sort used in database systems
-(Make a table of player name, height, weight, and skills)
-The reason this is simple is that, standing alone, it provides very weak inferential capabilities; however, knowledge represented in this form may serve as the input to more powerful inference engines
-Providing support for relational knowledge is what database systems are designed to do
Pros:
-Can represent complex relationships between entities or objects
-Used in databases to efficiently store and retrieve information
Cons:
-Weak inferential capabilities
-May not be suitable for representing certain types of knowledge, such as rules or procedures
2. Inheritable knowledge:
-Knowledge is represented as a hierarchy of concepts, with each concept inheriting properties from its parent concept
-(Here, an example of additional baseball knowledge inserted into a structure arranged this way)
-Lines represent attributes; boxed nodes represent objects and values of attributes of the objects. These values can also be viewed as objects with attributes and values, and so on
-It may also be called a semantic network or a collection of frames
-The general attributes are isa, which is used to show class inclusion, and instance, which is used to show class membership
-To respond to a query, follow the instance lines, or go up the isa hierarchy
Pros:
-Easy to organize and classify information
-Reuse of knowledge; a structured, modular approach
Cons:
-May not be suitable for certain knowledge, such as rules or procedures
3. Inferential:
-Knowledge is represented as a set of rules or procedures for making inferences or drawing conclusions
-For example, in a medical diagnosis system, the system can infer a patient's condition based on the symptoms they exhibit
-This is essentially our first-order logic, in which we can perform inference by resolution with refutation (proof by unsatisfiability)
-(The basketball example works here again)
Pros:
-Resolution-based inference is readily available
-Widely used to learn from data and improve accuracy over time
-Mathematical background and theorem support
Cons:
-May not be suitable for some kinds of knowledge, such as relationships between entities
-Does not allow uncertainty (though this criticism applies to all of these approaches)
4. Procedural:
-Knowledge is represented as a set of procedures or algorithms for performing specific tasks
-Can be represented in many ways, commonly as code in some programming language, such as LISP, for doing something
-The machine uses the knowledge when it executes the code to perform a task
-Gets low scores with respect to the property of inferential adequacy, because it is very difficult to write a program that can reason about another program's behavior
-A commonly used technique for representing procedural knowledge in AI is the use of production rules
-Production rules, particularly ones that are augmented with information on how they are to be used, are more procedural than the other representation methods
Pros:
-Good for repetitive tasks
Cons:
-Poor inferential adequacy, because such reasoning is difficult to write
ISSUES:
-Are any attributes of objects so basic that they occur in almost every problem domain? If such attributes exist, what are they, and how do we make sure that they are handled properly in each of the mechanisms we propose?
-Are there any important relationships that exist among attributes of objects?
-At what level should knowledge be represented? Is there a set of primitives into which all knowledge can be broken down? Is it helpful to use such primitives?
-How should sets of objects be represented?
-Given a large amount of knowledge stored in a database, how can relevant parts be accessed when they are needed?
Semantic nets:
-A semantic net is a knowledge base that represents relations between concepts in a network
-Information is represented as a set of nodes connected to each other by labelled arcs, which represent relationships among the nodes
-The main idea is that the meaning of a concept comes from the ways in which it is connected to other concepts
-(The basketball example above serves as an example)
-The network contains examples of both the isa and instance relations, as well as some other domain-specific relations
-(Show an example built up from a hierarchy)
-Relationship between logic and semantic nets:
1. Semantic nets are a natural way to represent relationships that would appear as ground instances of binary predicates in predicate logic, e.g.:
how to represent isa(person, mammal)
how to represent team(ssdfs, jsdlfsd)
2. But knowledge represented by predicates of other arities can also be represented as semantic nets; for example,
man(Marcus) as instance(Marcus, man)
3. Predicates of three or more places can also be converted to binary form by creating one new object representing the entire predicate and then introducing binary predicates to describe the relationship of each of the original arguments to this new object
In score(Nepal, SriLanka, 1-1) there are now five objects:
a game-name node at the center, connected by isa to game, with the other arguments surrounding the game name
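The reification of a three-place predicate into binary arcs can be sketched as a small labelled-edge store; the game name G23 and the relation names are hypothetical:

```python
# Reifying score(Nepal, SriLanka, 1-1): a new node "G23" (a hypothetical
# game name) stands for the whole predicate, linked by binary relations.
edges = [
    ("G23", "isa", "game"),
    ("G23", "home_team", "Nepal"),
    ("G23", "visiting_team", "SriLanka"),
    ("G23", "score", "1-1"),
]

def query(node, relation):
    """Follow a labelled arc from a node, as one would in a semantic net."""
    return [t for (s, r, t) in edges if s == node and r == relation]

print(query("G23", "score"))  # ['1-1']
print(query("G23", "isa"))    # ['game']
```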
Frames:
-A frame is a collection of attributes, usually called slots, and associated values (and possibly constraints on those values) that describe some entity in the world
-Sometimes a frame describes an entity in some absolute sense; sometimes it represents the entity from a particular point of view
-A single frame alone is rarely useful; instead, we build frame systems out of collections of frames that are connected to each other by virtue of the fact that the value of an attribute of one frame may be another frame
-Set theory provides a good basis for understanding frame systems: a frame represents either a set or an instance (an element of a set)
-(Consider the example from above)
-Frames add a third dimension to semantic nets by allowing nodes to have structure
-Each piece of information about a particular frame is held in a slot, which can contain:
1. Facts or data: values
2. Procedures: IF-NEEDED: deferred evaluation; IF-ADDED: updates linked information
3. Default values: for data and for procedures
4. Other frames or subframes
-Because a class represents a set, there are two kinds of attributes that can be associated with it: attributes about the set itself, and attributes that are to be inherited by each element of the set
-(For an example, make a list and also show it diagrammatically)
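A minimal sketch of a frame system with isa inheritance; the frame names and slot values echo the baseball example but are otherwise assumptions:

```python
# Each frame is a dict of slots; the special "isa" slot points to the parent
# frame, so unlisted slots are inherited by climbing the hierarchy.
frames = {
    "Person":          {"isa": None, "legs": 2},
    "Adult-Male":      {"isa": "Person", "height": "5-10"},
    "Baseball-Player": {"isa": "Adult-Male", "bats": "right"},
    "Pee-Wee-Reese":   {"isa": "Baseball-Player", "team": "Brooklyn-Dodgers"},
}

def get_slot(frame, slot):
    """Look up a slot, climbing the isa hierarchy until a value is found."""
    while frame is not None:
        if slot in frames[frame]:
            return frames[frame][slot]
        frame = frames[frame]["isa"]
    return None

print(get_slot("Pee-Wee-Reese", "legs"))  # 2, inherited from Person
print(get_slot("Pee-Wee-Reese", "team"))  # Brooklyn-Dodgers, stored locally
```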
Conceptual dependency and scripts:
-Conceptual dependency (CD) is a theory of how to represent the kind of knowledge about events that is usually contained in natural language sentences
-The goal is to represent knowledge in a way that:
>Facilitates drawing inferences from the sentences
>Is independent of the language in which the sentences were originally written
-A CD representation of a sentence is built not out of primitives corresponding to the words used in the sentence, but rather out of conceptual primitives that can be combined to form the meanings of words in any particular language
-The advantage over semantic nets is that CD provides both a structure and a specific set of primitives, at a particular level of granularity, out of which representations of particular pieces of information can be constructed
-"I gave the man a book" can be converted as:
(diagram; explain the meaning of the symbols)
-Some typical primitives:
ATRANS: transfer of an abstract relationship (give advice)
PTRANS: transfer of the physical location of an object (give a book)
MTRANS: transfer of mental information (tell)
SPEAK: production of sounds (speak)
MOVE: movement of a body part by its owner
GRASP: grasping of an object by an actor
Scripts:
-A script is a structure that describes a stereotyped sequence of events in a particular context
-A script consists of a set of slots; associated with each slot may be some information about what kinds of values it may contain, as well as a default value to be used if no other information is available
-In a script, as compared to a frame, we can make some more precise statements about its structure
-Important components of scripts:
1. Entry conditions: conditions that must, in general, be satisfied before the events described in the script can occur
2. Results: conditions that will be true after the events described in the script have occurred
3. Track: a specific variation on the more general pattern that is represented by this particular script
4. Props: slots representing the objects that are involved in the events described in the script
5. Roles: slots representing the people who are involved in the events described in the script
6. Scenes: the actual sequences of events
Bayesian statistics: (from the other book, though)
-For some kinds of problem solving it is useful to be able to describe beliefs that are not certain but for which there is some supporting evidence:
1. The first class contains problems in which there is genuine randomness in the world; playing card games such as bridge or blackjack is a good example. Although in these problems it is not possible to predict the world with certainty, some knowledge about the likelihood of various outcomes is available, and we would like to exploit it
2. The second class contains problems that could in principle be modeled exactly, but in which there are many more possible exceptions than we can explicitly enumerate. Most common-sense tasks fall into this category, as do many expert reasoning tasks such as medical diagnosis
-An important goal for many problem-solving systems is to collect evidence as the system goes along and to modify its behavior on the basis of that evidence; to model this behavior we need a statistical theory of evidence, known as Bayesian statistics
-The fundamental notion of Bayesian statistics is that of conditional probability
-Bayes' theorem states that:
\[
P(H_i \mid E) = \frac{P(E \mid H_i)\, P(H_i)}{\sum_{n=1}^{k} P(E \mid H_n)\, P(H_n)}
\]
where,
$P(H_i \mid E)$ = the probability that hypothesis $H_i$ is true given evidence $E$
$P(E \mid H_i)$ = the probability that we will observe evidence $E$ given that $H_i$ is true
$P(H_i)$ = the a priori probability that hypothesis $H_i$ is true in the absence of any specific evidence
$k$ = the number of possible hypotheses
-Real-life example:
Suppose, for example, that we are interested in examining the geological evidence at a particular location to determine whether it would be a good place to dig to find a desired mineral
-If we know the prior probabilities of finding each of the various minerals, and we know the probabilities that, if a mineral is present, certain physical characteristics will be observed, then we can use Bayes' formula to compute, from the evidence we collect, how likely it is that the various minerals are present
-Bayesian statistics provide an attractive basis for an uncertain-reasoning system; as a result, several mechanisms, such as Bayesian networks, have been developed for exploiting its power while at the same time keeping it tractable
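The mineral example can be sketched numerically; every probability below is a made-up illustration:

```python
# Two hypotheses (mineral present / absent) and one piece of geological
# evidence E. All numbers are assumed for illustration.
priors     = {"mineral": 0.1, "no_mineral": 0.9}   # P(Hi)
likelihood = {"mineral": 0.8, "no_mineral": 0.2}   # P(E | Hi)

def posterior(h):
    """P(Hi | E) = P(E | Hi) P(Hi) / sum_n P(E | Hn) P(Hn)"""
    evidence = sum(likelihood[k] * priors[k] for k in priors)
    return likelihood[h] * priors[h] / evidence

print(round(posterior("mineral"), 3))  # 0.308: evidence raised 0.1 to ~0.31
```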
Bayesian networks:
-A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG)
-Each node in the graph represents a random variable, and the directed edges between nodes represent probabilistic dependencies between them
-Full specification:
1. Each node corresponds to a random variable, which may be discrete or continuous
2. Directed links or arrows connect pairs of nodes; if there is an arrow from node X to node Y, X is said to be a parent of Y
3. Each node $X_i$ has associated probability information $\theta(X_i \mid Parents(X_i))$ that quantifies the effect of the parents on the node using a finite number of parameters
-The intuitive meaning of an arrow is typically that X has a direct influence on Y, which suggests that causes should be parents of effects
-Bayesian networks are often used for probabilistic inference: given evidence about some variables, we can compute the posterior probability of other variables
-For example, a Bayesian network could represent the probabilistic relationship between diseases and symptoms; given symptoms, the network can be used to compute the probabilities of the presence of various diseases
-(A nice example: an alarm is installed at a house to detect burglary, but it is occasionally triggered by earthquakes; when the alarm sounds, the two neighbours John and Mary may call)
-(Diagram here with the conditional probability values)
-Semantics: assume that the Bayes net contains n variables $X_1, \ldots, X_n$; a generic entry in the joint distribution is then $P(X_1 = x_1 \text{ and } \ldots \text{ and } X_n = x_n)$, or $P(x_1, \ldots, x_n)$ for short. The semantics of Bayes nets defines each entry in the joint probability as follows:
\[
P(x_1, \ldots, x_n) = \prod_{i=1}^{n} \theta(x_i \mid parents(X_i))
\]
-In this example, the probability that the alarm has sounded but neither a burglary nor an earthquake has occurred, and both John and Mary call, is the product of the corresponding entries
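A sketch of that joint-probability computation for the alarm network; the conditional probability values below are illustrative assumptions, since the diagram's values are not given in the notes:

```python
# Assumed conditional probability tables for the burglary/alarm network.
P_b = 0.001                        # P(Burglary)
P_e = 0.002                        # P(Earthquake)
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
P_j = {True: 0.90, False: 0.05}    # P(JohnCalls | Alarm)
P_m = {True: 0.70, False: 0.01}    # P(MaryCalls | Alarm)

# P(j, m, a, ~b, ~e) = P(~b) P(~e) P(a | ~b, ~e) P(j | a) P(m | a)
p = (1 - P_b) * (1 - P_e) * P_a[(False, False)] * P_j[True] * P_m[True]
print(p)  # about 0.00063
```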
Causal networks:
-Causal networks are a restricted class of Bayesian networks that forbids all but causally compatible orderings
-Consider the simplest Bayesian network imaginable, a single arrow Fire -> Smoke. It tells us that the variables Fire and Smoke may be dependent, so one needs to specify the prior P(Fire) and the conditional probability P(Smoke|Fire) in order to specify the joint distribution P(Fire, Smoke). However, this distribution can be represented equally well by the reverse arrow Fire <- Smoke, using the appropriate P(Smoke) and P(Fire|Smoke) computed from Bayes' rule
-The idea that these two networks are equivalent, and hence convey the same information, evokes discomfort and even resistance: how could they convey the same information when we know that Fire causes Smoke and not the other way around?
-We ask which responds to which: Smoke to Fire, or Fire to Smoke?
-Basic definitions:
>At their core, causal networks consist of nodes and edges; nodes represent the random variables of interest, and an edge represents a causal relationship between them. The direction of the edge indicates the direction of causality, with the arrow pointing from the cause to the effect
-Causal networks can also handle complex interactions between variables, such as feedback loops and indirect effects, by representing them as multiple causal paths
-The do-calculus is a mathematical framework for reasoning about causal effects in causal networks; it allows us to estimate the causal effect of a variable on an outcome when an intervention has been applied to that variable
-The do-operator, denoted do(X = x), is used to represent the effect of an intervention on a variable X: it sets the value of X to x regardless of the value X would have taken in the absence of the intervention
-One of the key rules in the do-calculus is the backdoor criterion, which states that if there is a backdoor path between the intervention variable and the outcome variable, we need to control for the confounding variables along the path to estimate the causal effect of the intervention
LEARNING (from another book):
-Learning denotes changes in the system that are adaptive in the sense that they enable the system to do the same task, or tasks drawn from the same population, more efficiently and more effectively the next time
-Learning covers a wide range of phenomena. At one end of the spectrum is skill refinement: people get better at many tasks simply by practicing
-At the other end of the spectrum lies knowledge acquisition: many AI programs draw heavily on knowledge as their source of power
1. Rote learning:
-
2. Learning from advice:
-
3. Learning by problem solving:
-
4. Learning from examples:
-The AI agent is presented with a set of training examples, each consisting of input attributes and a corresponding output value; the learner then tries to induce a general function or rule that can predict the output value for new inputs
-Example: handwritten digits; other kinds of learning would be hard here because it is hard to define the classes themselves
-One common approach to inductive learning is decision tree learning. A decision tree is a tree-like structure where each internal node represents a test on a specific input attribute, each branch represents an outcome of the test, and each leaf represents a class label
-To learn a decision tree, we start with the root node and recursively split the data based on the most informative attribute at each level, until we reach a leaf node for each class
-Here is how the decision tree learning process for digit identification might work:
1. Start with the root node, which represents the entire dataset
2. Find the most informative attribute (pixel) that best separates the data into classes
3. Create a child node for each possible outcome of the test, e.g., if pixel value > 0.5 go left, otherwise go right
4. Recursively repeat steps 2-3 for each child node until all data are correctly classified or some stopping criterion is met
-After learning the decision tree, we can use it to classify new, unseen images by following the path from the root to the appropriate leaf node based on the pixel values
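The attribute-selection step (step 2 above) can be sketched on a fabricated toy dataset of three-pixel "images":

```python
# Four toy "images" of 3 pixels each, labelled with a digit class.
# The dataset is invented purely to illustrate attribute selection.
data = [  # (pixels, label)
    ([0.9, 0.1, 0.2], "1"),
    ([0.8, 0.2, 0.1], "1"),
    ([0.1, 0.9, 0.8], "0"),
    ([0.2, 0.7, 0.9], "0"),
]

def split_accuracy(feature):
    """Accuracy of the test 'pixel[feature] > 0.5 -> predict class 1'."""
    correct = sum(1 for px, y in data
                  if ("1" if px[feature] > 0.5 else "0") == y)
    return correct / len(data)

# Step 2: pick the most informative attribute (here, by accuracy of its split)
best = max(range(3), key=split_accuracy)
print(best, split_accuracy(best))  # 0 1.0 -> pixel 0 separates the classes
```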
6. Analogy:
-Analogy is a type of learning in which an algorithm learns by recognizing and applying analogies between different domains
-The basic idea is that if two domains share similar relationships between objects or concepts, then knowledge from one domain can be applied to the other domain using analogical reasoning
-Steps:
1. Retrieve: the algorithm searches for a source analog, a known problem that shares similarities with the current problem
2. Mapping: the algorithm identifies the mapping between the source analog and the current problem; this involves identifying correspondences between the objects, relations, and properties in the two domains
3. Transfer: the algorithm applies the knowledge from the source analog to the current problem by transferring the mappings to create a solution
4. Evaluate: the algorithm evaluates the solution to determine whether it is correct and makes any necessary adjustments
-One of the key challenges of analogy-based learning is identifying appropriate source analogs and determining the appropriate mappings between the two domains
-It requires significant domain knowledge and expertise on the part of the algorithm designer
-(For an example, the angle-and-line relation built by the theorem prover serves)
7. Explanation based learning
-
ANOTHER TYPE OF CLASSIFICATION:
1. Supervised learning:
-A type of machine learning in which an AI system learns from labelled data, that is, data that is already annotated with the correct output
-The goal of supervised learning is to learn a general rule that maps inputs to outputs, which can then be used to predict the output for new, unseen inputs
-Can be further divided into:
>Regression: the output is a continuous value; the goal is to predict a value based on a set of input features, for example the price of a house based on its size
>Classification: the output is a discrete value; the goal is to predict a class label based on a set of input features, for example whether an email is spam
Mathematical:
-We have a training set consisting of input features X and output labels Y. We want to learn a function f(X) that maps the input features to the output labels. We do this by minimizing the error between the predicted output and the true output, typically using a loss function such as mean squared error (MSE)
Example:
-Image classification: whether an image shows a dog, cat, or bear
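A minimal sketch of supervised regression: fitting f(x) = w*x + b by gradient descent on the MSE; the dataset, learning rate, and iteration count are assumptions:

```python
# Toy labelled data following the rule y = 2x (e.g., house size -> price).
X = [1.0, 2.0, 3.0, 4.0]
Y = [2.0, 4.0, 6.0, 8.0]

w, b, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    # gradients of MSE = (1/n) * sum (w*x + b - y)^2 w.r.t. w and b
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(X, Y)) / len(X)
    db = sum(2 * (w * x + b - y) for x, y in zip(X, Y)) / len(X)
    w, b = w - lr * dw, b - lr * db

print(round(w, 2), round(b, 2))  # converges close to w = 2.0, b = 0.0
```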
2. Unsupervised learning:
-Unsupervised learning is a type of machine learning in which the AI system learns from unlabeled data, which means data without any predefined output. The goal of unsupervised learning is to learn the underlying structure and patterns in the data.
-Unsupervised learning can be further divided into two subcategories:
Clustering: In clustering, the goal is to group similar data points together. For example, grouping customers with similar purchasing behavior into different segments.
Dimensionality Reduction: In dimensionality reduction, the goal is to reduce the number of features of the data while retaining as much information as possible. For example, reducing the number of features of an image while retaining its essential characteristics.
Mathematical:
-In unsupervised learning, we have a training set consisting of input features X, but no output labels y. The goal is to find patterns or structure in the data. This is typically done using techniques like clustering or principal component analysis (PCA).
Example:
-An example of unsupervised learning is market basket analysis. Given a dataset of customer transactions, the AI system can learn to group together items that are frequently purchased together (e.g., chips and salsa) to help retailers better understand customer behavior and optimize store layouts.
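A minimal clustering sketch: one-dimensional k-means with k = 2 on fabricated data; the points and initial centers are assumptions:

```python
# Six fabricated 1-D data points that form two obvious groups.
points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centers = [0.0, 10.0]  # assumed initial cluster centers

for _ in range(10):
    # assignment step: each point joins its nearest center
    clusters = [[], []]
    for p in points:
        clusters[min((0, 1), key=lambda i: abs(p - centers[i]))].append(p)
    # update step: move each center to the mean of its cluster
    centers = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]

print(centers)  # roughly [1.0, 8.0]
```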
3. Reinforcement learning:
-Reinforcement learning is a type of machine learning in which the AI system learns by interacting with an environment and receiving feedback in the form of rewards or penalties. The goal of reinforcement learning is to learn an optimal policy that maximizes the expected reward.
Mathematical:
-In reinforcement learning, the AI system interacts with the environment through a sequence of actions. The system receives a state from the environment, takes an action based on that state, and receives a reward or penalty based on the action taken. The goal is to learn an optimal policy that maps states to actions that maximizes the expected reward. This is typically done using techniques like Q-learning or policy gradient methods.
Example:
-A robot that navigates through a maze
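A minimal Q-learning sketch for a toy one-dimensional maze; the states, reward, and all parameters are invented for illustration:

```python
import random

random.seed(0)
n_states, actions = 4, [-1, +1]      # states 0..3; actions: left, right
Q = [[0.0, 0.0] for _ in range(n_states)]
alpha, gamma, eps = 0.5, 0.9, 0.2    # learning rate, discount, exploration

for _ in range(500):                 # episodes; reward 1 for reaching state 3
    s = 0
    while s != 3:
        # epsilon-greedy action choice
        a = random.randrange(2) if random.random() < eps else Q[s].index(max(Q[s]))
        s2 = min(max(s + actions[a], 0), n_states - 1)
        r = 1.0 if s2 == 3 else 0.0
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print([round(max(q), 2) for q in Q])  # state values grow toward the goal
```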
7 (continuation): Neural-net learning and genetic learning:
-Efforts in machine learning to mimic animal learning at a neural level
-Neural network models are based on a computational brain metaphor; a number of other techniques, called genetic algorithms, are inspired by evolution
GENETIC LEARNING:
-A genetic algorithm is a metaheuristic inspired by natural selection that belongs to the larger class of evolutionary algorithms
-Commonly used to generate high-quality solutions to optimization and search problems by relying on biologically inspired operators such as mutation, crossover, and selection
-Generate a set of random solutions and make them compete in an arena where only the fittest survive; each solution in the set is equivalent to a chromosome, and a set of such solutions (chromosomes) forms a population
-The algorithm then uses three basic genetic operators, reproduction (selection), crossover, and mutation, together with a fitness function, to evolve a new population, i.e., the next generation
-Starting from a random set of solutions, the algorithm uses these operators and the fitness function to guide its search for the optimal solution
-The fitness function gauges how good the solution in question is and provides a measure of its adaptability or survivability
1. Initialization:
-Chromosome representation is one of the main challenges in orienting the problem to suit GA application
-The population size depends on the nature of the problem but typically contains several hundreds or thousands of possible solutions
-The population is often randomly generated, covering the entire range of possible solutions, i.e., the search space
2. Selection:
-During each successive generation, a portion of the existing population is selected to reproduce a new generation; individual solutions are selected through a fitness-based process, where fitter solutions (as measured by the fitness function) are typically more likely to be selected
(e.g., roulette-wheel selection, which works like a lottery-ticket scheme)
-The fitness function is defined over the genetic representation and measures the quality of the represented solution
-For instance, in the knapsack problem, one wants to maximize the total value of the objects that can be put in a knapsack of some fixed capacity. A representation of a solution might be an array of bits, where each bit represents a different object and the value of the bit (0 or 1) represents whether or not the object is in the knapsack
-The fitness is the sum of the values of all the objects in the knapsack
3. Crossover:
-The next step is to generate a second-generation population of solutions from those selected, through a combination of the genetic operators crossover and mutation
-For each new solution to be produced, a pair of parent solutions is selected for breeding from the pool selected previously; by producing a child solution using crossover and mutation, a new solution is created which typically shares many of the characteristics of its parents:
1. One-point crossover:
-A point on both parents' chromosomes is picked randomly and designated the crossover point; bits to the right of that point are swapped between the two parent chromosomes
2. Two-point crossover:
-Two points are chosen randomly and the segment between them is swapped
4. Mutation:
-A bit is flipped at random, with probability 1/(length of the chromosome)
5. New generation:
-The individuals with better fitness values are selected from the parent and child populations to form the next generation
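The whole loop (selection, one-point crossover, mutation, new generation) can be sketched on the "one-max" toy problem of evolving an all-ones bit string; sizes and rates are assumptions:

```python
import random

random.seed(1)
LEN, POP, GENS = 12, 20, 60

def fitness(chrom):            # fitness function: number of ones
    return sum(chrom)

population = [[random.randint(0, 1) for _ in range(LEN)] for _ in range(POP)]
for _ in range(GENS):
    # selection: fitness-weighted ("roulette wheel") choice of parents
    parents = random.choices(population,
                             weights=[fitness(c) for c in population], k=POP)
    children = []
    for i in range(0, POP, 2):
        # one-point crossover: swap tails at a random point
        pt = random.randrange(1, LEN)
        a = parents[i][:pt] + parents[i + 1][pt:]
        b = parents[i + 1][:pt] + parents[i][pt:]
        # mutation: flip each bit with probability 1/LEN
        for child in (a, b):
            for j in range(LEN):
                if random.random() < 1 / LEN:
                    child[j] = 1 - child[j]
        children += [a, b]
    # new generation: keep the fitter individuals from parents + children
    population = sorted(population + children, key=fitness, reverse=True)[:POP]

print(fitness(population[0]))  # best individual, at or near LEN
```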
NEURAL NETS WORLD:
1. Hopfield network:
-A Hopfield network is a form of (recurrent) neural network that serves as an associative memory system
-Distributed representation: a memory is stored as a pattern of activations across a set of processing elements
-The elements in Hopfield nets are binary threshold units, i.e., units that take on only two different values for their states, where the value is determined by whether or not the unit's input exceeds its threshold $U_i$
-Each processing element makes its decision based only on its own local situation; all these local actions add up to a global solution
-Units are connected to each other with weighted, symmetric connections; a positively weighted connection indicates that the two units tend to activate each other, while a negative connection allows an active unit to deactivate a neighbouring unit
-Content addressable: a number of patterns can be stored in a network; to retrieve a pattern, we need only specify a portion of it, and the network automatically finds the closest match
-Fault tolerant: if a few processing elements misbehave, the network still functions properly
-Works as follows:
>A random unit is chosen; if any of its neighbours are active, the unit computes the sum of the weights on the connections to those active neighbours. If the sum is positive, the unit becomes active; otherwise it becomes inactive
>Another random unit is chosen, and the process repeats until the network reaches a stable state, i.e., until no more units can change state
-The network can be thought of as storing the patterns. Hopfield's major contribution was to show that, given any set of weights and any initial state, this parallel relaxation algorithm would eventually steer the network into a stable state: no divergence or oscillation
-To retrieve a pattern, we need only supply a portion of it; the network will settle into the stable state that best matches the partial pattern
Training:
-Each state of a Hopfield net has an associated energy, and running the update procedure above settles the network into a state of locally minimal energy
-Training a Hopfield net involves lowering the energy of the states the net should remember
-If we train a Hopfield net with five units so that the state (1,-1,1,-1,1) is a minimum-energy state, then if we give it (1,-1,-1,-1,1) it will converge to (1,-1,1,-1,1)
Hebbian learning:
-Adjusts the weights between neurons based on the correlation between their activities
-Can be summarized as follows:
1. If two neurons fire together, the connection between them should be strengthened
2. If two neurons do not fire together, the connection between them should be weakened
-(If every neuron is connected to every other neuron, the following formula can be used)
\[
W_{n \times n} = \sum_{i=1}^{K} p_i^{T} p_i
\]
where $K$ is the number of patterns and $p_i$ is the $i$-th pattern as a row vector; the matrices for all the patterns are summed
-(There is a good worked example in the slides)
-To compute the global energy, sum over all pairs of units the product weight $\times$ activation $\times$ activation, and negate: $E = -\sum_{i<j} w_{ij} s_i s_j$
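A sketch of the five-unit example: Hebbian storage of the pattern (1,-1,1,-1,1), recall from the corrupted cue, and the energy function. For determinism, units are updated in a fixed cycle instead of being chosen at random:

```python
pattern = [1, -1, 1, -1, 1]
n = len(pattern)

# Hebbian training: W = p^T p with a zero diagonal (no self-connections)
W = [[pattern[i] * pattern[j] if i != j else 0 for j in range(n)]
     for i in range(n)]

def recall(state, steps=20):
    """Repeatedly set each unit to the sign of its weighted input."""
    state = state[:]
    for t in range(steps):
        i = t % n                       # cycle through units deterministically
        s = sum(W[i][j] * state[j] for j in range(n))
        state[i] = 1 if s >= 0 else -1
    return state

def energy(state):
    """E = -1/2 * sum_ij w_ij s_i s_j; stored patterns are energy minima."""
    return -0.5 * sum(W[i][j] * state[i] * state[j]
                      for i in range(n) for j in range(n))

print(recall([1, -1, -1, -1, 1]))  # converges back to [1, -1, 1, -1, 1]
```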
Boltzmann machines:
-Hopfield network ta minima ma laijanxa bhanni kura bujhim ni ta hamle aba teslai hill climbing jastai ta bhayo aba teslai simulated annealing sanga mix garam na ta
-Units in Boltzamann machines udate their individual binary states by a stochastic rather than determinsitc rule, the probabilty that any given unit will be active is given by p:
p = 1/(1+e^delta E/T)
where dE is the sum of the units active input lines and T is the temperature or shceudling of the network
-Is more time consuming than backpropagation because of the complex annealing process, but it has some advantages
-Can solve constraint satisfaction problems
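A minimal sketch of the stochastic update rule above (the weights, unit index, and temperature are made up for illustration):

```python
import numpy as np

def boltzmann_update(W, s, i, T, rng):
    """Stochastic update of unit i: active with probability
    p = 1 / (1 + exp(-dE/T)), where dE is the unit's net input."""
    dE = W[i] @ s
    p_on = 1.0 / (1.0 + np.exp(-dE / T))
    out = s.copy()
    out[i] = 1 if rng.random() < p_on else 0
    return out

rng = np.random.default_rng(0)
W = np.array([[0.0, 2.0], [2.0, 0.0]])  # symmetric weights, zero diagonal
s = np.array([1, 0])
# As T -> 0 the rule becomes deterministic ("turn on iff dE > 0"),
# which is what the simulated-annealing schedule exploits.
s_cold = boltzmann_update(W, s, 1, 0.01, rng)
```

At high T the unit behaves almost randomly (p close to 0.5); at low T it behaves like a deterministic Hopfield unit.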
Training:
-The most common learning algorithm for Boltzmann machines is the contrastive divergence algorithm, which is a type of stochastic gradient descent
-The contrastive divergence algorithm approximates the gradient of the log-likelihood function of the model using Markov Chain Monte Carlo methods, which are used to sample from the model distribution
-The basic idea behind contrastive divergence is to adjust the weights of the Boltzmann machine to maximize the probability of the training data; the algorithm starts with a random initial weight matrix and updates the weights in the direction of the gradient of the log-likelihood function
Learning in neural networks:
1. Hebb nets:
-(For completeness, start from the McCulloch-Pitts neuron, which has no adjustable weights, just a threshold value; it basically sums its inputs and checks whether the sum is greater than or less than the threshold)
How to train it?
Ans:
Initialize all the weights to zero
w_new = w_prev + x*y
-Solely adjusts the weights based on the correlations between the inputs and outputs; it does not explicitly try to minimize the classification error
2. Perceptrons:
-The perceptron is one of the earliest artificial neurons; it models a neuron by taking a weighted sum of its inputs and outputting 1 if the sum is greater than some adjustable threshold value, otherwise outputting 0
-[FIG]
-The inputs (x1, x2, ..., xn) and connection weights (w1, w2, ..., wn) are typically real values, both positive and negative; if the presence of some feature xi tends to cause the perceptron to fire, the weight wi will be positive; if the feature xi inhibits the perceptron, the weight wi will be negative
-The perceptron itself consists of the weights, the summation processor, and the adjustable threshold processor
-Learning is the process of modifying the values of the weights and the threshold; the threshold itself can be treated as a weight w0 attached to a constant input of 1, after which the effective threshold becomes 0
-Several perceptrons can be grouped to perform complex functions
-Amazing property: whatever a perceptron can compute, it can learn to compute
-Let x be an input vector (x1, ..., xn); the weighted summation function g(x) and the output function o(x) can be defined as:
g(x) = sum_{i=0..n} wi*xi
o(x) = 1 if g(x) > 0, and o(x) = 0 if g(x) < 0
-If g(x) is exactly zero, the perceptron cannot decide whether to fire; a slight change in inputs could cause the device to go either way. Solving g(x) = 0 for a two-input perceptron, g(x) = w0 + w1x1 + w2x2 = 0, gives
x2 = -(w1/w2)x1 - w0/w2
-This is the decision line, or decision surface
-In perceptrons with many inputs, the decision surface will be a hyperplane through the multidimensional space of possible input vectors
Fixed increment perceptron learning: features (x1,...xn) and weights (w0, ...wn)
1. Create a perceptron with n+1 inputs and n+1 weights where the x0 is always set to 1
2. Initialize the weights (w0, w1, ... wn) to random values
3. Iterate through the training set, collecting all examples misclassified by the current set of weights
4. If all examples are correctly classified, exit
5. Otherwise, compute the vector sum S of the misclassified input vectors, where each vector has the form (x0, ..., xn). In creating the sum, add to S a vector x if x is an input for which the perceptron incorrectly FAILS to fire, but -x if x is an input for which the perceptron incorrectly fires (if it output 0 where 1 was required, add x; if it output 1 where 0 was required, add -x)
6. Multiply the sum S by a scale factor eta (the learning rate)
7. Modify the weights (w0, w1, ..., wn) by adding the elements of the vector S to them
8. Go to step 3
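The eight steps above can be sketched as a batch procedure in NumPy (a minimal sketch; the AND task and the learning-rate value are my own choices, not from the notes):

```python
import numpy as np

def fixed_increment_train(X, y, eta=1.0, max_epochs=1000, seed=0):
    """Fixed-increment perceptron learning, following the steps above.
    X: (m, n) inputs; y: targets in {0, 1}."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((len(X), 1)), X])  # step 1: x0 = 1 absorbs threshold
    w = rng.normal(size=Xb.shape[1])           # step 2: random weights
    for _ in range(max_epochs):
        out = (Xb @ w > 0).astype(int)         # fire iff g(x) > 0
        miss = out != y                        # step 3: collect misclassified
        if not miss.any():                     # step 4: all correct, exit
            return w
        # step 5: +x where it failed to fire, -x where it fired wrongly
        S = np.where(y[miss] == 1, 1, -1)[:, None] * Xb[miss]
        w += eta * S.sum(axis=0)               # steps 6-7: scale and add
    return w                                   # step 8 is the loop itself

# AND is linearly separable, so the convergence theorem applies.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w = fixed_increment_train(X, y)
```

For this tiny separable task the weights converge in a handful of epochs and classify all four inputs correctly.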
-Problem:
>While the convergence theorem guarantees correct classification of linearly separable data, most problems do not supply such nice data
-One example is the XOR problem: given two binary inputs, output 1 if exactly one of the inputs is on, and output 0 otherwise
-(Show on a graph of x1 vs x2 that XOR is not linearly separable)
3. Adaline network:
-Same structure as the perceptron, but the output is no longer restricted to 0 or 1, i.e., the thresholding output stage is removed
-Learning is the same
4. Backpropagation networks:
-Multilayer perceptron (MLP) refers to networks consisting of multiple layers of perceptrons with threshold activation
-An MLP consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer; except for the input nodes, each node is a neuron that uses a nonlinear activation
-(So each neuron has its own activation function)
-We can build any arbitrary combinational circuit out of these basic logical units; in fact, if we are allowed to use feedback loops, we can build a general computer out of them
-The existence of hidden units allows the network to develop complex feature detectors, or internal representations
-The units in a backpropagation network require a slightly different activation function from the perceptron: a backpropagation unit still sums up its weighted inputs, but unlike a perceptron it produces a real value between 0 and 1 as output, based on a sigmoid (or S-shaped) function, which is continuous and differentiable as required by the backpropagation algorithm
-Let sum be the weighted sum of the inputs to a unit, the equation for the unit's output is given by:
output = 1/(1+e^-sum)
-If sum is 0, the output is 0.5, in contrast to the perceptron, where the output must be either 0 or 1
Backpropagation algorithm:
Given: A set of input output vector pairs
Compute: A set of weights for a three layer network that maps inputs onto corresponding outputs
Say: A units in the input layer, indexed (0, ..., A) including the extra unit used for thresholding; similarly (0, ..., B) for the hidden layer and (0, ..., C) for the output layer
xj for activations of input units, hj for hidden units, and oj for output units
w1ij for the weights between the input and hidden layers
w2ij for the weights between the hidden and output layers
1. Choose an input-output pair; suppose the input vector is xi and the target output vector is yi. Assign activation levels to the input units
2. Propagate the activations from the units in the input layers to the hidden layer using the activation function:
hj = 1/(1 + e^(-sum_{i=0..A} w1ij*xi)) for all j = 1...B
3. Propagate the activations from the units in the hidden layer to the units in the output layer:
oj = 1/(1 + e^(-sum_{i=0..B} w2ij*hi)) for all j = 1...C
4. Compute the errors of the units in the output layer, denoted d2j, based on the actual output (oj) and the target output (yj):
d2j = oj(1 - oj)(yj - oj) for all j = 1...C
5. Adjust the weights between the hidden and output layers; say the learning rate is alpha:
w2ij = w2ij + alpha * d2j*hi for all i = 0,...B, j = 1...C
6. Compute the errors of the units in the hidden layer, denoted d1j:
d1j = hj(1 - hj) * sum_{i=1..C} d2i*w2ji for all j = 1...B
7. Adjust the weights between the input layer and the hidden layer
w1ij = w1ij + alpha * d1j*xi for all i = 0....A, j = 1...B
8. Go to step 1 and repeat. When all input-output pairs have been presented to the network, one epoch has been completed
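The whole algorithm fits in a short NumPy sketch; as a nod to the earlier XOR discussion, it trains a 2-4-1 network on XOR, which no single perceptron can learn (the hidden-layer size, learning rate, and epoch count are my own illustrative choices, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_bp(X, Y, hidden=4, alpha=0.5, epochs=20000, seed=0):
    """Three-layer backpropagation as in the steps above; the bias of
    each layer is handled as an extra unit fixed at 1."""
    rng = np.random.default_rng(seed)
    A, C = X.shape[1], Y.shape[1]
    W1 = rng.normal(size=(A + 1, hidden))      # input(+bias) -> hidden
    W2 = rng.normal(size=(hidden + 1, C))      # hidden(+bias) -> output
    for _ in range(epochs):
        for x, y in zip(X, Y):                 # step 1: choose a pair
            xb = np.append(1.0, x)
            h = sigmoid(xb @ W1)               # step 2: input -> hidden
            hb = np.append(1.0, h)
            o = sigmoid(hb @ W2)               # step 3: hidden -> output
            d2 = o * (1 - o) * (y - o)         # step 4: output errors
            d1 = h * (1 - h) * (W2[1:] @ d2)   # step 6: hidden errors
            W2 += alpha * np.outer(hb, d2)     # step 5: adjust hidden->output
            W1 += alpha * np.outer(xb, d1)     # step 7: adjust input->hidden
    return W1, W2

def predict(W1, W2, x):
    h = sigmoid(np.append(1.0, x) @ W1)
    return sigmoid(np.append(1.0, h) @ W2)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
Y = np.array([[0], [1], [1], [0]], float)      # XOR
W1, W2 = train_bp(X, Y)
```

Note that the hidden errors (step 6) are computed here with the pre-update hidden-to-output weights, a common implementation choice.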
The Kohonen Neural Network:
-Unsupervised machine learning
-Self-organization
-Competitive learning
-A non-recurrent feedforward network comprising two layers: the input layer and an output layer; the latter is referred to as the Kohonen layer and is said to consist of Kohonen neurons
-Used for dimensionality reduction: produces a low-dimensional (typically two-dimensional) representation of a higher-dimensional data set while preserving the topological structure of the data
-For example, a data set with p variables measured in n observations could be represented as clusters of observations with similar values for the variables
-These clusters could be visualized as a two-dimensional map such that observations in proximal clusters have more similar values than observations in distal clusters
-Makes high-dimensional data easier to visualize and analyze
-Uses competitive learning rather than error-correction learning
-Operates in two modes: training and mapping
-First, training uses an input data set to generate a lower-dimensional representation of the input data (the map space); second, mapping classifies additional input data using the generated map
-The goal of training is to represent an input space with p dimensions as a map space with two dimensions; specifically, an input space with p variables is said to have p dimensions
-(For the figure, show a one-dimensional input layer and a two-dimensional output layer)
Learning:
-The goal of learning in the SOM is to cause different parts of the network to respond similarly to certain input patterns; this is partially motivated by how visual, auditory, and other sensory information is handled in separate parts of the cerebral cortex of the human brain
1. Randomize the weights
2. Select an input vector at random and present the selected input vector x = (x1, ..., xn)
3. Find the Kohonen neuron j whose associated weight vector (w1j, w2j, ..., wnj) is closest to the input vector x. Closeness can be measured using any suitable distance function, a simple one being the Euclidean distance:
dist(wj, x) = sqrt(sum_{i=1..n} (wij - xi)^2)
This is the neuron that resembles the input most
4. Modify the weights of all neurons in the neighborhood of radius r of the winning neuron using the equation:
wj(t+1) = wj(t) + alpha*[x(t) - wj(t)]
5. This modification causes the winning neuron and its neighbours to move closer to the input, i.e., to learn it
6. Update alpha by reducing it gradually over the iterations; this reduces the rate at which the neurons converge on the input
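The learning steps above can be sketched with a one-dimensional Kohonen layer (the grid size, decay schedules, and toy two-cluster data are illustrative choices, not from the notes):

```python
import numpy as np

def train_som(data, grid=5, epochs=50, alpha0=0.5, r0=2, seed=0):
    """A 1-D Kohonen layer of `grid` neurons mapping n-dim inputs."""
    rng = np.random.default_rng(seed)
    W = rng.random((grid, data.shape[1]))                # step 1: random weights
    for t in range(epochs):
        alpha = alpha0 * (1 - t / epochs)                # step 6: decaying alpha
        r = int(r0 * (1 - t / epochs))                   # shrinking radius
        for x in rng.permutation(data):                  # step 2: random input
            j = np.argmin(((W - x) ** 2).sum(axis=1))    # step 3: winning neuron
            lo, hi = max(0, j - r), min(grid, j + r + 1)
            W[lo:hi] += alpha * (x - W[lo:hi])           # steps 4-5: move neighbourhood
    return W

# Two well-separated 2-D clusters; some neuron should settle near each.
data = np.vstack([np.full((20, 2), 0.1), np.full((20, 2), 0.9)])
W = train_som(data)
```

After training, distinct neurons end up near the two clusters, illustrating how the map's neurons specialize to different regions of the input space.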
HOW APPLICATIONS ARE ACHIEVED:
Expert systems:
-An expert system is a computer system emulating the decision-making ability of a human expert; such systems are designed to solve complex problems by reasoning through bodies of knowledge, represented mainly as if-then rules rather than through conventional procedural code
-To solve expert-level problems, expert systems need access to a substantial domain knowledge base, which must be built as efficiently as possible
-An expert system is divided into two subsystems: the inference engine and the knowledge base. The knowledge base represents facts and rules. The inference engine applies the rules to the known facts to deduce new facts. Inference engines can also include explanation and debugging abilities
-Dendral, whose primary aim was to help organic chemists identify unknown organic molecules, is an early example
Advantages (listed for completeness): high efficiency, not affected by emotions, high security, regular updates, considers all the facts, no memory limitations, increased availability and reliability, fast response, explanation ability
SHELLS:
-Initially, each expert system was created from scratch, usually in LISP, but after several systems had been built this way it became clear that these systems often had a lot in common. In particular, since the systems were constructed as a set of declarative representations (mostly rules) combined with an interpreter for those representations, it was possible to separate the interpreter from the domain-specific knowledge and thus to create a system that could be used to construct new expert systems by adding new knowledge corresponding to a new domain
-The resulting interpreters are called shells; one example is EMYCIN, the shell derived from MYCIN
1. Knowledge base (knowledge acquisition and representation):
-A knowledge engineer interviews a domain expert to elicit expert knowledge, which is then translated into rules; after the initial system is built, it must be iteratively refined until it approximates expert-level performance
-Many programs exist that interact with experts to extract knowledge efficiently, helping with:
>Entering knowledge
>Maintaining knowledge base consistency
>Ensuring knowledge base completeness
-MOLE is a knowledge acquisition system for heuristic classification problems, such as diagnosing diseases
-It accepts input data, comes up with a set of candidate explanations or classifications that cover or explain the data, then uses differentiating knowledge to determine which one is best
Knowledge representation:
-Declarative, procedural, meta (we have already studied how to represent knowledge)
2. Inference engine:
-Deterministic or probabilistic reasoning (algorithms such as forward chaining and backward chaining)
3. Interfaces:
-People must be able to interact with the system easily; to facilitate this, the expert system must have the following two capabilities in addition to the ability to perform the underlying task:
>Explain its reasoning: people need to be convinced; for example, a doctor must accept ultimate responsibility for a diagnosis
>Acquire new knowledge and modifications of old knowledge
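Since the inference engine section mentions forward chaining, here is a minimal sketch of it (the toy rules are hypothetical, not from any real expert system):

```python
def forward_chain(facts, rules):
    """Naive forward chaining: fire any rule whose premises are all
    known facts, until no new fact can be derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in facts and set(premises) <= facts:
                facts.add(conclusion)   # deduce a new fact
                changed = True
    return facts

# Hypothetical toy rules, in the spirit of a diagnostic knowledge base.
rules = [
    (["fever", "rash"], "measles-suspected"),
    (["measles-suspected", "not-vaccinated"], "recommend-test"),
]
derived = forward_chain(["fever", "rash", "not-vaccinated"], rules)
```

Note how the second rule fires only after the first has added its conclusion: this chaining of deductions is exactly what the inference engine does over the knowledge base.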
NLP:
-By studying language, we can come to understand more about the world; we can test our theories about the world by how well they support our attempt to understand language
-If we succeed at building a computational model of language, we will have a powerful tool for communicating about the world
-The problem is how to program computers to process and analyze large amounts of natural language data; the goal is a computer capable of understanding documents, including the contextual nuances of the language within them, which can extract the information and insights contained in the documents as well as categorize and organize the documents themselves
-Includes both understanding and generation, as well as other tasks such as multilingual translation
-It is useful to divide the entire language processing problem into two tasks:
1. Processing written text, using lexical, syntactic, and semantic knowledge of the language as well as the required real-world information
2. Processing spoken language, using all the information needed above plus additional knowledge about phonology as well as enough added information to handle the further ambiguities that arise in speech
Includes: OCR, speech recognition, speech segmentation (voice to words), text to speech
Steps:
with example of "I want Ram's phone number"
1. Morphological analysis:
-Individual words are analyzed into their components, and nonword tokens, such as punctuation, are separated from the words
-Pull apart a word like "Bill's" into the proper noun Bill and the possessive suffix 's
-Recognize a sequence such as ".init" as a file extension functioning as an adjective in the sentence
-Assign categories to all the words in the sentence
2. Syntactic analysis: linear sequences of words are transformed into structures that show how the words relate to each other; some word sequences may be rejected if they violate the language's rules for how words may be combined. For example, an English syntactic analyzer would reject the sentence "Boy the go the store"
-The goal is parsing: converting the flat list of words that forms the sentence into a structure that defines the units represented by that flat list
-Converts the sentence into a hierarchical structure that will correspond to meaning units when semantic analysis is performed
-(Draw a parse tree here)
3. Semantic analysis: the structures created by the syntactic analyzer are assigned meanings; in other words, a mapping is made between the syntactic structures and objects in the task domain. Structures for which no such mapping is possible may be rejected; for example, the phrase "colorless green flower" may be rejected
-Maps individual words onto appropriate objects in the knowledge base or database
-Must create the structures that correspond to the way the meanings of the individual words combine with each other
-(Give examples from frame-based knowledge here)
Reference markers for the sentence:
RM1:
instance: wanting
agent: I
object: Sita's number
RM2:
instance: want
agent: I
object: Ram's number
and so on
4. Discourse integration: the meaning of an individual sentence may depend on the sentences that precede it, and may influence the meanings of the sentences that follow it. For example, the word "it" in the sentence "John wanted it" depends on the prior context
-At this point we know what kinds of things the sentence is about, but we do not yet know which specific individuals are referred to
-We do not know who "I" and "Sita" are
-The system can identify the current user through the login name, and the person named Sita could be, say, user 8932
5. Pragmatic analysis: the structure representing what was said is reinterpreted to determine what was actually meant. For example, the sentence "Do you know what time it is?" should be interpreted as a request to be told the time
-Here the system decides what to do as a result, i.e., to give the number
-The boundaries between these five phases are often very fuzzy; the phases may be performed in sequence, or sometimes all at once
Possible problems:
>The same expression can mean different things: "Where is the water?"
>There are many ways of saying the same thing: e.g., Ram's birthday
>Sentences and phrases might carry hidden emotions and meanings
>Some languages have complicated syntax and semantics
>People use grammatically incorrect sentences and incorrect punctuation
Computer vision:
-Acquiring, processing, analyzing, and understanding digital images, and extracting high-dimensional data from the real world in order to produce numerical or symbolic information, e.g., in the form of decisions
-Understanding in this context means the transformation of visual images into descriptions of the world that make sense to thought processes and can elicit appropriate actions
-Image understanding can be seen as disentangling symbolic information from image data using models constructed with the aid of geometry, physics, statistics, and learning theory
-The inspiration is how good the human brain is at extracting information from visual input
-Has applications in robotics and related fields
-Typical tasks: recognition, motion analysis, scene reconstruction, image restoration
-Methods or processes:
1. Image acquisition: through sensors and cameras
2. Pre-processing: noise reduction, re-sampling, contrast and brightness management
3. Feature extraction: extract features at various levels, such as lines, edges, and ridges, and interest points such as corners
4. Detection/segmentation: decide which parts of the image deserve further processing; select specific interest points
5. High-level processing: now typically on a small part of the image; classify detected objects into different categories; compare different views
6. Decision making: make the final decision, e.g., pass or fail in automatic inspection
PulnixZiCam and other cameras:
-Uses a hardware neural network that can extract 64 features, including histograms, profiles, and pixel samples
-Has 74 outputs, so instead of giving only pass or fail it can be trained to separate products into up to 74 classes
\newpage \section{Classical Machine Learning (prob data mining)} \subsection{Data mining intro} \begin{itemize} \item Also called knowledge discovery in databases (KDD) \item Can be performed on a variety of databases, on warehouses and sometimes on transactional databases as well \item Warehouses have a three-tier architecture (bottom tier: a warehouse server, usually a relational database, which collects, cleanses, and transforms data from multiple sources through ETL; middle tier: an OLAP server that enables fast query speeds; top tier: a front-end user interface or reporting tool, which enables end users to conduct ad-hoc data analysis) \item Data objects are described by a number of attributes that capture the basic characteristics of the object \item Dimensionality: the number of attributes that the objects in the dataset possess; data with a small number of dimensions tend to be qualitatively different than moderate or high-dimensional data \end{itemize}
Types of attributes: \begin{enumerate} \item Nominal \begin{itemize} \item Are just different names \item Available operations: equalities \item Transformation: can be transformed with any one-to-one mapping \item Stats: mode \item Ex: if employee IDs are reassigned, it will not make any difference \end{itemize} \item Ordinal \begin{itemize} \item Enough information to order objects \item Available operations: equalities, comparisons \item Transformation: new value = f(old value) where f is a monotonic function \item Stats: mode, median \item Ex: good, better, best can be represented equally as {0, 1, 2} or {1, 2, 3} \end{itemize} \item Interval \begin{itemize} \item Differences between values are meaningful \item Available operations: equalities, comparisons, add/sub \item Transformation: new value = a * old value + b \item Stats: mode, median, mean \item Ex: Fahrenheit and Celsius \end{itemize} \item Ratio \begin{itemize} \item Both differences and ratios are meaningful \item Available operations: equalities, comparisons, add/sub, mul/div \item Transformation: new value = a * old value \item Stats: mode, median, mean, geometric mean \item Ex: length can be measured in meters and feet \end{itemize} \end{enumerate}
Measurement and data collection issues \begin{enumerate} \item random noise and deterministic artifacts: elimination of noise is frequently difficult, and much work in data mining focuses on devising robust algorithms that produce acceptable results even when noise is present \item outliers: unlike noise, outliers may sometimes be of interest \item inconsistent values: \item duplicate values: \item missing values: drop them, or estimate them with a mention
\end{enumerate}
\subsection{Data preprocessing} \begin{itemize} \item Sampling: selecting a subset of data objects to be analyzed; a sample is representative if it has approximately the same properties as the original set of data; a sampling scheme that can accommodate differing frequencies for the items of interest is stratified sampling \item Feature creation: \item Dimensionality reduction: feature subset selection, (ratios only) PCA, SVD, Fourier transforms \item Binarization: algorithms that find association patterns require the data to be in the form of binary attributes; if there are m categorical values, uniquely assign each original value to an integer in the interval [0, m-1]; if the attribute is ordinal, order must be maintained by the assignment; if there are too many categories, it is better to resort to feature creation first; then convert those integers into binary, with each binary digit representing an attribute (one-hot encoding) \item Discretization: transformation of a continuous attribute to a categorical attribute involves two subtasks: deciding how many categories to have and determining how to map the values of the continuous attribute to these categories \item Variable transformation (ratios): especially sqrt, log, and 1/x are used to transform data that does not have a Gaussian distribution into data that does; another common type is normalization \end{itemize} \subsection{Measures of similarities} \begin{itemize} \item A similarity is a numerical measure of the degree to which two objects are alike \item Dissimilarity can be mapped to the interval [0,1] by using $d' = (d - \min d)/(\max d - \min d)$ \item Converting dissimilarity to similarity is straightforward: $s = 1 - d$ if the dissimilarity falls in the interval [0,1] \item Another simple approach is negation, $s = -d$; if restriction to [0,1] is desired, then a transformation such as $s = \frac{1}{d+1}$, $s = e^{-d}$, or $s = 1 - \frac{d - \min d}{\max d - \min d}$ can be used \item Dissimilarities between simple attributes: nominal ($d = 0$ if $x = y$, $1$ if $x \neq y$); ordinal ($d = |x - y|/(n-1)$ after ordering with numbers) \item Distance measures as dissimilarities between data objects: the Minkowski distance $d(x,y) = \left(\sum_{k=1}^{n} |x_k - y_k|^r\right)^{1/r}$; the Euclidean case ($r = 2$) is a metric (positivity, symmetry, triangle inequality) \item Some other non-metric dissimilarities include set differences and time \item Similarity measures typically satisfy symmetry, but the triangle inequality typically does not hold; examples for binary attributes only are the simple matching and Jaccard coefficients; for documents, cosine similarity \end{itemize}
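The similarity and dissimilarity measures above can be sketched directly (the example binary vectors are arbitrary):

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance (Minkowski with r = 2)."""
    return np.sqrt(((x - y) ** 2).sum())

def smc(x, y):
    """Simple matching coefficient: matching positions / all positions."""
    return np.mean(x == y)

def jaccard(x, y):
    """Jaccard coefficient: 1-1 matches / positions where either is 1."""
    return np.logical_and(x, y).sum() / np.logical_or(x, y).sum()

def cosine(x, y):
    """Cosine similarity, commonly used for document vectors."""
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1, 0, 0, 1, 1])
y = np.array([1, 0, 1, 1, 0])
```

For these vectors SMC counts the shared 0s (giving 3/5) while Jaccard ignores them (giving 2/4), which is why Jaccard suits sparse binary data.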
\subsection{Stats and visualization} \subsection{Classification} \begin{itemize} \item Employ a model that both fits the relationship between the attribute set and the class label of the input data well and correctly predicts the class labels of records it has never seen before \item Evaluation of the performance of a classification model is based on the counts of test records correctly and incorrectly predicted by the model (confusion matrix); actual classes on the left and predicted classes on top \item Decision trees: in a decision tree, each leaf node is assigned a class label; non-terminal nodes, which include the root and other internal nodes, contain attribute test conditions to separate records that have different characteristics \item Hunt's algorithm builds a decision tree in a recursive manner by partitioning the training records into successively purer subsets: let $D_t$ be the set of records associated with node $t$; if all records in $D_t$ belong to the same class $y_t$, then $t$ is a leaf node labelled $y_t$; else an attribute test condition is selected to partition the records into smaller subsets \item Two questions: how should the training records be split? how should the splitting procedure stop? \item Selecting the best split?
Continuous attributes need discretization to split; impurity measures can be used to determine the best way to split the records, defined in terms of the class distribution of the records before and after splitting. Initially, the records at a node have a distribution of classes, so there is a probability distribution and hence an entropy; after splitting, each child has its own class distribution; if we take the weighted sum of the children's impurities and subtract it from the impurity of the original node, we get the change in impurity (the gain): $\Delta = I(\text{parent}) - \sum_{j=1}^{k} \frac{N(v_j)}{N} I(v_j)$ \item Characteristics: \begin{itemize} \item Is nonparametric, i.e., does not require any prior assumptions regarding the type of probability distribution satisfied by the class and other attributes \item Finding an optimal decision tree is an NP-complete problem, so it requires a heuristic such as ours \item Easy to interpret \item Quite robust to the presence of noise, especially when methods for avoiding overfitting are employed \item The presence of redundant attributes does not adversely affect the accuracy of the decision tree \item … \end{itemize} \end{itemize} \subsection{Overfitting} \subsection{Clustering} \begin{itemize} \item \end{itemize}
\newpage Note: NOISY DATA CAN OVERFIT
Note: (Not sure where best to put this, but) we use cross_val_score to find the MSE, and we also did a demo of accuracy; later we actually used cross_val_predict to obtain each individual prediction, which is needed for calculating accuracy, recall, and the other per-prediction metrics. cross_val_score gives a 'scoring' value for each CV fold, whereas cross_val_predict gives an output for each instance.
JUMP TO
Supervised learning:
>Given a training set of N example input-output pairs (x1, y1), (x2, y2), ..., (xN, yN), where each pair was generated by an unknown function y = f(x), the goal of a learning algorithm is to discover a function h that approximates the true function f
>H = the class of models from which h is drawn; a class contains instances of models whose complexity varies
>So the situation is that we need to approximate a function f which we don't know, for inputs which we don't have; how do we even know that our approximation is reliable?
Main challenges of data:
1. Insufficient quantity of training data; more data is better, sometimes mattering even more than the algorithm
2. Nonrepresentative training data of the situations where the model will actually be used, due to flawed sampling
3. Poor quality data: the data could come from good sampling techniques but be full of errors
4. Irrelevant features: data can contain useless features, which should be removed
Summary:
datacount -> representation -> quality -> features
Data assumption:
-Assumption on how the future examples will look: we assume that future examples will be like the past (stationarity assumption), i.e., that each example Ej has the same prior probability distribution:
P(Ej) = P(Ej+1) = P(Ej+2) = ...
and is independent of the previous examples:
P(Ej) = P(Ej | Ej-1, Ej-2, ...)
Examples that satisfy these equations are independent and identically distributed (iid)
Main challenges of data and algorithms together:
0. OVERFITTING:
-Fits well during training but gives poor results in actual practice; a complex model does this. To solve it, either fix the data issues (1), (2) or reduce the complexity of the model. This is also called high variance
-It is possible to measure the complexity of a model by the minimum description length (MDL), which is the total number of bits required to represent the model without information loss
1. TESTING:
-To see how the model performs, divide the examples into a training set and a test set, usually 80-20%; the test data is our secret, and nobody touches it until the model is ready
-The goal of the agent should be to minimize what we call the expected loss over the test data, L(y, y'): L(actual y, predicted y')
-Commonly used L's:
a. L0/1: sum over all samples of 1 if incorrect, 0 if correct
b. L1: sum over all samples of the absolute difference between actual and predicted
c. L2: sum over all samples of the squared difference between actual and predicted
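The three loss functions, and the empirical (per-example average) version used below, can be sketched as follows (the example vectors are arbitrary):

```python
import numpy as np

def loss_01(y, yhat):
    """L0/1: 1 per incorrect prediction, 0 per correct one."""
    return np.sum(y != yhat)

def loss_l1(y, yhat):
    """L1: sum of absolute differences."""
    return np.sum(np.abs(y - yhat))

def loss_l2(y, yhat):
    """L2: sum of squared differences."""
    return np.sum((y - yhat) ** 2)

def emp_loss(loss, y, yhat):
    """Empirical loss: average loss over the N examples."""
    return loss(y, yhat) / len(y)

y = np.array([1.0, 0.0, 1.0, 1.0])       # actual
yhat = np.array([1.0, 1.0, 0.0, 1.0])    # predicted
```

For binary targets the three losses coincide here; they diverge once predictions are real-valued, where L2 penalizes large errors more heavily than L1.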
-Our goal is to reduce the error measure on the actual real-world data, but that is not usually possible because we do not even have that data, so we minimize the empirical expected error using the training data itself
-To do any of this we need a prior probability distribution P(X,Y) over examples; P is the probability distribution of the (x,y) instances drawn from the actual real-life situation
-Let e be the set of all possible input-output examples; then the expected generalization loss for a hypothesis h with respect to loss function L is:
GenLoss_L(h) = sum_{(x,y) in e} L(y, h(x)) * P(x,y)
-Because P(x,y) is not known in most cases, the learning agent can only estimate the generalization loss with the empirical loss on a set of examples E of size N:
EmpLoss_{L,E}(h) = (1/N) * sum_{(x,y) in E} L(y, h(x))
-Hence the takeaway is: you want to make your model good, which is measured by the expected loss over the test data, which is in turn estimated by the empirical loss over the training data; since the true probability distribution is not known, we assume it is uniform over the examples (though it is debatable whether uniform is a good approximation for an unknown distribution)
2. MODEL VALIDATION:
-So we have to choose from a family of classes; there are parameters which, when varied, give different models within the same class. Call these knobs hyperparameters: the parameters of the model class
-But notice that if you build one model, check it on the test data, look at the result, improve the model, and repeat, the process is biased; even if the researcher has good intentions, the process is flawed
-So we need three divisions, usually 70%-15%-15%:
1. A training set to train candidate models
2. A validation set to evaluate the candidate models
3. A test set to do the final unbiased evaluation of the best model
Learning graph = error measure vs capacity of the model
-Training error: usually decays straight down, with some minor fluctuation
-Test error: usually follows a U pattern, or sometimes bumps over and decreases toward zero
>Which comes down to how different model classes make use of excess capacity and how well that matches up with the problem at hand; as we add capacity to the model class, we often reach the point where all the training examples can be represented perfectly within the model
Two ways:
a. Simply choose the lowest point of the U if finding the graph is easy; but if capacity is a function of multiple parameters, use grid search: try all combinations of values and see which performs best on the validation data. Different combinations can run in parallel on different machines, so with resources it need not be slow. Search algorithms such as random search also come into play
b. Cross-validation:
-A better technique, called k-fold cross-validation: first split the training data into k equal subsets, then perform k rounds of learning; on each round, 1/k of the data is held out as a validation set and the remaining examples are used as the training set
-So one subset is used for validation and the rest for training; repeat for k rounds, and the average validation score of the k rounds should then be a better estimate than a single score
-Popular values of k are 5 and 10, enough to give an estimate that is statistically likely to be accurate, at a cost of 5 to 10 times longer computation time; we still need a separate test set here
-The extreme end is k = n, also known as leave-one-out cross-validation (LOOCV)
-The error is the average over these k training rounds
-Can be used for comparing any types of models, not just related ones, so it is a general comparison tool which can also be used for hyperparameter tuning
-The model selection algorithm takes a learning algorithm as an argument; the learner takes one hyperparameter, which is named size in the figure
-It starts with the smallest value of size, yielding a small model which will probably underfit the data, and iterates through larger values of size, considering more complex models; in the end it selects the model that has the lowest average error rate on the held-out validation data
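The model-selection loop above can be sketched in plain numpy; this is a toy version under assumptions of mine: the "size" hyperparameter is taken to be polynomial degree, and np.polyfit plays the learner.

```python
import numpy as np

def cross_val_error(X, y, degree, k=5):
    """Average validation MSE of a degree-`degree` polynomial fit over k folds."""
    idx = np.random.permutation(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]                                        # held-out 1/k of the data
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(X[train], y[train], degree)       # learn on the other k-1 folds
        pred = np.polyval(coeffs, X[val])
        errors.append(np.mean((pred - y[val]) ** 2))
    return np.mean(errors)                                    # average of the k rounds

def select_model(X, y, sizes=(1, 2, 3, 4, 5)):
    """Sweep 'size' from smallest to largest, keep the lowest held-out error."""
    return min(sizes, key=lambda d: cross_val_error(X, y, d))
```

On data with a quadratic trend, degree 1 underfits and gets a visibly worse cross-validation score than degree 2 and up.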
6. DIRECT COMPLEXITY PENALTY:
-Finding the optimal model complexity via cross-validation is fine, but sometimes it is preferable to penalize complexity directly in the model itself
-We saw in model selection how to do cross-validation (recall: k-fold cross-validation divides the data into k subsets), where an error was computed for each value of size; an alternative approach is to search for a hypothesis that directly minimizes the weighted sum of empirical loss and the complexity of the hypothesis, which we call the total cost:
Cost(h) = EmpLoss(h) + lambda Complexity(h)
[and choose model that minimizes this cost]
-First, the Complexity(h) function depends on the hypothesis space H we are dealing with
-Here lambda is a hyperparameter, a positive number that serves as a conversion rate between loss and hypothesis complexity; if lambda is chosen well it nicely balances the empirical loss of a simple function against a complicated function's tendency to overfit
-Now lambda itself is a hyperparameter that needs a value: one way to find it is cross-validation, but there are other techniques as well
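A minimal sketch of the total cost, assuming (my choice, not fixed by the notes) a linear hypothesis with MSE as EmpLoss and the sum of squared weights as the Complexity measure:

```python
import numpy as np

def total_cost(X, y, w, lam):
    """Cost(h) = EmpLoss(h) + lambda * Complexity(h) for a linear hypothesis w.
    EmpLoss: mean squared error. Complexity: sum of squared weights
    (one possible choice; Complexity(h) depends on the hypothesis space H)."""
    emp_loss = np.mean((X @ w - y) ** 2)
    complexity = np.sum(w ** 2)
    return emp_loss + lam * complexity
```

With lambda = 0 this is pure empirical loss; raising lambda shifts the optimum toward smaller weights.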
TOTAL STEPS: 1. Take a quick look at the data structure and columns and describe them to get a feel for the data
2. Create a test set:
def split_train_test(data, test_ratio):
    np.random.seed(42)
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
OR,
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
Wait, if the dataset is not large enough, especially relative to the number of attributes, then random sampling introduces a sampling bias: so instead the population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum to guarantee that the test set is representative of the overall population
new_series = pd.cut(series_to_categorize, bins=[...], labels=[...])
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42) #only one split needed
for train_index, test_index in split.split(frame_to_sample, new_series):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
>So we iterate over the splits and assign the sets
3. Discover and visualize the data to gain domain insights:
-Two types of attributes:
>Numerical
>Categorical
-See which one is which and do the following analysis for each of them:
-For numerical, see the distribution of each in a frequency diagram (histogram)
-For categorical, something similar should be possible, e.g. bar plots of value counts
-If the training set is quite large then sample some data for visualization and exploration, else just copy it all
housing = strat_train_set.copy()
-If the dataset is not too large you can find the correlation between every pair of attributes using the corr() method:
corr_matrix = housing.corr()
-See which attribute is most highly correlated with the one we want to predict, as that will be the most promising attribute, and investigate it further
-Now try some attributes that are not present yet by combining existing ones, e.g. dividing one series by another to find cost per household and the like; add them to the dataframe, recompute the correlation matrix, and see if there is something there
4. Data cleaning:
-Models cannot work with missing data so get rid of whole attribute or the row or assign it to zero or median value
-Do it with what the pandas frame provides, OR
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
-Need to drop the text columns first:
housing_num = housing.select_dtypes(include=[np.number])
imputer.fit(housing_num)
X = imputer.transform(housing_num)
Custom transformers:
-fit() returning self
-transform()
-fit_transform(), which comes for free by inheriting TransformerMixin as a base class
>Also if you add BaseEstimator you get get_params() and set_params(), useful for automatic hyperparameter tuning
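A minimal custom transformer along those lines; RatioAdder and its columns are hypothetical names of mine, not from any library:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends the ratio of column i to column j.
    fit() learns nothing and returns self; fit_transform() comes for free
    from TransformerMixin; get_params()/set_params() from BaseEstimator."""
    def __init__(self, i=0, j=1):
        self.i = i
        self.j = j

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        ratio = X[:, self.i] / X[:, self.j]
        return np.c_[X, ratio]  # original columns plus the new ratio column
```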
=For PAC learning the juice is provided by the stationarity assumption, which says that future examples are going to be drawn from the same fixed distribution P(E) = P(X, y) as past examples (we don't need to know which distribution, just that it does not change)
-In addition, assume f is deterministic and is a member of the hypothesis space H being considered
=Any learning algorithm that returns a PAC solution is a PAC learning algorithm
-The simplest PAC theorems deal with Boolean functions, for which the 0/1 loss is appropriate; the error rate of hypothesis h, defined informally before, is defined formally here as the expected generalization error for examples drawn from the stationary distribution:
error(h) = GenLoss_{L0/1}(h) = sum over (x,y) of L_{0/1}(y, h(x)) P(x, y)
-In other words, error(h) is the probability that h misclassifies a new example
-A hypothesis h is called approximately correct if error(h) <= epsilon, where epsilon is a small constant; we will show that we can find an N such that, after training on N examples, with high probability all consistent hypotheses will be approximately correct
-An approximately correct hypothesis lies inside the epsilon-ball around the true function; everything outside is Hbad
-We can derive a bound on the probability that a seriously wrong hypothesis hb in Hbad is consistent with the first N examples as follows:
-We know that error(hb) > epsilon, thus the probability that it agrees with any given example is at most 1 - epsilon; since the examples are independent,
P(hb agrees with N examples) <= (1 - epsilon)^N
-The probability that Hbad contains at least one consistent hypothesis is bounded by the sum of the individual probabilities:
P(Hbad contains a consistent hypothesis) <= |Hbad| (1 - epsilon)^N <= |H| (1 - epsilon)^N
where we have used the fact that Hbad is a subset of H and thus |Hbad| <= |H|. We would like to reduce the probability of this event below some small number delta:
P(Hbad contains a consistent hypothesis) <= |H| (1 - epsilon)^N <= delta
-Given that 1 - epsilon <= e^{-epsilon}, we can achieve this if we allow the algorithm to see
N >= (1/epsilon) (ln(1/delta) + ln|H|) examples
-Thus with probability at least 1 - delta, after seeing this many examples the learning algorithm will return a hypothesis that has error at most epsilon; in other words it is probably approximately correct. The number of required examples, as a function of epsilon and delta, is called the sample complexity of the learning algorithm
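The sample-complexity bound above is directly computable; a small sketch:

```python
import math

def sample_complexity(eps, delta, H_size):
    """Smallest integer N satisfying N >= (1/eps) * (ln(1/delta) + ln|H|)."""
    return math.ceil((1.0 / eps) * (math.log(1.0 / delta) + math.log(H_size)))
```

Note how N grows only logarithmically in |H| and 1/delta, but linearly in 1/epsilon: halving the tolerated error roughly doubles the required number of examples.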
Regression:
-Linear regression, where each example x is an n-element vector; our hypothesis space is the set of functions of the form
h(x) = w0 + w1x1 + ... + wnxn
-Let N be the number of training examples and n the number of features (with weights w0 ... wn)
-To unlock the linear algebra machinery, introduce a dummy feature x0 = 1, so that
h(x) = w dot x
-We find w by minimizing the expected loss over the examples
-Gauss showed that if the yj values have normally distributed noise, then the most likely values of the w's are obtained by using the L2 loss, minimizing the sum of squares of the errors; if the values have noise that follows a Laplace distribution then the L1 loss is appropriate
-Given some weight vector w, its loss depends on the example vectors X and the labels y:
L(w) = sum over j of (yj' - yj)^2 / (2N)   (expected loss, with yj' = w dot xj)
[The divide by 2 is for beauty: it cancels when differentiating]
-To minimize it we need the gradient, of course:
dL(w)/dwi = sum over j of (yj' - yj) xj,i / N
-Vectorially it also has a nice form:
grad L = X^T (Xw - y) / N   [X is the matrix made by arranging the x vectors row-wise]
[Tempting to set it to zero and invert, right? That gives w* = (X^T X)^{-1} X^T y, but X^T X is not necessarily invertible]
-Then later:
-Hence training takes O(n^3), because inverting/multiplying the n-by-n matrix X^T X is involved
-Another way is via the SVD, as follows:
X = U Sigma V^T
then the pseudoinverse is
X+ = V Sigma+ U^T
-To calculate Sigma+, take Sigma, set to zero all values smaller than a tiny threshold, replace all the nonzero values with their inverse, and finally transpose the resulting matrix
-So the inversion step of the original has been replaced by finding Sigma+, which takes O(n^2) as compared to O(n^3), where n is the number of features, so the better scaling with features is obvious here
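The SVD route above in numpy, as a small sketch; it matches numpy's own pinv on well-behaved inputs:

```python
import numpy as np

def pinv_via_svd(X, threshold=1e-10):
    """Pseudoinverse X+ = V Sigma+ U^T from the SVD X = U Sigma V^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Sigma+: zero out tiny singular values, invert the rest
    s_plus = np.array([1.0 / v if v > threshold else 0.0 for v in s])
    return Vt.T @ np.diag(s_plus) @ U.T

# Least-squares weights without ever forming (X^T X)^{-1}:
# w = pinv_via_svd(X) @ y
```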
CODING:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression(
    copy_X=True,         # whether to copy X or allow overwriting it
    fit_intercept=True,  # whether to learn the bias term w0
    positive=False,      # force all attribute weights to be positive?
    n_jobs=None,         # how many CPUs to use
)
lin_reg.fit(train_set, train_set_labels)
predictions = lin_reg.predict(train_set)
But before that:
-Linear models require the dataset to be:
>Fully numerical
>No missing data at all
>Better to minimize the outliers
>Attributes added and removed based on correlation
1. OneHotEncoder() to convert non-numeric columns into numerics
2. SimpleImputer(strategy='median') for filling in the missing values
3. Custom
4. Custom
>They are put through a ColumnTransformer to generate the numpy arrays of the training set
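How those pieces plug into a ColumnTransformer, on a made-up two-column frame (the column names and values here are my own toy example):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical tiny frame: one numeric column (with a hole), one categorical
data = pd.DataFrame({
    "rooms": [2.0, np.nan, 4.0],
    "ocean": ["NEAR", "FAR", "NEAR"],
})

full_pipeline = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median"))]), ["rooms"]),
    ("cat", OneHotEncoder(), ["ocean"]),
])
prepared = full_pipeline.fit_transform(data)  # all-numeric array, ready for a model
```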
Gradient descent:
-Visualize the weights-vs-cost graph, which is always convex if you use L2 loss (another reason to use L2); the gradient points toward steepest ascent, so we need its negative
w(new) = w(old) + eta * direction of steepest descent
>The direction of steepest descent depends on the current batch of examples and their outputs under the old weights
-The size and direction of the next step is given by:
w(new) = w(old) - eta * gradient of MSE
-Using all the examples at once is called batch gradient descent; a faster variant is called stochastic gradient descent:
-It picks a random instance from the training set at every step and computes the gradient based only on that single instance
-Much less regular than batch: instead of gently decreasing, the cost will bounce up and down, decreasing only on average; over time it ends up very close to the minimum, but once it gets there it will continue to bounce around, never settling down, so when it stops the final parameter values are good, but not optimal
-Now there are a few hyperparameters to vary:
>eta, the learning rate
>the number of iterations
>the tolerance level on the last epoch's gradient
>the learning schedule by which eta varies over epochs
-Each is provided and implemented in sklearn's SGDRegressor
-Gradient descent depends on distances, so feature scaling helps, you know
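Before reaching for sklearn, the batch update rule above is a few lines of numpy (a sketch, assuming the bias column is already baked into X):

```python
import numpy as np

def batch_gd(X, y, eta=0.1, n_iters=1000):
    """Batch gradient descent for linear regression with MSE loss.
    X is assumed to already carry the bias column x0 = 1."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iters):
        grad = (2.0 / m) * X.T @ (X @ w - y)  # gradient of the MSE over ALL examples
        w = w - eta * grad                    # step opposite the gradient
    return w
```

Swapping the full X for one randomly chosen row per step turns this into stochastic gradient descent.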
CODING:
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(
    eta0=0.01,
    max_iter=1000,
    tol=0.001,
    learning_rate='invscaling',
)
sgd_reg.fit(train_set, train_set_labels)
predictions = sgd_reg.predict(train_set)
>The gradient descent stops on the following conditions:
1. If it reaches max_iter iterations
2. If improvement stays below the tol level for n_iter_no_change iterations, which can use a validation set if you set early_stopping=True
Polynomializing:
-The relationship may not be quite linear; it might be slightly quadratic or cubic, and in such cases we transform the attributes and add higher-degree versions of them
CODING:
from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
which adds features up to the given degree in addition to the existing ones
Learning curves:
>Can do cross-validation to get an estimate of a model's general performance; if a model performs well on the training data but generalizes poorly on cross-validation then it is overfitting, if poor on both then underfitting
>Another way is to look at the learning curves: plots of the model's performance on the training set and validation set as a function of the training set size; to generate the plots, simply train the model several times on different-sized subsets of the training data
CODING:
from sklearn.model_selection import learning_curve
train_sizes, train_errors, validation_errors = learning_curve(
    estimator=<trained model name here>,
    X=train_set,
    y=train_labels,
    cv=3,
    train_sizes=[1000, 2000, 3000, 4000, 5000],
    scoring='neg_mean_squared_error',
)
train_errors = [np.average(np.sqrt(-x)) for x in train_errors]
validation_errors = [np.average(np.sqrt(-x)) for x in validation_errors]
scores = pd.DataFrame(
    {
        "train_sizes": train_sizes,
        "train_errors": train_errors,
        "validation_errors": validation_errors,
    }
)
-When both curves have reached a plateau and they are close and fairly high, the model is underfitting; it cannot improve even if new data is added, so a more complex model is required
-If there is a gap between the curves, the model performs significantly better on training than on validation (overfitting); adding new data would slowly close the gap
Regularization:
-To constrain the weights of the model, add a direct penalty on large weights to the model's cost itself
-Ridge adds the squared weights, lasso the absolute values, each scaled by a factor alpha; increasing alpha increases the degree of punishment
-Which to use depends on the specific problem, but L1 tends to produce a sparse model that often sets many weights to zero, effectively declaring the corresponding attributes to be completely irrelevant
-This is because minimizing loss plus lambda * Complexity(h) is equivalent to minimizing loss subject to Complexity(h) <= c, where c relates to lambda; for L1 regularization the region Complexity <= c is a diamond whose vertices lie on the axes (the curve |x| + |y| = c in 2D), while for L2 it is a circle, and the weights must lie inside this diamond or ball (or their higher-dimensional analogues); the optimum tends to land on a vertex of the diamond, where some weights are exactly zero
CODING:
from sklearn.linear_model import Ridge, Lasso
Lasso(alpha=1)
Ridge(alpha=1)
OR in gradient descent:
SGDRegressor(..., alpha=1, penalty='l2')
FLOW:
-Requires a fully numerical and complete dataset, and preferably no outliers
-Decide whether to add polynomial features or not
-Always do some ridge regularization
-Decide whether to solve analytically or by gradient descent
>To choose alpha we can do a grid search over the hyperparameters, which tries all the combinations and finds you the best one:
CODING:
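A grid-search sketch with scikit-learn's GridSearchCV; the toy data and the alpha grid here are my own assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data: a noisy linear signal
rng = np.random.default_rng(42)
X = rng.random((100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.1, 100)

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}  # combinations to try
grid_search = GridSearchCV(Ridge(), param_grid, cv=5,
                           scoring="neg_mean_squared_error")
grid_search.fit(X, y)                            # cross-validates every combination
best_alpha = grid_search.best_params_["alpha"]
```

best_estimator_ then holds the refit model with the winning hyperparameters.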
----------------------------
Linear classifiers:
-A linear model can be made into a classifier by using a straight line as a decision boundary that separates the two classes; the data needs to be linearly separable
h(x) = sigma(w dot x)
-The sigma function would naturally be a heaviside step, but that cannot be trained by gradient descent, so we use the sigmoid, whose output can be interpreted as the probability that x falls in class 1
-The loss function is given as:
L(w) = expected cross entropy between our predicted and actual class distributions
[earlier it was the expected L2 loss over the difference between predicted and actual value]
-Written out, with yi the actual label (0 or 1) and sigma(w dot xi) the predicted probability of class 1:
L(w) = sum over i of [ -yi log(sigma(w dot xi)) - (1 - yi) log(1 - sigma(w dot xi)) ] / N
[when solving by gradient descent, setting the derivative to zero has no closed form for w*]
[from future: the gradient has the same (prediction - label) * input form as that of the L2 loss, except for the factor of 2]
-The gradient update rule has the Hebbian "neurons that fire together wire together" form: weight change proportional to error times input
CODING:
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(
    loss='log_loss',
    eta0=0.01,
    max_iter=1000,
    tol=0.001,
    learning_rate='invscaling',
    alpha=1,
    penalty='l2',
)
sgd_clf.fit(train_set, train_set_labels)
sgd_clf.predict_proba(train_set)
sgd_clf.predict(train_set)  # decision boundary at probability 0.5
Softmax classifiers:
-Logistic regression can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers
-Now there are many w vectors, one per class, each telling the probability of x falling into that class
-Hence,
h_i(x) = exp(w_i dot x) / (sum over j of exp(w_j dot x))
This is a function of x and every w there is
It is the probability that x falls into the i-th class; that exp business can be thought of as a generalized sigmoid
-The class with the highest probability is the prediction
-Loss function:
L(W) = expected cross entropy between the predicted and actual class distributions:
= sum over examples of [ sum over k of (actual probability of class k) * log(1 / predicted probability of class k) ]
Compactly, with y_k the one-hot actual labels:
L(W) = sum over examples of [ -sum over each class k of y_k log(softmax_k(Wx)) ] / N
-For two classes this is equivalent to logistic regression, so it is a proper generalization
-Notice that the loss function is the expected cross entropy where the actual distribution is the 0s and 1s from our dataset and the predicted one is our function's output; we minimize it as much as possible so that we are as certain as possible about the right class
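The two ingredients above, written out in numpy as a small sketch:

```python
import numpy as np

def softmax(z):
    """exp(z_i) / sum_j exp(z_j), shifted by max(z) for numerical stability."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(Y_true, P_pred):
    """Average cross entropy between one-hot labels and predicted distributions."""
    return -np.mean(np.sum(Y_true * np.log(P_pred + 1e-12), axis=1))
```

Equal scores give a uniform distribution, and the cross entropy is (near) zero exactly when the prediction puts all its mass on the true class.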
CODING:
-There is a separate LogisticRegression class, which uses other solvers under the hood rather than plain SGD
from sklearn.linear_model import LogisticRegression
softmax_reg = LogisticRegression(multi_class='multinomial')
softmax_reg.fit(train_set, train_set_labels)
softmax_reg.predict_proba(train_set)
Support Vectors:
-The logistic approach was linear models generalized, but there is another way to approach making a decision boundary in a dataset
Regression:
-We find the "street": parallel hyperplanes that contain as many data points inside as possible (the width is a tunable parameter); then we fit the line right through the middle of the street
CODING:
from sklearn.svm import LinearSVR
svm_reg = LinearSVR(epsilon=1)
svm_reg.fit(train_set, train_set_labels)
predictions = svm_reg.predict(train_set)
-Again, epsilon needs to be grid-searched for the best answer possible
-Under the hood this is solved by quadratic programming techniques, which are for gods only
-To use this requires the dataset to be:
>Fully numerical
>No missing data at all
>Better to minimize the outliers
>Attributes added and removed based on correlation
>Standard scaled, as it is a distance-based algorithm
1. OneHotEncoder() to convert non-numeric columns into numerics
2. SimpleImputer(strategy='median') for filling in the missing values
3. Custom
4. Custom
5. StandardScaler() provides mean and standard deviation based scaling
-Could be done using SGD where the loss is called 'epsilon_insensitive'
Classifiers:
-Here we find the street that separates the two classes as widely as possible, again controlled by a hyperparameter (C) that trades margin width against violations; we fit the widest parallel lines between the two classes
-If the data is not linearly separable then we succumb to margin violations, which is sometimes okay
CODING:
from sklearn.svm import LinearSVC
svm_clf = LinearSVC(C=1)
svm_clf.fit(train_set, train_set_labels)
predictions = svm_clf.predict(train_set)
-Can be done with SGD where the loss function is called 'hinge'
Polynomial featuring:
-Another way of dealing with non-linear data is to add polynomial features like we did earlier, which of course explodes combinatorially, but there is something called the kernel trick that gets the same effect without actually creating the features
from sklearn.svm import SVC
poly_kernel_svm_clf = SVC(kernel='poly', degree=3, coef0=1, C=1)
-The parameter coef0 controls how much low-degree terms influence the boundary relative to high-degree ones
Complexity: dreadfully bad for large datasets, so mind your dataset size
In summary, the linear fits and their polynomial extensions:
1. Hot-encode the categoricals
2. Impute missing values
3. Minimize the outliers (which could cause overfitting)
4. Add or remove attributes based on correlations
5. Polynomialize the attributes if you see the pattern
6. Standard scale the attributes
7. SVs or Linears and their regularizations, with SGD or not
[Softmax available for linear classifiers, kernel trick for SVCs]
8. Cross validate and plot the learning graph or the precision-recall tradeoff
9. Grid search if there are any parameters to tune
Tree classification:
-There is a choice of attribute at each node, and we branch according to that choice: so training a tree means choosing which attribute to split the current node on, and the threshold at which to split if it is a numerical attribute
-One popular algorithm is CART, which splits the set into two subsets using a single feature k and a threshold tk, found by searching for the (k, tk) pair that produces the purest subsets, weighted by their size
-Makes binary splits by:
J(k,tk) = (mleft/m) G_left + (mright/m) G_right
-G here is the impurity/uncertainty of the set, measured by Gini or Shannon entropy as follows:
>Gini: one minus the sum of squared probabilities of the classes in the branch after the split
>Shannon: information gain, i.e. current entropy minus the expected entropy after making the split
-Usually both give the same tree, but when it matters Shannon tends to generate more balanced trees, while Gini is computationally cheaper
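The Gini measure and the CART split cost above, as a runnable sketch:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum over classes of p_class^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def cart_cost(left, right):
    """CART split cost J(k, tk) = (m_left/m) G_left + (m_right/m) G_right."""
    m = len(left) + len(right)
    return len(left) / m * gini(left) + len(right) / m * gini(right)
```

A perfectly pure split scores 0; splitting two classes 50/50 into two identical halves scores the maximum 0.5.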
-Decision boundary: each node divides the space into rectangular axis-aligned boxes, where the axes are the attributes' domains
-Note that the algorithm assigns classes not only to leaf nodes but also to internal nodes, so if you cut off the tree early you still have an answer; if a node is equally impure between classes it just picks one of them
CODING:
from sklearn.tree import DecisionTreeClassifier
tree_clf = DecisionTreeClassifier(
criterion='gini',
max_depth=3,
min_samples_split=3 #how many samples must be in a node to consider split
)
tree_clf.fit(train_set, train_set_labels)
predictions = tree_clf.predict(train_set)
probabs = tree_clf.predict_proba(train_set)
tree_clf.feature_importances_ #how much each feature contributes to the decisions
Tree regression:
-The same thing, but now the predicted value is an actual number, calculated as the average over all the training instances that fall into that node
-Remember that each node is given a value, whether leaf or internal
-The CART function is:
J(k,tk) = (mleft/m) MSE_left + (mright/m) MSE_right where MSE is mean squared error
Ensembling: (we are doing for classif but same applies for regression as well)
-So far a single hypothesis did the prediction, right? The idea of ensemble learning is to select a collection, or ensemble, of hypotheses h1, h2, ..., hn and combine their predictions by averaging, voting, or another level of machine learning
-Why do this? The ensemble tends to beat its best individual member, by the law of large numbers, at the cost of training several individual models
-In practice independence is unreasonable, since the individual classifiers share some of the data and assumptions, and are thus not completely independent and will share some of the same errors
-But if the component classifiers are at least somewhat uncorrelated, then ensemble learning will make fewer misclassifications
CODING:
from sklearn.ensemble import VotingClassifier
voting_clf = VotingClassifier(
estimators = [('lr', log_clf), ('sv', sv_clf), ('dt', dec_clf)],
voting = 'hard'
)
voting_clf.fit(train_set, train_set_labels)
predictions = voting_clf.predict(train_set)
>If each of them has predict_proba() (SVMs usually don't), you can do voting='soft', which gives more weight to confident votes
[there is VotingRegressor]
Bagging:
-Same model but different datasets: generate K distinct training sets by sampling with replacement from the original training set, i.e. we randomly pick N examples from the training set, but each of those picks might be an example picked before
-Then learn a hypothesis on each such sample, repeating the process K times to get K different hypotheses; when asked to predict the value of a new input, we aggregate the predictions from all K hypotheses
-For classification that means taking the plurality vote
-For regression the output is the average
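The bagging-for-regression recipe above as a plain numpy sketch, with a straight-line fit (my choice) as the base hypothesis:

```python
import numpy as np

def bagged_predict(X, y, X_query, K=10, seed=0):
    """Bagging for regression: K bootstrap samples (N draws WITH replacement),
    one straight-line fit per sample, prediction = average over K hypotheses."""
    rng = np.random.default_rng(seed)
    N = len(X)
    preds = []
    for _ in range(K):
        idx = rng.integers(0, N, size=N)        # bootstrap sample, repeats allowed
        coeffs = np.polyfit(X[idx], y[idx], 1)  # hypothesis learned on this sample
        preds.append(np.polyval(coeffs, X_query))
    return np.mean(preds, axis=0)               # aggregate by averaging
```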
Random forests:
-Bagging decision trees often ends up giving us K trees that are highly correlated: if there is one attribute with a very high information gain, it is likely to be the root of most of the trees. The random forest model is a form of decision tree bagging in which we take extra steps to make the ensemble of K trees more diverse, to reduce variance
-Can be used for both regression and classification
-Train each tree on the whole set, but sample a random subset of the attributes at each split and choose the best among the selected ones
-Really good at solving the instability problem of decision trees by putting the wisdom of the crowd to work
CODING:
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(
    # plus the hyperparameters of each decision tree
    n_estimators=200,
    n_jobs=-1,
)
rnd_clf.fit(train_set, train_labels)
predictions = rnd_clf.predict(train_set)
[there is RandomForestRegressor]
Boosting:
-Combine weak learners into a strong learner; the general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor
Gradient boosting:
-Fit each upcoming model to the difference between the actual training labels and the predictions so far, then produce the final result by summing the models' outputs: say the first model predicted y' but the output was y, then the next will be trained on y - y'
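The residual-fitting idea is easy to see by hand; a sketch using sklearn's DecisionTreeRegressor as the weak learner (the data here is a made-up example of mine):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 1-D data: y = x^2 plus a little noise
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, (200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.05, 200)

trees, residual = [], y.copy()
for _ in range(3):                          # three sequential weak learners
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residual)                   # fit to what is left unexplained
    residual = residual - tree.predict(X)   # next model trains on y - y'
    trees.append(tree)

y_pred = sum(tree.predict(X) for tree in trees)  # final result: sum of outputs
```

Each added tree eats into the remaining residual, so the summed prediction beats the first tree alone on training error.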
CODING:
import xgboost
xgb_clf = xgboost.XGBClassifier(
    n_estimators=200,
    n_jobs=-1,
    learning_rate=0.5,
)
xgb_clf.fit(train_set, train_labels)
predictions = xgb_clf.predict(train_set)
[there is an XGBRegressor too]
In summary,
1. Direct implementation of decision trees, random forests and xgboost, both regression and classification
[supports multiclass classification where each example belongs to exactly one class]
2. Cross validate and plot the learning graph or the precision-recall tradeoff
3. Grid search if there are any parameters to tune
.......
All upcoming algorithms are numerical-only and require a complete dataset +++
Non parametric models:
-The decision tree was one of the non-parametric models, which are quite prone to overfitting; when data is plentiful, why should we interfere, bro, why don't we let the data speak for itself
-In parametric modelling the data is converted into some weights and then forgotten; in non-parametric modelling the algorithm memorizes the data, the simplest case being a table lookup
Nearest-neighbour models:
-To improve on table lookup, find the k examples that are nearest to the query example xq according to some distance metric; write NN(k, xq) to denote the set of the k neighbors nearest to xq
-For classification, take the most common output class; to avoid ties, choose k to be odd
-For regression, take the mean or median, or do a linear fit on the neighbors themselves
-The decision boundary is kind of weird, man, and does not follow any nice polynomial pattern
-Is also subject to underfitting and overfitting just like parametric methods; for example 1-nearest-neighbors overfits, reacting too much to outliers, while in the book's example the 5-nearest decision boundary is quite good; as usual, cross-validation can be used to find the best value of k, brother
>Distance metrics:
-Typically distances are measured with the Minkowski distance, or Lp norm, defined as:
L^p(xj, xq) = [ sum over i of |xj,i - xq,i|^p ]^(1/p)
-With p = 2 this is the Euclidean distance and with p = 1 the Manhattan distance
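The Lp formula above is one line of numpy:

```python
import numpy as np

def minkowski(xj, xq, p=2):
    """L^p(xj, xq) = (sum_i |xj_i - xq_i|^p)^(1/p)."""
    return np.sum(np.abs(xj - xq) ** p) ** (1.0 / p)
```

For the same pair of points, p = 2 gives the straight-line distance and p = 1 the sum of coordinate differences.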
-Euclidean is used if the dimensions measure similar properties, such as the width, height and depth of parts; Manhattan if they are dissimilar, such as the age, weight and gender of patients
-If the units differ, the common approach is to apply normalization to the measurements in each dimension: compute the mean and standard deviation in each dimension and rescale x as x -> (x - mean) / standard_deviation; the more complex Mahalanobis distance also takes into account the covariance between dimensions
-Choose the k neighbors and do plurality voting, or apply regression on them directly, and that's the way to go
-When the dimensionality of the features is high, the distance between any two points is on average pretty damn high (the curse of dimensionality)
CODING:
from sklearn.neighbors import KNeighborsClassifier
neighbor_clf = KNeighborsClassifier(
    n_neighbors=5,
    p=2,
    metric='minkowski',
)
[there is also KNeighborsRegressor]
Bayes Classifier:
-Let D represent all the data, with observed value d; the key quantities in the Bayesian approach are the hypothesis prior P(hi) and the likelihood of the data under each hypothesis, P(d|hi)
-The probability of each hypothesis is obtained by Bayes' rule:
P(hi | d) is proportional to P(d | hi) P(hi)
-A further simplification is provided by assuming a uniform prior over the space of hypotheses; in that case MAP learning reduces to choosing the hi that maximizes P(d|hi), called the maximum likelihood hypothesis
-So here we assume a distribution family for the hypotheses that generate the examples, and the Bayesian machinery provides the MLE parameters of that distribution
-In naive Bayes each feature is assumed to follow its own simple distribution (Bernoulli here), which is quite unlikely to hold individually, so the features have to be used in unison
CODING:
from sklearn.naive_bayes import BernoulliNB
naive_clf = BernoulliNB()
naive_clf.fit(train_set, train_labels)
naive_clf.predict(train_set)
naive_clf.predict_proba(train_set)
PCA projection:
-Sometimes we need to reduce the dimensionality of the features, either to increase training speed or for visualization purposes
-One major approach is projection: in the n-dimensional space we choose a d-dimensional hyperplane and project onto it; which hyperplane to choose is decided by projecting the dataset onto candidate axes and keeping those along which the variance is highest, as that means the least information is lost
-The axes of the d-dimensional hyperplane are called components, and finding them is easy: just take the SVD of the training set matrix, where the V part of the decomposition gives the component vectors (see the visualization of SVD)
-Sadly this just means training time is reduced; it does not necessarily mean performance is improved, and sometimes it can even degrade
CODING:
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
train_set_2d = pca.fit_transform(train_set)
pca.explained_variance_ratio_ #gives the fraction of the variance captured by each component
>To choose the right number of dimensions for just decomposition do:
pca = PCA()
pca.fit(train_set)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
>Now we could do n_components = d which preserves 95% of the variance
K-means clustering:
-To identify groups of similar-looking objects; just like in classification each instance gets assigned to a group, but without supervision
-K-means designates k blobs and assigns each data point to one of them: first it randomly initializes the centroids of the blobs, and points are assigned to whichever centroid they are closest to under the distance metric
-It proceeds iteratively: assign points to the closest centroid's blob, then recompute each blob's centroid (just the mean of all its points), and so on
-Does not necessarily reach the global minimum; it may get stuck in a local minimum as well
-If there are no clear blobs then the quality of this thing decreases significantly
CODING:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5, init=[[centroid 1], [centroid 2], [], []], n_init=1)
kmeans.fit(train_set)
predictions = kmeans.predict(train_set) #returns an array of cluster indices
kmeans.cluster_centers_ #the list of centroids found
kmeans.transform(train_set) #distance from each centroid for every example, so essentially a k-dimensional decomposition
\section{Deep Learning}
Note: overfitting is hard to measure directly, so we use a validation set, while the test set is another kind of validation set
“when a measure becomes a target, it ceases to be a good measure”
If you have many millions of examples, take a few thousand of them and plot them
Numpy also supports broadcasting as long as the trailing axes match
GO DEEP FOR LEARNING:
-Perceptron:
-One of the simplest ANN architectures, with a step activation function and no explicit loss function; trained on the principle that neurons that fire together wire together, i.e. when two neurons have the same output their connection is strengthened
weights(next step) = weights(previous) + eta * (y_actual - y_predicted) * input
-Gives the linearly separable decision boundary
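The perceptron rule above in a few lines of numpy (a sketch; the AND-gate data in the usage is my own toy example):

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=20):
    """Perceptron rule: w += eta * (y_actual - y_predicted) * x.
    X is assumed to include the bias input x0 = 1; y holds 0/1 labels."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w >= 0 else 0   # step activation
            w = w + eta * (yi - pred) * xi   # update only on mistakes
    return w
```

On linearly separable data such as the AND function, this converges to a separating boundary in a handful of epochs.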
-Now general class of neurons trained using gradient descent:
-Recall that you need to provide these things for the linear models, where the training method is of course always gradient descent with its parameters:
1. Number of input features
2. Activation function:
-We did not really specify one for regression; just assume it was the identity (a linear line)
-For classifiers it was the sigmoid and softmax
-The activation function determines how well gradient descent works
3. A differentiable loss function, so we can minimize the expected loss over examples:
-L2 and cross entropy are such loss functions
-We minimize them over the examples, assuming a uniform distribution of 1/N
-Let's visualize the three general linear models, regression, binary classifier and multiclass classifier, using neurons:
4. Regression: you have n input neurons as values (and of course x0 = 1) fed to a neuron drawn as (sum symbol | activation function symbol), and it provides the output
5. Binary clf: same, but with the sigmoid as the activation function symbol
6. Softmax: now the case is different: we have n input neurons as values of the features and another layer of neurons, equal in number to the number of classes; each class neuron gets its input from all the previous neurons, and the softmax is drawn not inside a ball but as a rectangle spanning the whole output layer
[Hence each layer of neurons has an x0 = 1 in it, which we leave out of the picture for ease]
-An MLP stacks layers (input and output layers, and now hidden layers) together
-Trained using just two passes through the network (one forward and one backward); backpropagation can find how each connection weight should be tweaked to reduce the output error, and the process repeats until convergence, just like GD
-Here is what happens in GD:
>You already have some weights initialized to some values
>For some mini-batch (a full pass through the whole training set is called an epoch) you record the output for that batch
>You use the difference between the predicted outputs and the actual outputs to update the weights
-What happens in backpropagation, in detail:
>You initialize the weights, preferably randomly, to train a diverse team of neurons
>For some mini-batch you record the output, but also the outputs of the intermediate neurons
>Now you use the difference between predicted and actual output, and each neuron also uses its intermediate value, to update the weights
Question: does each neuron in every layer require an activation function? Yes, because without one you just have linear maps composed in a line, which is still linear, so you need some kind of nonlinearity there
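A quick NumPy check of this point: stacking linear layers with no activation collapses to a single linear map, so depth buys nothing.

```python
import numpy as np

# Two stacked linear layers: W2 @ (W1 @ x) equals (W2 @ W1) @ x,
# a single linear map, no matter how many such layers you stack.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

stacked = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x
```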
-Regression MLP architecture:
#input neurons: one per feature
#hidden layers: depends on problem, typically 1 to 5
#neurons per hidden layer: depends on problem, typically 10 to 100
Hidden activation: ReLU
Output activation: none, or ReLU if the output must be positive
Loss function: MSE, or Huber if there are outliers
CODING:
#sequential API of keras
model = keras.models.Sequential()
#just reshapes each 28x28 input image into a flat vector of 784 values:
model.add(keras.layers.Flatten(input_shape=[28, 28]))
#hidden dense layers with their respective neuron counts:
model.add(keras.layers.Dense(300, activation="relu"))
model.add(keras.layers.Dense(100, activation="relu"))
#Each Dense layer holds a matrix of weights (one vector of weights per neuron) and a vector of bias terms (one per neuron) and computes the following:
h_{W,b}(X from previous layer) = phi(XW + b)
#the bias is written separately, as is usual in deep learning
#Except of course the input (Flatten) layer, which takes the raw (non-vector) entity and throws out the vector one
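A minimal NumPy sketch of what one Dense layer computes; the shapes match the MNIST-style example above, and the small random weights are an illustrative assumption:

```python
import numpy as np

def dense_forward(X, W, b, phi=lambda z: np.maximum(z, 0.0)):
    # One Dense layer: h_{W,b}(X) = phi(X @ W + b), ReLU by default.
    # W: (n_inputs, n_neurons), b: (n_neurons,), X: (batch, n_inputs)
    return phi(X @ W + b)

rng = np.random.default_rng(42)
X = rng.normal(size=(5, 784))           # batch of 5 flattened 28x28 images
W = rng.normal(size=(784, 300)) * 0.01  # small random weights
b = np.zeros(300)                       # biases start at zero, like Keras
h = dense_forward(X, W, b)              # shape (5, 300), all values >= 0
```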
Alternatively, you can build the same model by passing the layers directly:
model = keras.models.Sequential([
keras.layers.Flatten(input_shape=[28, 28]),
keras.layers.Dense(300, activation="relu"),
keras.layers.Dense(100, activation="relu"),
keras.layers.Dense(10, activation="softmax")
])
>Review the model with model.summary(), which gives the following info:
1. Parameters of each layer, calculated as number of inputs * number of neurons + number of neurons, which is very large for 300 neurons and poses a risk of overfitting
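The parameter counts can be checked directly; for the Flatten(28x28) -> Dense(300) -> Dense(100) -> Dense(10) model above:

```python
# Parameters of a Dense layer = inputs * neurons + one bias per neuron.
def dense_params(n_inputs, n_neurons):
    return n_inputs * n_neurons + n_neurons

print(dense_params(28 * 28, 300))  # -> 235500
print(dense_params(300, 100))      # -> 30100
print(dense_params(100, 10))       # -> 1010
```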
>To refer to each of the layers do:
model.layers[index]
>Can call set_weights() and get_weights() on each layer:
weights, biases = layer_name.get_weights()  # returns numpy arrays
NOTE: The Dense layer initializes the weights randomly, but the biases are set to zero
>Can do our own initialization, though; more on that later
>The parameter counts appear because we provided input_shape; if you do not, Keras will simply wait until it knows the input shape before it actually builds the model, which happens when you feed it actual data or call its build() method
>Until the model is really built, the layers will not have any weights, and you will not be able to do certain things (such as print the model summary or save the model)
-The model is created, meaning that we have specified the input features and activations, but we have not given its loss function and training method
-To do that, we COMPILE the model:
model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
Here note:
>We use the sparse version because the labels are NOT one-hot encoded: our output is 10 softmax values, while the labels are class indices like [1] for coat, [2] for jacket, and so on; with one-hot labels you would use "categorical_crossentropy"
>If we were doing multilabel classification with sigmoid outputs, we would use the binary cross-entropy loss function
>Finally, having built and compiled the model, it is time to train:
history = model.fit(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid))
#instead of actually passing a validation set you could use validation_split=0.1, which Keras will split off for you
>What if, like the previous problem, the dataset is skewed? There are arguments you can pass for that (e.g. class_weight)
>Returns a history object containing:
-The training parameters (history.params)
-The list of epochs it went through (history.epoch)
-A dictionary (history.history) containing the loss and metrics measured at the end of each epoch on the training set and on the validation set
>Can convert it into a DataFrame and plot it to see the loss and accuracy as a function of epochs
>If you then find from the curves that you could do better by increasing the epochs, just call fit() again, since Keras continues training where it left off
........
>Now evaluate on the test set:
>model.evaluate(X_test, y_test) to check performance on held-out data
>Get your predictions:
y_pred = model.predict(X_new)
returns one probability per class for each instance; take the argmax to get the predicted class
>Regression: same way
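Since predict() returns class probabilities, here is a NumPy sketch of turning them into class labels (the logits are made up for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# One instance, three classes: probabilities sum to 1, argmax is the label.
logits = np.array([[1.0, 3.0, 0.5]])
probs = softmax(logits)
labels = probs.argmax(axis=1)  # -> [1]
```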
Functional API:
-Lets you connect all or part of the inputs directly to the output layer, which makes it possible for the neural network to learn both deep patterns (through the deep path) and simple rules (through the short path)
-In contrast, a regular MLP forces all the data to flow through the full stack of layers, so simple patterns in the data may end up being distorted by this sequence of transformations
-For the housing problem:
input = keras.layers.Input(shape=X_train.shape[1:])
#needed because we may have multiple inputs; the point is to get scalars or vector arrays into the same format
hidden1 = keras.layers.Dense(30, activation="relu")(input)
#this tells the functional API how to connect the layers together; no actual data is being processed; fewer neurons than in the classifier because the data is noisy and we do not want to overfit
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
#pass the first hidden layer to the second hidden layer
concat = keras.layers.Concatenate()([input, hidden2])
#inputs come in through the hidden layers and also directly, and we concatenate the two paths for the reason mentioned above
output = keras.layers.Dense(1)(concat)
#the concat layer finally goes into the output layer
model = keras.models.Model(inputs=[input], outputs=[output])
#we used models.Sequential in the sequential API; here we use Model, passing the input layers and output layers
#Now everything is the same as earlier: compiling, fitting and predicting
MULTIPLE INPUTS:
-Above, all the input features went to the output through the same layers, but sometimes you want some features to go through deep layers and others directly to the output:
-The two sets of features could overlap as well
-A complete example where we send 5 features (0 to 4) through the wide path and 6 features (2 to 7) through the deep path:
input_A = keras.layers.Input(shape=[5])
input_B = keras.layers.Input(shape=[6])
hidden1 = keras.layers.Dense(30, activation="relu")(input_B)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.concatenate([input_A, hidden2])
#input_A features come in directly while input_B features come through the deep layers
output = keras.layers.Dense(1)(concat)
model = keras.models.Model(inputs=[input_A, input_B], outputs=[output])
MULTIPLE OUTPUTS:
-There are also cases where you want multiple outputs, e.g. locating and classifying the main object in a picture, which is both a regression task (for the coordinates) and a classification task
-Similarly, you may have multiple independent tasks to perform on the same data; you could train one neural net per task, but in many cases you get better results with a single net, as it can learn features in the data that are useful across tasks
-Another use is as a regularization technique: you may want to add some auxiliary outputs to ensure that the underlying part of the network learns something useful on its own, without relying on the rest of the network; it predicts the same thing, but via a different route
# same as the multiple-inputs example above
output = keras.layers.Dense(1)(concat)
aux_output = keras.layers.Dense(1)(hidden2)
model = keras.models.Model(inputs=[input_A, input_B], outputs=[output, aux_output])
-So what do we pass to compile?
-We need a single global loss function, which we choose to be a weighted sum of the losses of the individual outputs:
model.compile(
    loss=["sparse...", "sparse..."],  # one loss per output
    loss_weights=[0.9, 0.1],  # give 0.1 to the aux output, as it is only for regularization
    optimizer="sgd",
    metrics=["accuracy"]  # reported for each output layer
)
>Now fitting:
model.fit(
    [train_X_A, train_X_B], [train_y, train_y],
    epochs=20,
    validation_data=([valid_X_A, valid_X_B], [valid_y, valid_y])
)
Evaluation works the same way.
Subclassing API:
-Some models involve loops, varying shapes, conditional branching, and other dynamic behaviours; for these you subclass keras.models.Model, create the layers you need in the constructor, and implement the computation in the call() method
Saving a model:
-Saving and loading a trained model:
model.save("my_keras_model.h5")
model = keras.models.load_model("my_keras_model.h5")
Callbacks:
-Passed into the fit() call and invoked during training (e.g. at the end of each epoch)
2. EarlyStopping interrupts training when it measures no progress on the validation set for a number of epochs (defined by the patience argument) and rolls back to the best model; if rollback is not enabled, it keeps the most recent model
early_stopping_cb = keras.callbacks.EarlyStopping(
    patience=10,
    restore_best_weights=True
)
history = model.fit(..., callbacks=[early_stopping_cb])
which waits through 10 epochs with no improvement and also restores the best weights
3. If we use a validation set during training and want to save the model whenever it is the best so far:
checkpoint_cb = keras.callbacks.ModelCheckpoint("my_keras_model.h5", save_best_only=True)
history = model.fit(..., callbacks=[checkpoint_cb])
VISUALIZATION:
-
Fine-tuning hyperparameters:
-The simplest way is sklearn's GridSearchCV or RandomizedSearchCV:
-First step is to wrap the Keras models in objects that mimic regular scikit-learn estimators, starting with a function that builds and compiles a Keras model given a set of hyperparameters:
#Here every hidden layer gets the same number of neurons:
def build_model(n_hidden, n_neurons, learning_rate, input_shape):
    model = keras.models.Sequential()
    model.add(keras.layers.Flatten(input_shape=input_shape))
    for layer in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons, activation="relu"))
    model.add(keras.layers.Dense(10, activation="softmax"))
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=keras.optimizers.SGD(lr=learning_rate))
    return model
keras_clf = keras.wrappers.scikit_learn.KerasClassifier(build_model)
-Guidelines:
4. Number of hidden layers:
-An MLP with one hidden layer can model complex functions provided it has enough neurons
-But deep networks have a much higher parameter efficiency than shallow ones: they can model complex functions using exponentially fewer neurons than shallow nets, allowing them to reach much better performance with the same amount of training data
-Lower hidden layers model low-level structures (e.g. line segments of various shapes and orientations), intermediate hidden layers combine these low-level structures to model intermediate-level structures (squares, circles), and the highest hidden layers and the output layer combine these intermediate structures to model high-level structures, e.g. faces
-This not only takes advantage of hierarchical data but also improves the ability to generalize to new datasets: with a face-trained net, you can directly reuse its lower layers for hairstyle recognition instead of starting from scratch: this is called transfer learning
-Without transfer learning, it is usually fine to start with just one or two hidden layers
-Can also increase the number of layers until the model starts overfitting
5. Number of neurons per hidden layer:
-The neuron counts in the input and output layers are determined by the application
-For hidden layers, a common practice was to size them to form a pyramid, with fewer and fewer neurons at each layer, the rationale being that many low-level features can coalesce into far fewer high-level features, like 300 to 200 to 100
-However this practice is dated, and you can get equal or better results using the same number of neurons in each layer, the upside being that you only need to tune one hyperparameter
-Also you can increase the number of neurons until the model starts overfitting
6. Midway:
-A simpler approach is to use more layers and neurons than required, then use early stopping to prevent overfitting, plus the other regularization techniques covered later
7. Learning rate:
-An optimal learning rate is about half the maximum learning rate (i.e. the rate above which the training algorithm diverges)
-A simple approach is to start with a large value that makes the algorithm diverge, divide this value by 3 and try again, and repeat until the algorithm stops diverging
8. Batch size:
-Usually not larger than 32; a small batch size keeps training fast, and although a large batch size gives a more precise gradient estimate, that matters little since the optimization landscape is quite complex
-A batch size that is too small is also bad, because it does not exploit hardware optimization opportunities such as vectorization
9. Number of epochs:
-Does not really matter, just use early stopping
10. Activation functions:
-Generally ReLU for the hidden layers; the output layer's activation depends on the task
11. Optimizer:
-See the faster optimizers later in these notes (momentum, RMSProp, Adam, ...)
TRAINING DEEP NEURAL NETWORKS: -Remember the gradient of the loss function was such that the difference between actual and predicted outputs was transformed by the transpose of the example matrix -In deeper networks each neuron has its own weights, and every weight has its own gradient -Let us do backpropagation now: -Since hidden layers do not have a target output, we cannot define an error function for gradient descent directly -Any error function for a node will depend on the values of the parameters in the previous layers (its inputs) and the following layers -Notation:
x[l] = g[l](W[l] @ x[l-1] + b[l])
#this is the single-example version, equivalent to the earlier linear-regression case, not the all-examples (matrix) version
The subscript [l] denotes belonging to layer l:
x[l] is the output of layer l
W[l] is the weight matrix of its neurons, arranged row-wise
b[l] is the vector of intercepts, one per neuron
g[l] is the activation function of that layer
so the overall output of an N-layer network (excluding the input layer) is
x[N] = g[N](W[N] g[N-1](W[N-1] ... g[1](W[1] x + b[1]) ... + b[N-1]) + b[N])
-Note the cost function is the expected loss over all the examples, so it is a function of the whole network's weights, from input to output layers
-The gradient descent way is to take the gradient of this loss
-The problem is exactly this: we have no idea how to take this gradient directly
-The gradient of the cost function gives a direction, but instead of thinking of it like that, it is better to think of each component as a measure of the loss's sensitivity to the corresponding weight
-Some complex problems, such as detecting hundreds of types of objects in high-resolution images, need much deeper DNNs, perhaps with 10 layers or many more, each containing hundreds of neurons, which is no walk in the park
-Problems:
1. The tricky vanishing gradients problem (or the related exploding gradients problem) that affects deep neural networks and makes lower layers very hard to train
2. Not enough data for such a large network
3. Training may be extremely slow
4. A model with millions of parameters would severely risk overfitting the training set
5. Vanishing and exploding gradients:
-Popular before were the sigmoid activation and weight initialization from a normal distribution with mean 0 and variance 1; it was shown that, going forward in the network, the variance keeps increasing after each layer until the activation function saturates at the top layers, made worse by the sigmoid's mean of 0.5
-The reason is that the variance of the outputs of each layer is much greater than the variance of its inputs
-The signal needs to flow properly in both directions: forward when making predictions and in reverse when backpropagating gradients; we do not want the signal to die out or explode, which requires the variance of each layer's outputs to equal the variance of its inputs, and likewise in the backward direction
-This is not possible unless fan_in = fan_out, but a good compromise was proposed: the connection weights of each layer must be initialized randomly as described:
Xavier (Glorot):
Normal distribution with mean 0 and variance 1/fan_avg
or uniform between -r and +r, with r = sqrt(3/fan_avg)
-If you use fan_in instead of fan_avg you get LeCun initialization
The main ones are:
Initialization: activations; variance
Glorot: none, tanh, logistic, softmax; 1/fan_avg
He: ReLU and variants; 2/fan_in
LeCun: SELU; 1/fan_in
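A sketch of sampling these initializations in NumPy (the layer sizes are arbitrary; He initialization uses variance 2/fan_in, suited to ReLU and its variants):

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_normal(fan_in, fan_out):
    # Glorot/Xavier: normal with mean 0 and variance 1/fan_avg
    fan_avg = (fan_in + fan_out) / 2.0
    return rng.normal(0.0, np.sqrt(1.0 / fan_avg), size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    # He: normal with mean 0 and variance 2/fan_in
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

W = glorot_normal(300, 100)    # empirical std should be near sqrt(1/200)
W_he = he_normal(300, 100)     # empirical std should be near sqrt(2/300)
```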
Non-saturating activation functions:
-ReLU does not saturate for positive values and is also quite fast to compute. Not perfect though, as it suffers from a problem known as dying ReLUs: during training some neurons effectively die, meaning they stop outputting anything other than 0, especially with a large learning rate; a neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs is negative for all instances in the training set
-Leaky ReLU is max(alpha*z, z): give the flat zero part of ReLU a small slope; alpha = 0.01 is a small leak, while a larger leak (around 0.1 or 0.2) can give even better performance
-Can make alpha a parameter learned through backpropagation: performs strongly on large datasets but overfits on smaller ones
-A newer activation that beats all the ReLU variants is the exponential linear unit, ELU = { alpha*(exp(z)-1) if z < 0 ; z if z >= 0 }, whose negative tail is a smooth exponential curve. It has a few major differences: it takes negative values when z < 0, which allows the unit to have an average output closer to 0 and helps alleviate the vanishing gradients problem (alpha is generally 1); second, it has a non-zero gradient for z < 0, which avoids the dead neurons problem; third, if alpha is 1 the function is smooth everywhere, including at z = 0, which speeds up gradient descent; the downside is that it is slower to compute, but training converges faster; the remaining problem is slower inference in deployment
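The ELU definition above, written out in NumPy:

```python
import numpy as np

def elu(z, alpha=1.0):
    # ELU: z if z >= 0, alpha * (exp(z) - 1) otherwise.
    # np.minimum keeps exp() from overflowing on large positive z.
    return np.where(z >= 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1))

assert elu(np.array([2.0]))[0] == 2.0           # identity for z >= 0
assert elu(np.array([0.0]))[0] == 0.0           # smooth through the origin
assert np.isclose(elu(np.array([-100.0]))[0], -1.0)  # saturates at -alpha
```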
-....
Batch normalization:
-Adds an operation in the model just before or after the activation function of each hidden layer:
>simply zero-centering and normalizing each input
>then scaling and
>shifting the result, using two new parameter vectors per layer (one for scale, one for shift)
-In many cases, if you add a BN layer as the very first layer of your neural network, you do not need to standardize your training set; it only approximates standardization (since it looks at one batch at a time), but it can also rescale and shift each input feature
mu_B = mean of the batch's vectors x_i
sigma_B^2 = variance of the x_i around that mean
normalized xhat_i = (x_i - mu_B) / sqrt(sigma_B^2 + e), where e avoids division by 0
final scaled and shifted output z_i = gamma (hadamard) xhat_i + beta
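The four steps above as a minimal NumPy sketch; the batch is random data, and with gamma = 1 and beta = 0 the effect is pure standardization:

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch, then scale and shift.
    mu = X.mean(axis=0)                  # mu_B
    var = X.var(axis=0)                  # sigma_B^2
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + beta

rng = np.random.default_rng(1)
X = rng.normal(5.0, 3.0, size=(64, 4))   # far from zero mean / unit variance
out = batch_norm(X, gamma=np.ones(4), beta=np.zeros(4))
# out now has (approximately) zero mean and unit variance per feature
```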
-The model learns the scaling and shifting parameters using backpropagation
-Wait, what about test time? We have to make predictions for individual instances rather than for batches, so we have no batch mean or standard deviation; even if we have a batch of instances, it may be too small, or the instances may not be independently and identically distributed
-One solution is to wait until the end of training, then run the whole training set through the network and compute the mean and standard deviation of each input of the BN layer; these final input means and SDs can then be used instead of the batch input means and SDs when making predictions
-However it is often preferred to estimate these final statistics during training, using a moving average of the layer's input means and SDs; to sum up, four parameter vectors are kept per BN layer: the first two (scale and shift) learned by backpropagation, the other two estimated using an exponential moving average
-This reduced the vanishing gradients problem to the point that they could use saturating activation functions such as tanh and even the logistic function
-It also acts like a regularizer, reducing the need for other regularization techniques such as dropout
-As usual, each epoch takes more time when you use batch normalization, but convergence is much faster
-Keras provides this as the keras.layers.BatchNormalization() layer
-The BN layer has 4 * (number of inputs) parameters, half of which are not affected by backprop and are hence termed non-trainable
-Here BN is used after the activation, but the original authors argued for placing it before; which one to choose seems to depend on the task
-If you place it before, you have to create separate activation layers, and you can remove the bias terms of the previous layers with use_bias=False, because the bias is accommodated by the BN layer's shift parameters
Hyperparameters:
-The BN layer has quite a few hyperparameters; the defaults are good, but sometimes you need to tweak momentum, which is used when updating the exponential moving averages: given a value v (a new vector of input means or SDs computed over the current batch), the running average vhat is updated as:
vhat = vhat * momentum + v * (1 - momentum)
-Usually momentum is close to 1, e.g. 0.9 or 0.99, meaning the previous vhat has the larger contribution
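The running-average update, traced by hand for two steps:

```python
import math

def ema_update(vhat, v, momentum=0.9):
    # The update BatchNormalization uses for its test-time statistics.
    return vhat * momentum + v * (1 - momentum)

vhat = 0.0
vhat = ema_update(vhat, 10.0)  # 0.0 * 0.9 + 10.0 * 0.1 = 1.0
vhat = ema_update(vhat, 10.0)  # 1.0 * 0.9 + 10.0 * 0.1 = 1.9
```

With momentum close to 1, vhat drifts only slowly toward new batch statistics, smoothing out batch-to-batch noise.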
-Another hyperparameter is axis, which selects the axis to normalize over (the last axis by default)
-BN is among the most used layers in deep neural networks, to the point that it is often omitted in diagrams, as it is assumed after every layer
-However, a recent paper may change this: by introducing a new weight initialization technique, the authors managed to train a very deep neural network (10,000 layers) without BN
Gradient clipping:
-Another popular way to lessen the exploding gradients problem is to simply clip the gradients during backprop so that they never exceed some threshold; this is called gradient clipping
-Mostly used in RNNs, as BN is tricky there
-In Keras, implementing gradient clipping is just a matter of setting the clipvalue or clipnorm argument when creating an optimizer:
optimizer = keras.optimizers.SGD(clipvalue=1.0)
-This will clip every component of the gradient vector to a value between -1 and 1, which means that all the partial derivatives of the loss with respect to every trainable parameter will be clipped between -1 and 1
-The threshold is a hyperparameter you can tune
-Clipping by value may change the orientation of the gradient vector, which still works in practice; to make sure the orientation is not changed, use clipnorm=1.0 instead, which clips the whole gradient if its L2 norm is greater than the threshold you picked, preserving its orientation
Reusing pretrained layers:
-Reuse the lower layers of a pretrained network, which not only speeds up training but also requires less data
-Works best when the inputs have similar low-level features; the more similar the tasks are, the more layers you want to reuse; you need to find the right number of layers to reuse
1. Try freezing all the reused layers first (say we reused all but the output layer) by making their weights non-trainable, then train your model and see how it performs; then try unfreezing one or two of the top hidden layers to let backprop tweak them and see if performance improves
2. If you still cannot get good performance and have enough training data, try replacing the top hidden layers or even adding more
#Given some softmax model A, we need to learn a model B that does a similar task with a sigmoid output
model_A = keras.models.load_model(...)
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))
-Now model_B_on_A shares layers with model_A, so training it will affect model A as well; to avoid this, clone it first:
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())
-For the first few epochs:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False
model_B_on_A.compile(...)
history = model_B_on_A.fit(...)
-Now train the model for a few epochs, then unfreeze the reused layers (which requires compiling the model again) and continue training to fine-tune the reused layers for task B:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True
optimizer = keras.optimizers.SGD(lr=1e-4)  # lower learning rate, to avoid wrecking the reused weights
model_B_on_A.compile(...)
history = model_B_on_A.fit(...)
-The result is much better than what would have been achieved by training from scratch
-But the author admits he cheated by trying many configurations until he found one that worked, i.e. torturing the data until it confesses
-This is because transfer learning does not work very well with small dense networks; it works best with deep convolutional networks
Unsupervised pretraining:
-No labeled data and no pretrained model, then what?
-Geoffrey Hinton used this to revive deep learning: train each layer one by one, and finally train the last layer using the available labeled data
-Commonly used were restricted Boltzmann machines and autoencoders
Pretraining on an auxiliary task:
-Do pretraining on an auxiliary task for which labeled data is available; e.g. to recognize faces, you can reuse the lower layers of a network that was trained to classify whether two images are of the same person
Faster optimizers:
-So far, four ways to speed up training and reach a better solution:
1. Applying a good initialization strategy for the connection weights
2. Using a good activation function
3. Batch normalization
4. Reusing parts of a pretrained network
5. Momentum:
-Motivation: a ball rolling downhill picks up momentum until it eventually reaches terminal velocity (with some friction)
-Regular gradient descent takes small regular steps without caring what the previous gradients were
-Momentum optimization cares a great deal: at each iteration it subtracts the local gradient (times the learning rate eta) from the momentum vector m, and it updates the weights by simply adding this momentum vector
-Hence the gradient is used as an acceleration, not as a speed
-There is also a beta parameter for tuning, usually around 0.9, simulating the friction
m <- beta * m - eta * grad J(theta)
theta <- theta + m
-One can verify that if the gradient remains constant, the terminal velocity (the maximum size of the weight updates) equals that gradient multiplied by the learning rate eta multiplied by 1/(1 - beta)
-If beta = 0.9, the terminal velocity is 10 times the gradient times the learning rate, so momentum optimization ends up going 10 times faster than gradient descent
-Hence it escapes plateaus much faster
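The terminal-velocity claim can be checked numerically with a constant gradient:

```python
import math

# Iterate m <- beta * m - eta * g with a constant gradient g; m converges to
# -eta * g / (1 - beta), i.e. 10x the plain gradient-descent step for beta = 0.9.
eta, beta, g = 0.1, 0.9, 1.0
m = 0.0
for _ in range(200):
    m = beta * m - eta * g
```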
-In keras:
optimizer = keras.optimizers.SGD(lr=0.001, momentum=0.9)
6. Nesterov accelerated gradient:
-Almost always faster than vanilla momentum
-The idea is to measure the gradient of the cost function not at the local position but slightly ahead in the direction of the momentum:
m <- beta * m - eta * grad J(theta + beta*m)
theta <- theta + m
-This works because, in general, the momentum vector will be pointing in the right direction (towards the optimum), so it is slightly more accurate to use the gradient measured a bit farther in that direction rather than the gradient at the original position
-In keras:
optimizer = keras.optimizers.SGD(lr=0.001, momentum=0.9, nesterov=True)
7. AdaGrad:
-Scales down the gradient vector along the steepest dimensions:
s <- s + grad J(theta) (hadamard) grad J(theta)
theta <- theta - eta * grad J(theta) (hadamard division) sqrt(s + e)
-The first step accumulates the squares of the gradients into the vector s; the second step is almost identical to gradient descent, but the gradient is scaled down by a factor of sqrt(s + e), with e a smoothing term
-This is like decaying the learning rate, but faster for steep dimensions than for dimensions with gentler slopes
-It is an adaptive learning rate method
-Requires much less tuning of the learning rate hyperparameter
-Stops too early when training neural nets, as the learning rate gets scaled down so much that the algorithm ends up stopping entirely before reaching the global optimum
8. RMSProp:
-AdaGrad slows down a bit too fast and ends up never converging to the global optimum; RMSProp fixes this by accumulating only the gradients from the most recent iterations, using exponential decay:
s <- beta*s + (1-beta) [grad J(theta) (hadamard) grad J(theta)]
theta <- theta - eta * grad J(theta) (hadamard division) sqrt(s + e)
(so far we have covered momentum methods and adaptive-gradient methods)
-The decay rate is usually set to 0.9; in Keras:
optimizer = keras.optimizers.RMSprop(lr=0.001, rho=0.9)  # rho corresponds to beta
-Works almost always better than AdaGrad and was the usual choice for research until Adam came out
9. Adam:
-Combines the ideas of momentum optimization and RMSProp
-Just like momentum optimization, it keeps track of an exponentially decaying average of past gradients, and just like RMSProp it keeps track of an exponentially decaying average of past squared gradients:
m <- beta_1 * m - (1 - beta_1) grad J(theta)
s <- beta_2 * s + (1 - beta_2) grad J(theta) (hadamard) grad J(theta)
mhat <- m / (1 - beta_1^t)
shat <- s / (1 - beta_2^t)
theta <- theta + eta * mhat (hadamard division) sqrt(shat + e)
t is the iteration number (starting at 1)
-beta_1 is typically 0.9 and beta_2 is usually 0.999
-The normalization (bias correction) is there because m and s are initialized to 0, so they would be biased toward 0 at the beginning of training
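One Adam step in NumPy, following the sign convention of the equations above (m accumulates negative gradients). A side effect of the bias correction: at t = 1 the update magnitude is close to eta regardless of the gradient's scale.

```python
import numpy as np

def adam_step(theta, grad, m, s, t, eta=0.01, b1=0.9, b2=0.999, eps=1e-8):
    # Decaying average of gradients (momentum part) and squared gradients (RMSProp part)
    m = b1 * m - (1 - b1) * grad
    s = b2 * s + (1 - b2) * grad * grad
    # Bias correction: undo the pull toward the zero initialization
    m_hat = m / (1 - b1 ** t)
    s_hat = s / (1 - b2 ** t)
    theta = theta + eta * m_hat / np.sqrt(s_hat + eps)
    return theta, m, s

theta, m, s = np.array([0.0]), np.zeros(1), np.zeros(1)
theta, m, s = adam_step(theta, np.array([3.0]), m, s, t=1)
# first step moves theta by about -eta, even though the raw gradient is 3.0
```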
Two variants:
i. AdaMax:
NOTE: Nadam generally outperforms Adam, but is sometimes outperformed by RMSProp
-Sometimes a dataset may be allergic to adaptive gradients, so if model performance is poor, try Nesterov accelerated gradient instead
-Ongoing research: AdaBound, second-order methods
Sparse models:
-If you need a fast model at runtime, or one using less memory, you may prefer to end up with a sparse model instead
-One way is to train the model and then get rid of the tiny weights (set them to 0), but this will not lead to a very sparse model and may degrade the model's performance
-A better option is to apply strong l1 regularization during training, as it pushes the optimizer to zero out as many weights as it can
-If that is not sparse enough, a last option is to apply dual averaging, often called Follow The Regularized Leader (FTRL)
Learning rate scheduling:
-If the learning rate is set too high, training may actually diverge
-If set too low, training will eventually converge to the optimum, but it will take a very long time
-If set slightly too high, progress will be quick at first, but training will end up dancing around the optimum, never really settling down
-Taking more time means more computing power (not more data, presumably)
-One approach is to start with a large learning rate and divide it by 3 until the training algorithm stops diverging; you will then not be too far from the optimal learning rate, which learns quickly and converges to a good solution
...
Regularization:
-The flexibility that comes with so many parameters means the network is prone to overfitting the training set, so it needs regularization
-Early stopping is one of the best techniques
-Even though batch normalization was designed for vanishing or exploding gradients, it also acts as a good regularizer
-Other popular regularizers for neural networks are l1 and l2, dropout and max-norm
layer = keras.layers.Dense(
    100,
    activation="relu",
    kernel_initializer="he_normal",
    kernel_regularizer=keras.regularizers.l2(0.02)
)
-The l2() function returns a regularizer that will be called to compute the regularization loss at each step during training, and this regularization loss is then added to the final loss
Tip:
-You will typically want to apply the same regularizer to all layers in your network, as well as the same activation function and the same initialization strategy in all hidden layers
-Can use functools.partial(), which creates a thin wrapper around any callable with some default argument values:
from functools import partial
RegularizedDense = partial(keras.layers.Dense,
    activation="relu",
    kernel_initializer=...,
    kernel_regularizer=...
)
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    ...
])
Dropout:
-Highly successful; even just adding dropout gives a 1-2% accuracy boost
-At every training step, every neuron (including inputs but always excluding outputs) has a probability p of being temporarily dropped out, meaning it will be entirely ignored during this training step but may be active during the next step
-The hyperparameter p is called the dropout rate and is typically set to 50%
-Neurons trained with dropout cannot co-adapt with their neighboring neurons; they have to be as useful as possible on their own
-They also cannot rely excessively on just a few input neurons; they must pay attention to each of their input neurons
-They end up being less sensitive to slight changes in the inputs; in the end you get a more robust network that generalizes better
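In Keras (assuming tensorflow.keras, as in the earlier snippets; the layer sizes here are illustrative), dropout is just a layer placed after the layers whose outputs you want to drop:

```python
import numpy as np
from tensorflow import keras

# Illustrative model: each Dropout layer drops 20% of the previous layer's
# outputs at every training step (Dropout is a no-op at inference time).
model = keras.models.Sequential([
    keras.layers.Input(shape=[28, 28]),
    keras.layers.Flatten(),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax"),
])

y = model(np.zeros([1, 28, 28], dtype="float32"))
print(y.shape)  # (1, 10)
```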
....
Summary:
>If your model self-normalizes:
-if it overfits the training set then you should add alpha dropout (and always use early stopping as well)
-do not use other regularization methods, or else they would break self-normalization
>If your model cannot self-normalize (e.g., it is a recurrent net or it contains skip connections):
-you can try using ELU (or another activation) instead of SELU; it may perform better; make sure to change the initialization method accordingly
-if it is a deep network you should use batch normalization after every hidden layer; if it overfits the training set you can try max-norm or l2 regularization
>If you need a sparse model, you can use l1 regularization (and optionally zero out the tiny weights after training); for an even sparser model use FTRL instead of Nadam along with l1 regularization; in any case this will break self-normalization, so you will need to switch to BN if your model is deep
>If you need a low-latency model, you may need to use fewer layers, avoid batch normalization, and possibly replace SELU with leaky ReLU
>If you are building a risk-sensitive application, or inference latency is not very important, you can use MC dropout to boost performance and get more reliable probability estimates along with uncertainty estimates
Convolutional Neural Networks:
-Recognizing images is easy for humans because perception largely takes place outside the realm of our consciousness, within specialized visual, auditory, and other sensory modules in our brains
-By the time sensory information reaches our consciousness, it is already adorned with high-level features; for example, when you look at a picture of a cute puppy, you cannot choose not to see the puppy
-CNNs emerged from studies of the brain's visual cortex (I think different types of networks come up to model each part of the brain; another example would probably be memory networks)
-Many neurons in the visual cortex have a small receptive field, meaning they react only to visual stimuli located in a limited region of the visual field
-The receptive fields of different neurons may overlap, and together they tile the whole visual field
Convolutional layer:
-The neurons in the first convolutional layer are not connected to every single pixel in the input image but only to pixels in their receptive fields; in turn, each neuron in the second convolutional layer is connected only to neurons located within a small rectangle in the first layer
-This architecture allows the network to concentrate on small low-level features in the first hidden layer, then assemble them into larger higher-level features in the next hidden layer, and so on (the pixels of one rectangle are grouped into one neuron, and so on)
-Each layer can itself be represented in 2D, which makes it easier to match neurons with their corresponding inputs
-If you map the (i, j)-th neuron directly onto the previous layer and draw its receptive rectangle around it, the neuron depends on the neurons falling in that rectangle: the (i, j)-th neuron is connected to neurons in rows i to i+fh-1 and columns j to j+fw-1
-Toward the corners you need zero padding so that those pixels get used more than once; the (i, j)-th row and column then start from that extended image
-You can also go from a large input image to a smaller layer using strides:
-the (i, j)-th neuron uses rows i*sh to i*sh+fh-1 and columns j*sw to j*sw+fw-1
-given the input size and the window size, the output layer size follows
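The input-size / window-size / output-size relationship above can be sketched as a small helper (my own function, not from any library):

```python
import math

def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Spatial output size of a conv layer along one dimension."""
    if padding == "same":
        # zero-pad so the output size depends only on the stride
        return math.ceil(input_size / stride)
    # "valid": no padding, the window must fit entirely inside the input
    return (input_size - kernel_size) // stride + 1

print(conv_output_size(28, 3, stride=1, padding="valid"))  # 26
print(conv_output_size(28, 3, stride=2, padding="same"))   # 14
```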
Filters or convolutional kernels:
-A neuron's weights can be represented as a small image the size of its receptive field
-The example shows the weight images of two neurons, each with a central horizontal or vertical line, meaning all the weights except the central ones are zero
-If you take one neuron's filter and have all neurons use that same filter, you get a feature map; hence we say that the NN learned this feature map, which is nothing but the weights
-So each neuron in a feature map has the same weights
-The horizontal-line and vertical-line filters enhance those lines and blur everything else
Stacking feature maps:
-A convolutional layer has multiple filters (you decide how many) and it outputs one feature map per filter
-A neuron's receptive field is the same as earlier, but it extends across all the previous layer's feature maps
-In short, a convolutional layer simultaneously applies multiple trainable filters to its inputs, making it capable of detecting multiple features anywhere in its inputs
-The neurons in a feature map share the same parameters, which reduces the number of parameters in the model
-Once the CNN has learned to recognize a pattern in one location, it can recognize it in any other location
Keras:
-an image is typically represented as a 3D tensor of shape [height, width, channels]
-a minibatch is represented as a 4D tensor of shape [batch size, height, width, channels]
-the weights of a layer are represented as a 4D tensor of shape [fh, fw, fn', fn]
-fn is probably the number of feature maps in the current layer, while fn' is the number in the previous layer
-implemented using im2col
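A quick way to check these shapes (assuming tensorflow.keras; the filter count of 32 and the 28*28*3 input are made up for illustration):

```python
import numpy as np
from tensorflow import keras

conv = keras.layers.Conv2D(filters=32, kernel_size=3, padding="same")
x = np.zeros([1, 28, 28, 3], dtype="float32")  # [batch size, height, width, channels]
y = conv(x)                                    # calling the layer builds its weights

print(y.shape)            # (1, 28, 28, 32): one feature map per filter
print(conv.kernel.shape)  # (3, 3, 3, 32) = [fh, fw, fn', fn]
```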
Notes:
-One way to view a 3D tensor is to treat it as containing D channels of matrices
-Suppose the input to the l-th layer is an order-3 tensor of size Hl*Wl*Dl; then a convolutional kernel is also an order-3 tensor, of size H*W*Dl, and we overlap the kernel on top of the input tensor starting at spatial location (0,0,0)
Memory requirements:
-The reverse pass of backprop requires all the intermediates computed during the forward pass
-A conv layer with 5*5 kernels outputting 200 feature maps,
Pooling layer:
-The goal of pooling is to subsample (shrink) the input image in order to reduce the computational load, memory usage, and the number of parameters, thereby limiting the risk of overfitting
-Just like in a convolutional layer, each neuron in a pooling layer is connected to the outputs of a limited number of neurons in the previous layer, located within a small rectangular field
-Pooling has no weights; it just calculates the average or max over the rectangular window and outputs the same number of channels
-Max pooling introduces some level of invariance to small translations; moreover, it also offers a small amount of rotational invariance and a slight scale invariance
-The downside is that it is very destructive, and in some applications invariance is not desirable; for example, in semantic segmentation (the task of classifying each pixel in an image depending on the object that pixel belongs to), if the input image is translated by 1 pixel to the right, the output should also be translated by 1 pixel to the right; the goal here is equivariance, not invariance: a small change to the inputs should lead to a corresponding small change in the output
-Although max pooling loses information, it preserves only the strongest features, getting rid of all the meaningless ones
-Pooling could also be done depthwise rather than spatially, though that is not so popular
-One last type of pooling layer is the global average pooling layer
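Since pooling has no weights, the whole operation can be sketched in plain NumPy; a 2x2 max pool with stride 2 on a single-channel image:

```python
import numpy as np

def max_pool_2x2(image):
    """2x2 max pooling with stride 2 on an [H, W] array (H, W even)."""
    h, w = image.shape
    # group pixels into non-overlapping 2x2 windows, then take each window's max
    return image.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = np.array([[1, 2, 5, 6],
                [3, 4, 7, 8],
                [9, 1, 2, 3],
                [5, 6, 4, 0]])
print(max_pool_2x2(img))
# [[4 8]
#  [9 4]]
```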
CNN Architectures:
-Too many hyperparameters to tune; what to do?
-In CNNs the image gets smaller and smaller as it progresses through the network, but it also typically gets deeper and deeper (i.e., more feature maps), thanks to the convolutional layers
-At the top, a regular feedforward network is added, composed of a few fully connected layers, and the final layer outputs the prediction
-A common mistake is to use convolutional kernels that are too large; for example, instead of using a convolutional layer with a 5*5 kernel, it is generally preferable to stack two layers with 3*3 kernels, which will use fewer parameters, require fewer computations, and usually perform better
-One exception is the first layer, which can typically have a large kernel, usually with a stride of 2 or more, to reduce the spatial dimension of the image without losing too much information; since the input image generally has only 3 channels, it will not be too costly
State of the art:
1. Alexnet:
-Stacked convolutional layers directly on top of each other, instead of stacking a pooling layer on top of each convolutional layer
-To reduce overfitting, the authors used two regularization techniques: first, they applied dropout with a 50% dropout rate during training to the outputs of layers F8 and F9
-Second, they performed data augmentation by randomly shifting the training images by various offsets, flipping them horizontally, and changing the lighting conditions
-Data augmentation increases the size of the training set by generating many realistic variants of each training instance
-This reduces overfitting, making it a regularization technique
-Generated instances must be as realistic as possible and learnable (unlike, say, white noise)
-AlexNet also uses a competitive normalization step immediately after the ReLU of layers C1 and C3, called local response normalization
-The most strongly activated neurons inhibit/hinder other neurons located at the same position in neighboring feature maps (such competition has been observed in biological neurons); this encourages different feature maps to specialize, pushing them apart and forcing them to explore a wider range of features, ultimately improving generalization
-...
-A variant of AlexNet called ZFNet was developed with a few tweaked hyperparameters
2. GoogLeNet:
-Winner of 2014; much deeper than previous CNNs, made possible by subnetworks called inception modules, which allow GoogLeNet to use parameters much more efficiently than previous architectures: GoogLeNet actually has 10 times fewer parameters than AlexNet (roughly 6 million instead of 60 million)
-An inception module depth-concatenates four paths:
1. Convolution of 1*1 with stride 1 and same padding
2. Convolution of 1*1 with stride 1 and same padding + Convolution of 3*3 with stride 1 and same padding
3. Convolution of 1*1 with stride 1 and same padding + Convolution of 5*5 with stride 1 and same padding
4. Max pooling of 3*3 with stride 1 and same padding + Convolution of 1*1 with stride 1 and same padding
-Every single layer uses a stride of 1 and SAME padding, even the max pooling layer, so their outputs all have the same height and width as their inputs; this makes it possible to concatenate all the outputs along the depth dimension in the final depth-concat layer (i.e., stack the feature maps from all four top convolutional layers)
-The 1*1 convolutional layers don't really capture any spatial features, as they look at only one pixel at a time, but they serve the following purposes:
>They cannot capture spatial patterns but can capture patterns along the depth dimension
>They are configured to output fewer feature maps than their inputs, so they serve as bottleneck layers, meaning they reduce dimensionality; this cuts the computational cost and the number of parameters, speeding up training and improving generalization
>Each pair of convolutional layers (1*1 + 3*3, and 1*1 + 5*5) acts like a single powerful convolutional layer capable of capturing more complex patterns; indeed, instead of sweeping a simple linear classifier across the image (as a single convolutional layer does), this pair of layers sweeps a two-layer neural network across the image
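The four paths above can be sketched with the Keras functional API (assuming tensorflow.keras; the filter counts 16/24 are illustrative, not GoogLeNet's actual ones):

```python
from tensorflow import keras

def inception_module(inputs):
    # path 1: 1x1 conv
    conv1 = keras.layers.Conv2D(16, 1, strides=1, padding="same", activation="relu")(inputs)
    # path 2: 1x1 bottleneck then 3x3 conv
    conv3 = keras.layers.Conv2D(16, 1, strides=1, padding="same", activation="relu")(inputs)
    conv3 = keras.layers.Conv2D(24, 3, strides=1, padding="same", activation="relu")(conv3)
    # path 3: 1x1 bottleneck then 5x5 conv
    conv5 = keras.layers.Conv2D(16, 1, strides=1, padding="same", activation="relu")(inputs)
    conv5 = keras.layers.Conv2D(24, 5, strides=1, padding="same", activation="relu")(conv5)
    # path 4: 3x3 max pool then 1x1 conv
    pool = keras.layers.MaxPool2D(3, strides=1, padding="same")(inputs)
    pool = keras.layers.Conv2D(16, 1, strides=1, padding="same", activation="relu")(pool)
    # stride 1 + same padding everywhere, so all four outputs share the same
    # height and width and can be stacked along the depth dimension
    return keras.layers.Concatenate(axis=-1)([conv1, conv3, conv5, pool])

inputs = keras.layers.Input(shape=[32, 32, 3])
outputs = inception_module(inputs)
print(outputs.shape)  # 16 + 24 + 24 + 16 = 80 feature maps
```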
...
-Some deeper details here
3. Resnet:
-Top-5 error rate under 3.6%, using an extremely deep CNN composed of 152 layers
-The key to training such a deep network is to use skip connections: the signal feeding into a layer is also added to the output of a layer located a bit higher up the stack
-When training a neural net, the goal is to make it model a target function h(x); if you add the input x to the output of the network, then the network is forced to model f(x) = h(x) - x rather than h(x); this is called residual learning
-When you initialize a regular neural network, its weights are close to zero, so the network just outputs values close to zero; if you add a skip connection, the resulting network just outputs a copy of its inputs; in other words, it initially models the identity function
-If the target function is fairly close to the identity function (which is often the case), this will speed up training considerably
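A minimal sketch of one residual unit (assuming tensorflow.keras; the filter count and input shape are illustrative, and the full ResNet also uses batch norm, which is omitted here):

```python
from tensorflow import keras

def residual_unit(inputs, filters):
    z = keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(inputs)
    z = keras.layers.Conv2D(filters, 3, padding="same")(z)
    # skip connection: add the raw input back, so these two conv layers
    # only need to model the residual f(x) = h(x) - x
    z = keras.layers.Add()([z, inputs])
    return keras.layers.Activation("relu")(z)

inputs = keras.layers.Input(shape=[32, 32, 64])
outputs = residual_unit(inputs, 64)
print(outputs.shape)
```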
4. Xception:
-Another net, trained for 17k classes; it merges the ideas of GoogLeNet and ResNet but replaces the inception modules with a special type of layer called a depthwise separable convolution
5. CuImage: an ensemble of many techniques, quite complex; winner of 2016
6. SENet:
-The simpler winner of 2017, called the Squeeze-and-Excitation Network
-It extends existing networks (e.g., AlexNet and ResNet) and boosts their performance
IMPLEMENT RESNET IN KERAS:
Using pretrained model (code in notebook)
-Use resnet from keras.applications
-Even if your images do not belong to ImageNet's classes, you can still benefit from it
Transfer learning:
-If you want to build an image classifier but don't have enough training data, it is often a good idea to reuse the lower layers of a pretrained model
-There is a GPU requirement
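A minimal transfer-learning sketch (assuming tensorflow.keras and a hypothetical 5-class task; weights=None here just to avoid a download, whereas weights="imagenet" would fetch the pretrained weights):

```python
from tensorflow import keras

# reuse a pretrained base without its classification head
base = keras.applications.ResNet50(weights=None, include_top=False,
                                   input_shape=[224, 224, 3])
base.trainable = False  # freeze the reused lower layers

model = keras.models.Sequential([
    base,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(5, activation="softmax"),  # new task-specific head
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd")
```

After the new head has trained for a few epochs, you can unfreeze some of the top layers of the base and fine-tune with a low learning rate.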
Classification and localization:
-Localizing an object in a picture can be expressed as a regression task: to predict a bounding box around the object, a common approach is to predict the horizontal and vertical coordinates of the object's center, as well as its height and width; that means 4 numbers to predict
-They say it does not require much change to the flower classification model done previously
-An example is given where Xception's lower layers are used to predict both the class and the box around the flower
-They say to use tools for annotating the boxes around the flowers; the box's 4 values are penalized by MSE, but there is a better metric called intersection over union (IoU): the area of overlap between the predicted bounding box and the target bounding box, divided by the area of their union
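The IoU definition above is a one-liner to compute (my own helper, boxes as (x1, y1, x2, y2) corner coordinates):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # corners of the overlap rectangle
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)   # 0 if the boxes don't overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.143
```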
-What if images contain multiple objects?
-This task of localizing multiple objects and predicting their classes is called object detection
-The old way was to take a CNN trained to classify and locate a single object, then slide it across the image
-The image is chopped into a grid of boxes, say 8*8; then the CNN slides across all 3*3 regions; it would detect the same flower multiple times in different locations (even if only half of a flower falls within its 3*3 region, it still detects it)
-After finishing the 3*3 regions, we also need to move on to other, larger regions
-Since it detects the same object and makes bounding boxes at probably different positions, some post-processing will be needed to get rid of the unnecessary bounding boxes; a common approach is non-max suppression:
1. First you need to add an extra objectness output to your CNN, to estimate the probability that a flower is indeed present in the image; alternatively, you could add a no-flower class, but this usually does not work as well (in the class-prediction case we already assume a flower is present); use a sigmoid and train it with the binary_crossentropy loss, then get rid of all boxes for which the objectness is below some threshold; this will drop all bounding boxes that don't actually contain a flower
2. Second, find the bounding box with the highest objectness score and get rid of all the other bounding boxes that overlap a lot with it (e.g., with an IoU greater than 60%)
3. Third, repeat step 2 until there are no more bounding boxes to get rid of
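Steps 2 and 3 can be sketched directly (my own helper; boxes as (x1, y1, x2, y2), with a small inline IoU so the block is self-contained):

```python
def nms(boxes, scores, iou_threshold=0.6):
    """Return indices of boxes that survive non-max suppression.
    boxes: list of (x1, y1, x2, y2); scores: objectness per box."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)   # highest remaining objectness score
        keep.append(best)
        # drop every remaining box that overlaps the best one too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
print(nms(boxes, scores=[0.9, 0.8, 0.7]))  # [0, 2]: box 1 overlaps box 0 too much
```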
-This works pretty well but requires running the CNN many times, so it is quite slow; fortunately, there is a much faster way to slide a CNN across an image: using a fully convolutional network
(for the bounding box we already predict 4 coordinates; the only thing left is preparing the dataset)
Next is segmentation now:
FCNs:
-Not every task is about drawing boxes on images
-The key idea is that you can replace the dense layers at the top of a CNN with convolutional layers: if a 200-neuron dense layer sits on top of a convolutional layer that outputs 100 feature maps, each of size 7*7, then replace it with a convolutional layer using 200 filters, each 7*7, with valid padding; this will output 200 feature maps, each 1*1
-The output shape differs from that of the dense layer
-While a dense layer expects a specific input size (since it has one weight per input feature), a convolutional layer will happily process images of any size; however, it does expect its inputs to have a specific number of channels, since each kernel contains a different set of weights for each input channel
-Since an FCN contains only convolutional and pooling layers, it can be trained and executed on images of any size
-Example: suppose we already trained a CNN for flower classification and localization; it was trained on 224*224 images and outputs 10 numbers:
>outputs 0 to 4 are sent through softmax and are the class probabilities
>output 5 is sent through the logistic function and is the objectness score
>outputs 6 to 9 do not use any activation function and represent the bounding box's center coordinates and its height and width
-We can convert its dense layers to convolutional layers by just copying the weights from the dense layers to the convolutional layers; alternatively, we could have converted the CNN into an FCN before training
-(An example is given in the book)
-YOLO is a very fast detector that can be used on videos, with an explanation given
-R-CNN, Fast R-CNN, and others also exist; I don't know them yet
Semantic segmentation:
-UNet?
Data analysis:
1. NumPy:
-ndarray.shape; ndarray.dtype
Creating:
-np.array converts a sequence type to an ndarray; it either infers a dtype or you specify one; it copies the input data by default
-np.arange is range for NumPy arrays
-ones, zeros, ones_like, zeros_like: given a shape and dtype, produce an ndarray, or one like the given ndarray
-eye, identity: create a square identity matrix
Casting:
-arr.astype(...) copies the array
Swapping axes:
-arr.reshape(...) returns a view on the underlying data without copying anything (when possible)
-arr.T transposes the array, a special kind of reshaping (also a view)
-arr.flatten() returns a 1D copy of the array
1. Vectorized operations:
-Operations between same-sized arrays are elementwise
-Array-with-scalar operations (operators and ufuncs) propagate the scalar
-When operating on two arrays, NumPy compares their shapes element-wise, starting with the trailing dimension and working its way left; two dimensions are compatible when:
>they are equal, or
>one of them is 1
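The broadcasting rules above in action (toy arrays of my own):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # shape (2, 3)
row = np.array([10, 20, 30])     # shape (3,)   -> broadcasts over rows
col = np.array([[100], [200]])   # shape (2, 1) -> broadcasts over columns

# (2, 3) vs (3,): trailing dims 3 == 3, missing dim is treated as 1
print(a + row)   # [[10 21 32], [13 24 35]]

# (2, 3) vs (2, 1): the 1 is stretched to 3
print(a + col)   # [[100 101 102], [203 204 205]]
```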
2. Sorting:
-arr.sort() sorts in place along the last axis (each row, for a 2D array)
3. Indexing and slicing
-Single value indices: arr[x]
-Multiple value indices: arr[[...]]
-Some range type of incides: arr[a:b]
-Boolean-mask indices: arr[mask]
Note:
-Slices are views on the original array, meaning that any modifications to the view will be reflected in the source array (fancy and boolean indexing return copies instead)
-If you assign a scalar value to a slice, the value is propagated to the entire selection
4. Stats:
-The following, if no axis is provided, compute over the flattened array:
arr.min()
arr.max()
arr.sum()
arr.mean()
arr.std() and arr.var()
arr.argmin()
arr.argmax()
arr.cumsum()
arr.cumprod()
For booleans:
bools.any()
bools.all()
5. Linear algebra:
np.diag(x)
np.outer(x,x)
np.inner(x,x)
m@m
combines consecutive vector space transformations
np.linalg.matrix_power(m,n)
combines transformations to the nth power of self
np.linalg.inv(m)
reverses a square transformation if no information was lost
np.linalg.matrix_rank(m)
is the dimension on the resulting vector space where trans lives
np.linalg.det(m)
is the scaling factor of hypercube for square transformations
q,u = np.linalg.qr(m)
qr decomposes square m into QR, where Q is orthogonal (so Q.T is the inverse of Q) and the second factor (u here) is upper triangular: m = q@u
a, q = np.linalg.eig(m)
np.linalg.eigvals(m)
eig decomposes square m into Q A inv(Q), where Q is a transformation whose columns are the eigenvectors (note that eig returns the eigenvalues first): m = q@diag(a)@inv(q)
a, q = np.linalg.eigh(m)
np.linalg.eigvalsh(m)
special case for symmetric or Hermitian matrices, where Q is orthogonal and thus inv(Q) = Q.T: m = q@diag(a)@q.T
u, s, vh = np.linalg.svd(m)
svd decomposes m into U S V.T, where U and V come from the eigendecompositions of the symmetric matrices m@m.T and m.T@m: m = u@diag(s)@vh
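The decompositions above can be verified numerically on a toy symmetric matrix:

```python
import numpy as np

m = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # symmetric, so eigh applies

a, q = np.linalg.eigh(m)     # eigenvalues (ascending) and orthonormal eigenvectors
# reconstruct: m = Q diag(a) Q.T for a symmetric matrix
assert np.allclose(q @ np.diag(a) @ q.T, m)

u, s, vh = np.linalg.svd(m)
assert np.allclose(u @ np.diag(s) @ vh, m)

print(a)   # eigenvalues of [[2,1],[1,2]] are 1 and 3
```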
2. Pandas:
>Series:
-is a one-dimensional array-like object containing an array of data and an associated array of data labels called its index
-Can be made from:
>a list, with default indices or given indices
>a dictionary
>DataFrame:
-represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type
-Can be made from 2d lists and equivalents
>Offers dtypes such as int, float, object, category, datetime, etc.
Properties for both:
-Open the csv/tsv or whatever files as:
df = pd.read_csv('path', sep='\t')
-df.dtypes is a Series of column_name:dtype
-df.head(), df.tail(), df.info()
-df.columns, df.index is an Index which is a list-type
-df.values is a multidimensional ndarray where each row is the data, excluding the index
-df['column_name'] or df.column_name is a Series of index:values
-df[['col1', 'col2', ...]] is a DataFrame type of index:columns
-df.loc[["Maine", "Alaska"], ["deaths", "size", "deserters"]]
-df.iloc[2:7, 2:6] and can do other form of indexing or slicing available
>The loc/iloc return a Series or DataFrame accordingly
>To add a new column to the df just
df["new"] = ...
>Can apply functions to a column that return single values:
max()
min()
sum()
mean()
>Special functions, returning a Series or similar:
value_counts()
describe()
isnull()
sort_values(['col1', 'col2', ...], ascending=False)
>Others:
df['col'].str is the vectorized string accessor
df['col'].str.contains('...')
>Apply a lambda function to a column as:
col.apply(lambda x: float(x[:]))
>Grouping by:
df.groupby('column_name').sum()
t = t.sort_values(['column_name'], ascending=False)
t.head(5)
>Masking:
df[mask], where the boolean mask comes from DataFrame comparisons
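Putting masking, groupby, and sorting together on a tiny made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["Maine", "Maine", "Alaska", "Alaska"],
    "deaths": [10, 20, 5, 15],
})

# boolean mask from a comparison
print(df[df["deaths"] > 10])

# group by a column, aggregate, then sort
totals = df.groupby("state")["deaths"].sum().sort_values(ascending=False)
print(totals)   # Maine 30, Alaska 20
```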
3. Matplotlib:
-Plots in matplotlib reside within a Figure object; you can create one with plt.figure()
-Can't make a plot with a blank figure; you must create one or more subplots using add_subplot (or plt.subplots):
fig, ax = plt.subplots(2,10, figsize=(20,5))
-returns a figure object and a 2*10 array of subplot axes objects
-the ax[0,0].plot-type calls all happen on these axes
1. ax.plot(x, y): joins the points with lines
2. ax.hist(x, bins=10): for a categories-vs-frequency graph
3. ax.scatter(x, y, s=sizes, c=colors): all arguments are lists and it makes a 2D graph
X, Y = np.meshgrid(arr, arr): makes coordinate matrices from coordinate vectors
Z = X**2 + Y**2
ax.imshow(Z)
-Dataframe plots:
df.plot()
df.plot(kind='hist')
df.plot(kind='scatter')
-Another way of subplotting is:
plt.figure(figsize=(10,20))
plt.subplots_adjust(wspace=0, hspace=0) # no space between the subplots
plt.subplot(nrows, ncols, index)
4. PyTorch:
-See the movement for the MNIST one:
1. Get the train and validset:
transform = transforms.Compose(
[transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))]
)
trainset = datasets.MNIST('./trainset', download=True, train=True, transform=transform)
validset = datasets.MNIST('./validset', download=True, train=False, transform=transform)
(torchvision package has datasets and transforms classes used here)
-I think the general idea is that before making a trainset/validset we use transforms that do reshaping, augmentation, and so on
-You can make your own dataset in PyTorch as follows:
>Need to create a class that inherits from the torch.utils.data.Dataset class and implements the following methods:
__init__: initialize the dataset and load the necessary data, usually pass datapath and transforms here
__len__: returns the total number of samples in the dataset
__getitem__: retrieves a specific sample (data, label) from the dataset at a given index; loads the sample data and returns it
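A minimal custom Dataset following these three methods (a hypothetical toy dataset of my own, where item i is the float i with label i squared):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SquaresDataset(Dataset):
    """Toy map-style dataset: item i is (tensor([i]), i**2)."""
    def __init__(self, n, transform=None):
        # initialize the dataset; usually you'd pass a data path here
        self.n = n
        self.transform = transform

    def __len__(self):
        # total number of samples
        return self.n

    def __getitem__(self, idx):
        # load and (optionally) transform one sample, return (data, label)
        x = torch.tensor([float(idx)])
        if self.transform is not None:
            x = self.transform(x)
        return x, idx ** 2

ds = SquaresDataset(10)
print(len(ds), ds[3])   # 10 (tensor([3.]), 9)
loader = DataLoader(ds, batch_size=4, shuffle=False)
```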
-torchvision provides some transforms; there is also albumentations, made for image classification and segmentation
transforms = A.Compose([
A.Resize(width = PATCH_SIZE, height = PATCH_SIZE, p=1.0),
A.HorizontalFlip(p=0.5),
A.VerticalFlip(p=0.5),
A.RandomRotate90(p=0.5),
A.Transpose(p=0.5),
A.ShiftScaleRotate(shift_limit=0.01, scale_limit=0.05, rotate_limit=0, p=0.25),
A.Normalize(p=1.0),
ToTensorV2(),
])
-Now to do transformations
augmented = transforms(image=image, mask=mask)
image = augmented['image']
mask = augmented['mask']
return image, mask
2. Make the loaders as:
trainloader = torch.utils.data.DataLoader(trainset, batch_size=32, shuffle=True)
validloader = torch.utils.data.DataLoader(validset, batch_size=32, shuffle=True)
3. Visualize your batch and labels:
for batch, labels in trainloader:
print(batch.shape, labels.shape)
plt.imshow(batch[0].numpy().squeeze(), cmap='gray_r')
break
4. Sequential models as follows:
input_size = 784
layer1 = 128
layer2 = 64
output_size = 10
model = nn.Sequential(
nn.Linear(input_size, layer1),
nn.ReLU(),
nn.Linear(layer1, layer2),
nn.ReLU(),
nn.Linear(layer2, output_size),
nn.LogSoftmax(dim=1)
)
#Notice that our model takes in input of size 784 which is 28*28
-To create your own custom inherit the nn.Module class as:
class MyNet(nn.Module):
def __init__(self): #define the layers here
super().__init__()
...
def forward(self, x):
#how input x flows through the layers
5.
criterion = nn.NLLLoss()  # the model ends with LogSoftmax, so use NLLLoss (CrossEntropyLoss expects raw logits)
images, labels = next(iter(trainloader))
images = images.view(images.shape[0], -1)
predictions = model(images)
criterion(predictions, labels)
6.
images, labels = next(iter(trainloader))
images = images.view(images.shape[0], -1)
predictions = model(images)
loss = criterion(predictions, labels)
loss.backward()
7.
optimizer = optim.SGD(model.parameters(), lr=0.001)
epochs = 5
for e in range(epochs):
running_loss = 0
for images, labels in trainloader:
# Flatten MNIST images into a 784 long vector
images = images.view(images.shape[0], -1)
# Training pass
optimizer.zero_grad()
output = model(images)
loss = criterion(output, labels)
#This is where the model learns by backpropagating
loss.backward()
#And optimizes its weights here
optimizer.step()
running_loss += loss.item()
else:
print("Epoch {} - Training loss: {}".format(e, running_loss/len(trainloader)))
8.
torch.set_printoptions(precision=4, sci_mode=False)
for batch, labels in validloader:
plt.imshow(batch[0].numpy().squeeze(), cmap='gray_r')
print("Actual label is", labels[0])
img = batch[0].view(1, 784)
logps = model(img)
print(logps)
break
9.
correct_count, all_count = 0, 0
for images, labels in validloader:
for i in range(len(labels)):
img = images[i].view(1, 784)
with torch.no_grad():
logps = model(img)
#logps is log probabilities, so take the exponent:
ps = torch.exp(logps)
#Find the labels by finding the argmax:
probab = list(ps.numpy()[0])
pred_label = probab.index(max(probab))
true_label = labels.numpy()[i]
if(true_label == pred_label):
correct_count += 1
all_count += 1
print("Number Of Images Tested =", all_count)
print("\nModel Accuracy =", (correct_count/all_count))
LET'S GO ONE BY ONE NOW:
5. torch.utils.data.Dataset:
-Indicates a dataset object to load data from; PyTorch supports two different types of datasets:
>map-style datasets
>iterable-style datasets
-All datasets that represent a map from keys to data samples subclass it
-All subclasses must override __getitem__(), supporting fetching a data sample for a given key
-You could just take in idx and return the corresponding (data, label) pair
-Optionally override __len__(), which is expected to return the size of the dataset
-For example, such a dataset, when accessed with dataset[idx], could read the idx-th image and its corresponding label from a folder on the disk
6. torch.utils.data.IterableDataset:
-For when data come from a stream
-Must override __iter__(), which returns an iterator over data samples; the way to implement it is yield from or returning an iterator object that implements __next__
-Particularly suitable for cases where random reads are expensive or even improbable, and where the batch size depends on the fetched data
-For example, such a dataset, when called with iter(dataset), could return a stream of data read from a database or elsewhere
Note: For iterable-style datasets, data loading order is entirely controlled by the user-defined iterable, allowing easier implementation of chunk-reading and dynamic batch sizes (by yielding a batched sample each time)
[One suggestion I found is to use map style whenever possible]
7. torch.utils.data.DataLoader:
-The DataLoader supports automatically collating individual fetched data samples into batches via:
-batch_size, shuffle, drop_last
or,
-batch_sampler (another object that returns a list of keys)
-Now with these attributes we will have a list of indices that will make our batch, but how exactly is a batch made? I mean, a batch is a single tensor, right?
-Automatic batching is the default, which corresponds to fetching a minibatch of data and collating the samples into batches; if you set batch_size=None then automatic batching is disabled (I don't think we generally do that)
-When we enable batching but also specify collate_fn:
-collate_fn merges a list of samples to form a mini-batch of tensors from a map-style dataset (this happens automatically if the elements are all the same size)
-You may use collate_fn to achieve custom batching, i.e., collating along a dimension other than the first, or padding sequences of various lengths
(one example: making text data the same size normally means padding based on the maximum sentence length overall, but with a custom collate we can pad to the maximum sentence within each batch only, saving memory)
[for the usual cases we can just build a dataset using pandas and the like, right]
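The per-batch padding idea above can be sketched with a custom collate_fn (toy token-id sequences of my own, using torch's pad_sequence helper):

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# variable-length "token id" sequences with labels (hypothetical toy data)
data = [(torch.tensor([1, 2, 3]), 0),
        (torch.tensor([4, 5]), 1),
        (torch.tensor([6]), 0),
        (torch.tensor([7, 8, 9, 10]), 1)]

def pad_collate(batch):
    seqs, labels = zip(*batch)
    # pad only up to the longest sequence in THIS batch, not the whole dataset
    padded = pad_sequence(seqs, batch_first=True, padding_value=0)
    return padded, torch.tensor(labels)

loader = DataLoader(data, batch_size=2, collate_fn=pad_collate)
for padded, labels in loader:
    print(padded.shape)   # torch.Size([2, 3]) for the first batch, then torch.Size([2, 4])
```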
TORCHVISION DATASET:
(there are many datasets; many are subclasses of the map-style Dataset; you usually specify a root folder, download, train, etc.)
-Almost similar APIs, with two common arguments: transform and target_transform:
<Will talk utilites here and then go to other helping libraries like albumentations>
ALBUMENTATION TRANSFORMS:
TORCHTEXT DATASET:
(many datasets, each with somewhat similar APIs again; they use something called pipes; main arguments include root='data' and split=('train', 'test'); according to split it returns the train or test set)
-It does not say so directly, but I think it is like an IterableDataset
<will talk helpful stuffs here and then go to third parties such as spacy>
8. torchtext.data.utils.get_tokenizer:
-Can accept 'basic_english' as the argument; I guess that is the only one for now
9. torchtext.vocab.build_vocab_from_iterator:
-Provide an iterator that yields lists of tokens; usually make one from the dataset and the tokenizer we use
-Use specials if you need special symbols, e.g., specials=["<UNK>"]
-Now if you call vocab(list_of_tokens) it will return the indices of the tokens
-If a token is not in the vocab, what index should it return? Set a default with vocab.set_default_index(vocab["<UNK>"])
SPACY TOKENIZER:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)
\newpage (After this is fast ai) \subsection{Fusemachines (and related)}
-
Neural network:
-Talks about computational graphs and related topics
-Activation functions, advantages and disadvantages:
Sigmoid:
+Introduces non-linearity, enabling the network to learn a complex mapping between input and output
+Prevents the activation from blowing up, as it scales any input to a value between 0 and 1
-Suffers sharp damping of the gradient during backprop from deep hidden layers to the input layers, gradient saturation, and slow convergence
-The gradient vanishes when the input is large or small
-The range of its derivative is (0, 1/4], meaning the value while backpropagating will be very small (at maximum 1/4); this results in the vanishing gradient problem if backpropagated through a deep neural network, and hence the network will learn nothing
-Optimization of the network gets harder because the output is not zero-centered; it is centered at 0.5
Hyperbolic tanh:
+Output is zero-centered, as the mapping scales values between -1 and 1
+Mostly used in recurrent neural nets for natural language
ReLU:
+Rectifies input values less than zero, forcing them to zero, thereby eliminating the vanishing gradient problem
+Reduced likelihood of the gradient vanishing; in the region z>0 the gradient has a constant value
+The constant gradient results in faster learning
+Computationally more efficient to compute than sigmoid
+Introduces sparsity in the hidden units, as it squashes values between zero and the maximum
-Tends to blow up activations
-Overfits more easily compared to sigmoid
-Also has the dying ReLU problem: if too many activations fall below zero, most of the neurons will output zero
-Weight initialization: -Can be the difference between the network converging in a reasonable amount of time and the loss function not improving even after hundreds of iterations -If weights are initialized to very small values, their gradients tend to get smaller as we move backward through the layers, causing neurons in earlier layers to learn much more slowly than neurons in later layers; as a result the weight update is negligible: vanishing gradients -If we initialize weights to large values, the gradients get much larger in the earlier layers, causing weights to explode to infinity: exploding gradients
Zeros/ones initialization: Initializing all weights to the same value, whatever it is, causes every weight to be updated by the same amount (the symmetry problem); when initialized to zero, no features can propagate forward through the network, and when initialized to any other single value all the neurons learn the same thing (though features are at least forwarded)
Random initialization: Better; the number of neurons in a layer should be considered while defining the initialization parameters, and the variance of the values initialized as weights should be inversely proportional to the number of neurons in the layer
Xavier initialization: Plain random initialization may cause the network to learn slowly, leading to convergence problems, vanishing and exploding gradients, and instabilities during training (if the weights do not follow a suitable distribution, $z = w \cdot x$ can turn out very large or very small); Xavier tries to make the variance of a layer's output equal to the variance of its input while passing through each layer, by picking weights from a Gaussian distribution with zero mean and variance $1/N$, where $N$ is the number of input neurons (some variants use the average of the input and output neuron counts)
He normal: Extends Xavier for use with ReLU; the variance is $2/N$ because a rectified linear unit is inactive half the time, producing 0, so the weight variance needs to be doubled to keep the signal's variance constant
LeCun: Xavier is actually an extension of LeCun; very similar to He except that LeCun is used with sigmoid, and like He it considers the neurons at the input layer and, unlike Xavier, does not consider the neurons at the output layer
Optimizations: -Know these: batch, stochastic, mini-batch -Having to calculate the derivative with only one random point can prove difficult in the long run as it takes longer to converge to the optimal solution -In stochastic gradient descent, since the points are selected randomly and updates happen frequently, there is high variance, resulting in heavy fluctuation of the optimization path -Finding a suitable learning rate is a challenge; we may also encounter situations where a constant learning rate leads to comparatively slower convergence -Another concern comes with datasets of higher dimensionality, adding to the challenge of escaping local minima; such datasets also come with saddle points
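The initialization schemes above amount to choosing the variance of a zero-mean Gaussian; a minimal sketch (the layer sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 512  # number of input neurons to the layer

# Xavier/Glorot: zero-mean Gaussian with variance 1/fan_in, chosen so the
# variance of activations stays roughly constant from layer to layer.
w_xavier = rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, 256))

# He: variance 2/fan_in, doubling Xavier's because ReLU zeroes roughly
# half of the pre-activations.
w_he = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, 256))

print(w_xavier.var(), w_he.var())  # ~1/512 and ~2/512
```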
Momentum: SGD considers only the current gradient; momentum also considers previously occurring gradients, accumulated exponentially: $m \leftarrow \beta m - \eta \nabla_\theta J(\theta)$, $\theta \leftarrow \theta + m$
Nesterov accelerated gradient: Momentum-based methods are great in gentle regions and move very fast, but have a problem of oscillations; they move according to the history and then a bit more according to the current gradient. Nesterov instead looks slightly ahead: $m \leftarrow \beta m - \eta \nabla_\theta J(\theta + \beta m)$, $\theta \leftarrow \theta + m$
Adaptive gradient: Gradient-based optimization which adapts the learning rate to the parameters, performing smaller updates; the algorithm adaptively scales the learning rate for each dimension, thereby easing the daunting hyperparameter task
RMSProp: Adam:
Normalization: -Image data must be normalized to have the same distribution; the main advantage of standardization is that it is less sensitive to outliers -min-max -batch: a normalization technique in which the output of each neuron in each hidden layer is normalized according to the samples in a mini-batch; this reduces the amount that the values of the hidden layer shift around (at inference time, running estimates of the mini-batch statistics are used instead) -layer: normalization across the features of a layer rather than across the batch
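The momentum update above can be sketched on a toy quadratic objective (the learning rate and beta values are arbitrary choices of mine):

```python
# Minimize J(theta) = theta^2 with momentum SGD, following the update
# m <- beta*m - eta*grad J(theta); theta <- theta + m
def grad(theta):
    return 2.0 * theta  # derivative of theta^2

theta, m = 5.0, 0.0
beta, eta = 0.9, 0.05
for _ in range(200):
    m = beta * m - eta * grad(theta)   # accumulate past gradients exponentially
    theta = theta + m
print(theta)  # ends up near the minimum at 0 (with some oscillation on the way)
```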
Regularization: -One way is the same as in linear models: L1 and L2 penalties -Dropout forces the neurons to learn useful features; apparently it zeroes out certain neurons (their activations are set to zero) during training
-Early stopping: stop training early when validation performance starts to degrade -Data augmentation: increase the training data by reusing the same data with transformations applied (e.g. for images); yes, this counts as a regularization technique -There is also something called weight decay (closely related to L2 regularization)
Hyperparameter tuning: -LR scheduling: common to reduce the LR during training, e.g. every few epochs -You could also search over the LR and other hyperparameters, though this is less common for DL -Another approach is just heuristics
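Dropout as described (zeroing activations) is commonly implemented as "inverted dropout"; a hedged sketch, not any framework's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(a, p=0.5, train=True):
    """Inverted dropout: zero out activations with probability p during
    training and rescale the survivors by 1/(1-p), so no change is
    needed at inference time."""
    if not train:
        return a
    mask = (rng.random(a.shape) >= p).astype(a.dtype)
    return a * mask / (1.0 - p)

a = np.ones((4, 8))
out = dropout(a, p=0.5)
# Surviving units are scaled to 2.0, the dropped ones are exactly 0.
print(out)
```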
-
Convolutional Neural Network -In signal processing you have a signal that changes your original signal -Here we can visualize that signal as a filter and the original signal as our image
(some mathematical formulations)
-If a filter of size $h \times w$ is applied to an image of size $H \times W$ with padding $P$ and strides $(S_h, S_w)$, the output size is:
$W_o = \lfloor (W - w + 2P)/S_w \rfloor + 1$, $H_o = \lfloor (H - h + 2P)/S_h \rfloor + 1$
-In a multi-channel image, the number of output channels depends on the number of kernels; if the input has 3 channels then each filter also needs a depth of 3
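The output-size formula can be wrapped in a small helper to sanity-check common settings (the helper name is mine):

```python
from math import floor

def conv_out(size, kernel, pad=0, stride=1):
    """Output spatial size of a convolution:
    floor((size - kernel + 2*pad) / stride) + 1."""
    return floor((size - kernel + 2 * pad) / stride) + 1

# A 3x3 kernel with stride 1 and padding 1 preserves a 224x224 image:
print(conv_out(224, 3, pad=1, stride=1))  # 224
# The same kernel with stride 2 halves the spatial size:
print(conv_out(224, 3, pad=1, stride=2))  # 112
```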
Properties of CNN: -Local connectivity: reduces the number of parameters -Parameter sharing: the same filter is applied to different portions of the image -Shift invariance: a consequence of parameter sharing
Seminal architectures:
-
Alexnet: -(see the original paper for details)
-
-
RNN (never covered anywhere before, so I will add from other sources as well) -CNN for grids of values while RNN for sequences; scales better than other approaches for this -Just like a CNN can process varying sizes of grids, most RNNs can process variable-length sequences -Take a hidden layer output $h_t$ at some instant $t$; at the next instant the output of this hidden layer will depend on the current input $x_t$ and the earlier hidden output $h_{t-1}$ -i.e. $h_t = f(h_{t-1}, x_t, \theta)$; $\theta$ is presumably just the usual parameters
-(from blogs): -In a normal feedforward net: input → (dense) → activation applied
-People say that RNNs are often simple feedforward networks with an internal state; "hidden state" likely means saving the output from the previous time step
-input → (dense); hidden → (dense); add the two and apply an activation to get the new hidden state -This hidden state then goes to another (dense) layer and finally the real activation -In the initial case the hidden state is zeros -The hidden state cumulatively stores information from all previous iterations
Pytorch version: -Requires two inputs: input and hidden state -Input: batch x sequence x input_size -Hidden: directionality x batch x hidden_size; initialized to zero
-Returns two outputs: the output at each step of the sequence and the final hidden state -Output: batch x sequence x directionality*hidden_size -Final: directionality x batch x hidden_size -In the case of an image of size 28*28, we assume it has 28 sequence steps; at each 28-sized input vector the hidden state is updated and used at the next step -After the 28 steps are finished (meaning the first image is done) the hidden state is reset
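The shapes listed above can be confirmed directly with torch.nn.RNN; a minimal sketch assuming a 28x28 image treated as 28 steps of 28 features:

```python
import torch
import torch.nn as nn

batch, seq_len, input_size, hidden_size = 4, 28, 28, 64
rnn = nn.RNN(input_size, hidden_size, num_layers=1, batch_first=True)

x = torch.randn(batch, seq_len, input_size)  # batch x sequence x input_size
h0 = torch.zeros(1, batch, hidden_size)      # (directions*layers) x batch x hidden_size
out, hn = rnn(x, h0)
print(out.shape)  # [4, 28, 64]: output at every step of the sequence
print(hn.shape)   # [1, 4, 64]: final hidden state
```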
Drawbacks: -Information is not retained in the memory when there is a large number of hidden timesteps (the long-term dependency problem) -I think the reason for this is the vanishing gradient problem -Could be because with a long sequence the continuous updates would eventually die out the neurons
-LSTM saves past hidden states' information for as long as it needs to be stored; it is just an improvement of the RNN -Now alongside the hidden state we also maintain a cell state, which does not get matrix-multiplied, so its gradient neither explodes nor dies -Previously, at each input step we took $x_t$ and $h_{t-1}$, matrix-multiplied them, added and activated to get our hidden state -Now we take them and do the same but to find the forget gate, i.e.
$f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf})$; we will forget: $f_t \odot c_{t-1}$ -(Now that we have forgotten some part of the long-term memory, it is time to create a new part from the current inputs)
$g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg})$ (this is our newly created candidate long-term memory) -Also decide what fraction of it to use:
$i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi})$; we will add $g_t \odot i_t$ -Hence our new long-term memory is:
$c_t = f_t \odot c_{t-1} + i_t \odot g_t$ -Finally, for our next hidden state we have:
$o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho})$, $h_t = o_t \odot \tanh(c_t)$ -Hence we have our final hidden state; the cell state is the long-term memory while the hidden state is the short-term memory -The use of $\sigma$ is to produce the fraction to retain or to forget
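The gate equations above can be sketched in plain NumPy for one time step; here the four gate weight matrices are stacked into a single W in f, g, i, o order (a layout choice of mine, not necessarily PyTorch's):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step following the gate equations above.
    W maps the concatenation [x; h_prev] to the four stacked gates."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigma(z[0:H])        # forget gate: fraction of c_prev to keep
    g = np.tanh(z[H:2*H])    # candidate long-term memory
    i = sigma(z[2*H:3*H])    # input gate: fraction of g to add
    o = sigma(z[3*H:4*H])    # output gate
    c = f * c_prev + i * g   # new cell state (long-term memory)
    h = o * np.tanh(c)       # new hidden state (short-term memory)
    return h, c

rng = np.random.default_rng(0)
H, D = 8, 4
W = rng.normal(size=(4 * H, D + H)) * 0.1
b = np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
print(h.shape, c.shape)  # (8,) (8,)
```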
-
CNN + RNN: -Image captioning (CNN as encoder and RNN as decoder)
Word embeddings: -To give nearby numbers to similar words; and because the same word can be used in different contexts, made plural, or used in some other way, it might be nice to assign each word more than one number, so that the NN can more easily adjust to different contexts
-
One Hot Encoding: You have a vocab of words, and now each word is a vector with a 1 in its vocab position and zeros elsewhere
-
Bag of words: Now one sentence is one vector: you count the occurrences of each word in your vocab
-
N-gram: Now you make groups of words and count their occurrences
-
TF-IDF: term frequency weighted by inverse document frequency, down-weighting words that appear in most documents
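The bag-of-words and TF-IDF ideas above, in a tiny pure-Python sketch (this uses one common TF-IDF variant; library implementations add smoothing):

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab = sorted({w for d in docs for w in d.split()})

# Bag of words: one count vector per document
bow = [[Counter(d.split())[w] for w in vocab] for d in docs]

# TF-IDF: term frequency * log(N / document frequency), so words that
# appear in every document ("the") are weighted down to zero.
N = len(docs)
df = {w: sum(w in d.split() for d in docs) for w in vocab}
tfidf = [[tf * math.log(N / df[w]) for tf, w in zip(row, vocab)] for row in bow]
print(vocab)
print(tfidf[0])
```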
(these are like making your own kernels in CNN)
Embeddings: -Better to check my code, where I converted the titles into 25-length token sequences
-
-
Encoder Decoder Architecture -Sequence of one type to another: seq to seq; one way to solve this is an encoder and decoder model -Ex: language translator -Need something that can take sentences of different lengths and also return different lengths: variable input and variable output
The encoder is the usual RNN/LSTM that takes in a sequence and produces a final context vector; in this case the sequence is terminated with an end-of-sequence token
The output of this encoder is used to initialize the decoder model; e.g. the context vector goes into the cell state or similar. Then for the first timestep we feed a start-of-sequence token into the decoder and it outputs the first word of the sentence; now we feed it the first word it gave us to get the next word, and so on, until it returns an end-of-sequence token. Ex: seq2seq as a language translator:
Encoder: (takes in batch of sentences: tokens of one sentence all go into the sequence) -embedding -lstm -returns outputs, (hidden, cell)
Decoder: (takes in batch of sentences: tokens of one sentence go one by one) -embedding -lstm -one linear layer to map into vocab size, so the current input token can be mapped into the probability of being each token in the vocabulary of the destination language -returns outputs, (hidden, cell)
(the output we keep for ourselves; hidden and cell are fed back into the layer) -Depending on teacher forcing we feed as the next input either the current output we just received (i.e. the prediction of which vocab token it is) or the actual token the next one should be
Drawbacks of vanilla seq2seq: -Basically it cannot handle long sentences
Attention mechanism: -In long sentences a single context vector cannot capture the meaning of the sentence properly, so instead of taking only the final hidden state of the encoder, we take a weighted sum of all the hidden states from the encoder and use that as the context vector, i.e. provide attention to all states -So the idea of an attention mechanism is that it avoids a single-vector representation of the input sequence; instead, it pays attention to particular input vectors on the basis of attention weights -If we simply added the states, the later words in the sequence would have a higher impact on the context vector than the words that appeared at the beginning of the sequence -We keep the outputs of all the encoder hidden states as the Encoder Output; since we have the encoder output at each hidden state, we can tell the decoder where to focus in the next steps -At each timestep the decoder raises a query to the Encoder Output block, which is the independent representation or concatenation of all the hidden states -So (my guess) is that we have a new layer that takes in the concatenated encoder outputs (at each time step, of course), transforms them to the size of the decoder output at any time step, and then we take the dot product -This operation is called attention because the aim of our query is to figure out which hidden-state output we should focus on or give attention to, i.e. we generate attention weights
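The guessed dot-product step above can be sketched directly: attention weights are a softmax over query-state dot products, and the context vector is the weighted sum of encoder states (all sizes here are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, H = 5, 16                          # 5 encoder time steps, hidden size 16
enc_states = rng.normal(size=(T, H))  # all encoder hidden states, not just the last
query = rng.normal(size=H)            # decoder state at the current step

# Attention weights: softmax over the query/encoder-state dot products,
# telling the decoder which encoder steps to focus on.
weights = softmax(enc_states @ query)
context = weights @ enc_states        # weighted sum replaces the single context vector
print(weights.sum())   # 1.0: weights form a distribution over the 5 steps
print(context.shape)   # (16,)
```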
(there are various attention mechanisms, differing in how the attention scores are computed, e.g. additive vs dot-product)
LSTM and convolutional done: (check the notebooks)
\subsection{Encoder} \begin{enumerate} \item (Inside the overall encoder): Does not attempt to compress the entire source sentence into a single context vector; instead it produces a sequence of context vectors equal to the number of tokens \item Standard token embedding and positional embedding \item Token embeddings are multiplied by $\sqrt{d}$, where $d$ is the hidden dimension size, which reduces the variance in the embeddings \item The combined embeddings are then passed through $N$ encoder layers to get $Z$, which can then be used by the decoder \item Along with the embeddings we also send a source mask to the encoder layers, which has the same shape as the source sentence but has a value of 1 when the token in the sentence is not a pad token and 0 when it is \item (Inside the encoder layers): First pass the source and its mask into the multi-head attention layer, apply a residual connection, and pass it through a layer normalization layer \item Then pass it through a position-wise feedforward layer, apply a residual connection, and then layer normalization \item (Inside the multi-head attention): Attention can be thought of in terms of queries, keys and values, where the query is used with the key to get an attention vector (usually the output of a softmax operation, with values between 0 and 1 that sum to 1), which is then used to get a weighted sum of the values; the scaling is used to stop the results of the dot products from growing large, causing gradients to become too small \item The queries, keys and values are of shape [batchsize, querylen/keyslen/valueslen, hiddim]; instead of doing a single attention application, the Q, K and V have their hiddim split into $h$ heads, the scaled dot product is applied to all heads in parallel, and the heads are then recombined into hiddim \item (Inside the position-wise feedforward): [never explained in the paper] Transforms from hiddim to pfdim, which is usually a lot larger than hiddim, applies ReLU, and transforms back \end{enumerate}
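The head-splitting described above can be sketched shape-by-shape in NumPy (a sketch of scaled dot-product attention, not the tutorial's exact code; masking is omitted):

```python
import numpy as np

batch, seq, hid_dim, n_heads = 2, 7, 64, 8
head_dim = hid_dim // n_heads
rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(batch, seq, hid_dim))

# Split hid_dim into n_heads heads: [batch, heads, seq, head_dim]
def split(x):
    return x.reshape(batch, seq, n_heads, head_dim).transpose(0, 2, 1, 3)

qh, kh, vh = split(q), split(k), split(v)
# Scaled dot-product attention per head; dividing by sqrt(head_dim)
# keeps the logits from growing with dimension.
scores = qh @ kh.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
attn = np.exp(scores - scores.max(-1, keepdims=True))
attn = attn / attn.sum(-1, keepdims=True)   # softmax over the key axis
out = attn @ vh
# Recombine the heads back into hid_dim
out = out.transpose(0, 2, 1, 3).reshape(batch, seq, hid_dim)
print(out.shape)  # (2, 7, 64)
```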
\subsection{Decoder} \begin{enumerate} \item Is similar to the encoder layer except that it now has two multi-head attention layers: self-attention and encoder attention \item The first performs self-attention, as in the encoder, by using the decoder representation so far as the query, key and value, followed by a residual connection and layer normalization. It uses the target sequence mask in order to prevent the decoder from cheating by paying attention to tokens that are ahead of the one it is currently processing, since it processes all tokens in the target sentence in parallel. \item The second is how we actually feed the encoded source sentence into our decoder: here the queries are the decoder representations and the keys and values are the encoder representations. The source mask is used to prevent the multi-head attention layer from attending to pad tokens within the source sentence, and this is followed by the residual connection and layer normalization layers. \end{enumerate}
\newpage FASTAI: >Directly run the following fastai code to see what it does; only then comes the theory part:
from fastai.vision.all import *
path = untar_data(URLs.PETS)/'images'
dls = ImageDataLoaders.from_name_func(
path,
get_image_files(path),
valid_pct=0.2,
seed=42,
label_func=lambda x: x[0].isupper(),
item_tfms=Resize(224)
)
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
1. A dataset called the Oxford-IIIT Pet Dataset, containing 7349 images of cats and dogs from 37 breeds, is downloaded to the GPU server and extracted
2. A pretrained model that has already been trained on 1.3 million images, using a competition-winning architecture, will be downloaded from the internet
3. The pretrained model will be fine-tuned using the latest advances in transfer learning to create a model that is specially customized for recognizing cats and dogs
Note: if you run steps 1 and 2 again it will reuse the downloaded dataset and model but train again on the GPU
What goes under the hood?
-The first line imports all the functions and classes needed to create a variety of computer vision models
-untar_data downloads a standard dataset from the fast.ai datasets collection to your server, extracts it if it hasn't been previously, and returns a Path object with the extracted location
-The third line tells fastai what kind of dataset we have and how it is structured; here it is ImageDataLoaders
>path is where our data is
>get_image_files gets images recursively from the path
>valid_pct is for validation ratio
>seed
>label_func labels each image from its filename: cat filenames start with an uppercase letter
>Resize to 224 pixels is a transform that is applied automatically to each example
Note: The metrics is calculated using validation set and never the training set
-The fourth line creates a vision architecture that takes in the structured dataset, uses an ImageNet-pretrained resnet, and measures quality with error rate on the validation set (which is not the same as the loss function)
-The fifth tells fastai how to fine-tune (not fit) the model by passing the number of epochs to run
>Use one epoch to fit just those parts of the model necessary to get the new random head to work correctly with your dataset
>Use the number of epochs requested when calling the method to fit the entire model, updating the weights of the later layers (especially the head) faster than the earlier layers
>For a one-line definition: instead of telling the computer the exact steps required to solve a problem, show it examples of the problem to solve, and let it figure out how to solve it itself from experience
Feedback loops:
>A critical insight comes from considering how a model interacts with its environment which can create feedback loops:
4. A predictive policing model is created based on where arrests have been made in the past; this predicts arrests rather than actual crimes, so the data is partially biased
5. Officers then use the model to decide where to focus their policing activity, resulting in increased arrests in those areas
6. Data on these additional arrests is then fed back to retrain future versions of the model
Note: This is a positive feedback loop: the more the model is used, the more biased the data becomes, making the model even more biased, and so forth
:Overfitting is the single most important and challenging issue, as it is easy to create a model that does a great job at making predictions on the exact data it has been trained on, but much harder to make accurate predictions on data the model has never seen before:
When does overfitting occur and how to avoid it?
-If you train for too long with not enough data, you will see the accuracy of your model on the validation set start to get worse
-Use avoidance techniques only after you have confirmed that overfitting is occurring, i.e. you have observed the validation accuracy getting worse during training
Blackbox neural nets:
-Research exists showing how to deeply inspect deep learning models and get rich insights from them
Is this for image only?
-A lot of things can be represented as images, which means an image recognizer can learn to complete many tasks; for instance, a sound can be converted to a spectrogram, a chart that shows the amount of each frequency at each time in an audio file
-A time series can easily be converted into an image by simply plotting the time series on a graph
-However it is often a good idea to try to represent your data in a way that makes it as easy as possible to pull out the most important components; in a time series, things like seasonality and anomalies are most likely to be of interest
-Other instances are mouse movements and binary files converted into images
Note: If the human eye can recognize categories from the image, then a deep learning model should be able to do so too
Whats more than classifying images?
-Another important task is localizing objects in a picture, e.g. for autonomous vehicles
-Following is fastai code using a subset of the CamVid dataset of whose flow is same as before:
path = untar_data(URLs.CAMVID_TINY)
dls = SegmentationDataLoaders.from_label_func(
path,
bs=8,
fnames = get_image_files(path/"images"),
label_func = lambda o: path/'labels'/f'{o.stem}_P{o.suffix}',
codes = np.loadtxt(path/'codes.txt', dtype=str)
)
learn = unet_learner(dls, resnet34)
learn.fine_tune(8)
-See the results of the model using:
learn.show_results(max_n=6, figsize=(7,8))
Okay enough about images? any text?
-Can generate text, translate automatically from one language to another, analyze comments and much more
-To train a model that can classify the sentiment of a movie review better than anything that existed in the world five years ago:
-Dataset is IMDb large movie review:
dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test')
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(4, 1e-2)
-Predict the reviews as:
learn.predict("I really liked that movie brother")
I want tabular man?
-Usually no pretrained model is available for this task, so use fit instead of fine-tuning:
-The process is similar; the following code predicts whether a person is a high-income earner based on their socio-economic background:
-The dataset is Adult, from a paper on scaling up the accuracy of Naive Bayes classifiers
from fastai.tabular.all import *
path = untar_data(URLs.ADULT_SAMPLE)
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
cat_names=['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
cont_names=['age', 'fnlwgt', 'education-num'],
procs=[Categorify, FillMissing, Normalize])
learn = tabular_learner(dls, metrics=accuracy)
learn.fit_one_cycle(3)
Note: We had to tell fastai which columns are categorical (contain values that are one of a discrete set of choices, such as occupation) versus continuous (contain a number that represents a quantity, such as age)
One more:
-Recommendation systems are important particularly in ecommerce so following code predicts movies people might like based on their previous viewing habits using the MovieLens dataset:
from fastai.collab import *
path = untar_data(URLs.ML_SAMPLE)
dls = CollabDataLoaders.from_csv(path/'ratings.csv')
learn = collab_learner(dls, y_range=(0.5, 5.5))
learn.fine_tune(5)
learn.show_results()
-Note: we are not using a pretrained model, for the same reason we didn't for the tabular model, but this example uses fine_tune anyway; it's best to experiment with fine_tune vs fit_one_cycle to see which works best for your dataset
VALIDATION SETS:
-If we trained a model with all our data and then evaluated the model using the same data, we would not be able to tell how well our model performs on data it hasn't seen
-Subsequent versions of the model are indirectly shaped by us having seen the validation data; just as the automatic training process is in danger of overfitting the training data, we are in danger of overfitting the validation data through human trial and error and exploration
-The solution is to introduce another level of even more highly held-out data: the test set, which can only be used to evaluate the model at the very end of our efforts
Note:
Test and validation sets should have enough data to ensure that you get a good estimate of your accuracy; if you are creating a cat detector you need at least 30 cats in your validation set, and if you have a dataset with thousands of items, using 20% of it may be more than you need
"A KEY PROPERTY OF THE VALIDATION AND TEST SETS IS THAT THEY MUST BE REPRESENTATIVE OF THE NEW DATA YOU WILL SEE IN THE FUTURE"
1. You want more than to just randomly grab a fraction of your original dataset
2. If you are looking at time-series data, choosing a random subset will be too easy and not representative of most business use cases, so you will want to choose a continuous section instead, for instance the last two weeks or last month of available data
3. Another common case is when you can easily anticipate ways the data you will be making predictions on in production may be qualitatively different from the data you have to train your model with; e.g. the distracted-driver competition, where images of the same person appear in different positions
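The time-series point above can be sketched: hold out the most recent continuous section rather than a random subset (the data here is a stand-in for ordered daily records):

```python
import random

dates = list(range(100))  # stand-in for 100 ordered daily records

# Random split: validation rows are interleaved with training rows,
# so the model effectively peeks at the future; too easy, not representative.
random.seed(42)
rand_valid = set(random.sample(dates, 20))

# Time-based split: hold out the most recent continuous section.
train, valid = dates[:80], dates[80:]
print(max(train) < min(valid))  # True: validation is strictly after training
```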
Production tips:
-Keep an open mind to the possibility that deep learning might solve part of your problem with less data or complexity than you expect
-The goal is not to find the perfect dataset but to get started and iterate from there
-It is best to first find an example online of something that somebody has had good results with and that is at least somewhat similar to what you are trying to achieve, then convert your data into a format similar to what they used, such as creating an image from your data
State of the art:
4. Computer vision
-Major work is around recognition, detection and segmentation
-Generally not good at recognizing images that are significantly different in structure from those used to train the model, e.g. if the training set contains only black-and-white or only hand-drawn images
-There is no general way to check which types of images are missing from your training set
-Labelling can be slow and expensive; one particularly helpful technique is to synthetically generate variations of input images by rotating them or changing their brightness, i.e. data augmentation
5. Text:
-Good at classifying both short and long documents based on categories, sentiment (pos or neg), author, source website, and so forth
-Also good at generating context-appropriate text, such as replies to social media posts, and imitating a particular author's style
-However it is not good at generating correct responses, so we don't have a reliable way to combine a knowledge base of medical info with a deep model for generating medically correct natural language responses
-Many applications exist, such as translation from one language to another, summarizing long documents, finding mentions of a concept of interest, and more
-These could well include completely incorrect information, but the performance is good enough to be used in current systems
6. Combining text and images:
-The ability to combine text and images into a single model is generally far better than most people intuitively expect
-Can train on input images with output captions to automatically generate surprisingly appropriate captions for new images, with no guarantee that these captions will be correct
-So recommended to be used as part of a process in which the model and a human user interact closely
-Text to images?
7. Recommendation systems:
-Are really just a special type of tabular data that generally has a high-cardinality categorical variable representing users and another one representing products or something similar
-As deep models are good at handling high cardinality categorical variables, they are quite good at handling recommendation systems
8. Other data types:
-Other domain-specific data types often fit nicely into existing categories; for instance, protein chains look a lot like natural language documents, in that they are long sequences of discrete tokens with complex relationships and meaning throughout the sequence
-Sounds can be represented as spectrograms, which can be treated as images; standard image approaches turn out to work really well on spectrograms
Approach to deep learning by Jeremy:
9. Start with considering your objective
10. Think about what actions you can take to meet that objective
11. What data you have or can acquire that can help
12. Build a model that you can use to determine the best actions to take to get the best results in terms of your objective
Gathering data:
-For many types of projects you may be able to find all the data you need online through services
from fastcore.all import *
from fastai.vision.all import *
from duckduckgo_search import DDGS
from fastdownload import download_url
#Get urls for your search:
urls = L(DDGS().images(' ... ')).itemgot('image')[:100]
#Check one of the url:
dest = '....jpg'
download_url(urls[0], dest, show_progress=False)
im = Image.open(dest)
im.to_thumb(256 ,256)
#Download all the images:
path = Path('images')
cat = '...'
dest = (path/cat)
dest.mkdir(exist_ok=True, parents=True)
download_images(dest, urls=urls)
#Delete the failed images:
failed = verify_images(get_image_files(path))
failed.map(Path.unlink)
#Make something called datablock structure that contains train and valid sets
data = DataBlock(
#inputs are images and outputs are categories:
blocks=(ImageBlock, CategoryBlock),
#use get_image_files which returns a list of all image files in a path
get_items=get_image_files,
#train and test set
splitter=RandomSplitter(valid_pct=0.2, seed=42),
#labels are just folder names
get_y=parent_label,
#resize the images
item_tfms=[Resize(192)]
)
#Make dataloader out of the datablock
dls = data.dataloaders(path)
dls.show_batch(max_n=6)
#Pretraining ImageNet resnet34 model:
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(3)
#Prediction using the model:
learn.predict(PILImage.create('....jpg'))
Data vs dataloaders:
-Dataloaders is a thin class that just stores whatever DataLoader objects you pass to it and makes them available as train and valid for the models
-To turn our downloaded data into a DataLoaders object we need to tell fastai at least four things
13. What kinds of data we are working with (blocks)
14. How to get the list of items (get_items)
15. How to label these items (get_y)
16. How to create the validation set (splitter)
-Up to now pretraining was eased by data structures that happened to fit those pretrained models; for when that is not available, fastai has an extremely flexible system called the data block API, with which you can customize every stage of the creation of your DataLoaders
-Need to add a transform that will resize these images to the same size, so images can fit into a tensor (item_tfms)
-By default Resize crops the images to fit a square shape of the size requested using the full width or height; or alternatively could squish or stretch them
data = data.new(item_tfms=Resize(192, method='squish'))
-All this seems wasteful; instead, what we normally do in practice is to randomly select part of the image and then crop to just that part
data = data.new(item_tfms=RandomResizedCrop(192, min_scale=0.3))
-The random crop is a specific example of a more general technique called data augmentation, which refers to creating random variations of our input data such that they appear different but do not change the meaning of the data
-For natural images:
data = data.new(item_tfms=Resize(192), batch_tfms=aug_transforms(mult=2))
Using the model to clean the data:
-To visualize the mistakes the model is making, we can create a confusion matrix, which is calculated using the validation set
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
-It is helpful to see exactly where our errors are occurring, to see whether they are due to a dataset problem or a model problem; to do this we can sort our images by their loss, which is highest when the model is incorrect but confident of its incorrect answer, or when it is correct but not confident of its correct answer
(must do this)
interp.plot_top_losses(6)
-The general intuitive approach to data cleaning is to do it before you train a model but a model can help you find issues more quickly and easily so we normally prefer to train a quick and simple model first, and then use it to help us with data cleaning
-fastai includes a handy GUI for data cleaning that allows you to choose a category and the training versus validation set and view the highest loss images in order along with menus to allow images to be selected for removal
from fastai.vision.widgets import ImageClassifierCleaner
cleaner = ImageClassifierCleaner(learn)
cleaner
Note: After cleaning we need to recreate the datablock to retrain the model
for idx in cleaner.delete(): cleaner.fns[idx].unlink()
Turning your model into an online application:
-Once the model is ready, save it so it can be copied to a production server; this is done with the export method:
learn.export()
-To create our inference learner from the exported file, we use load_learner as:
path = Path()
learn = load_learner(path/'export.pkl')
-Can also access the DataLoaders object as an attribute of the learner:
learn.dls.vocab
The notebook app:
-Can create a complete working web app using nothing but Jupyter notebooks, with ipywidgets and Voila
-IPython widgets are GUI components that bring together JavaScript and Python functionality in a web browser, and can be created and used within a Jupyter notebook
-Voila exists to make applications consisting of IPython widgets available to end users, without them having to use Jupyter at all
17. Create a button to upload images for classification:
import ipywidgets as widgets
btn_upload = widgets.FileUpload()
btn_upload
18. Create a button to run inference with the model:
btn_run = widgets.Button(description='Classify')
btn_run
19. Create an output cell for the image being classified:
out_pl = widgets.Output()
out_pl
20. Create a label widget for displaying the probabilities:
lbl_pred = widgets.Label()
lbl_pred
21. Create a function that runs when the Classify button is clicked:
def on_click_classify(change):
    img = PILImage.create(btn_upload.data[-1])
    out_pl.clear_output()
    with out_pl: display(img.to_thumb(128,128))
    pred,pred_idx,probs = learn.predict(img)
    lbl_pred.value = f'Prediction: {pred}; Probability: {probs[pred_idx]:.04f}'

btn_run.on_click(on_click_classify)
-First create an img out of the last uploaded image
-Clear the old output to make room for display
-Make prediction and display the probabilities
22. Can put all the widgets in a vertical box to complete GUI:
VBox([
    widgets.Label('Select your bear!'),
    btn_upload,
    btn_run,
    out_pl,
    lbl_pred
])
-<Hugging face deployment>
-A lot of libraries and frameworks allow you to integrate a model directly into edge devices, but they require extra steps and boilerplate and do not always support all PyTorch and fastai layers
-Instead, wherever possible, deploy the model itself to a server and have your mobile or edge applications connect to it as a web service
What could go wrong now?
-There may be data that our model sees in production that is very different from what it saw during training (out-of-domain data)
-The type of data our model sees may change over time, and the risks it represents may change so much that the original training data is no longer relevant (domain shift)
-Both are examples of a larger problem: you can never fully understand all the possible behaviours of a neural network, because it has far too many parameters
-The model may change the behaviour of the system it is a part of: in the presence of bias, feedback loops can result in negative implications of that bias getting worse and worse
Solutions:
23. The first step is to use an entirely manual process, with your deep learning model running in parallel but not being used directly to drive any actions; the humans involved in the process should look at the model's outputs and check whether they make sense
24. The second step is to limit the scope of the model, keeping it geographically and time-constrained and carefully supervised by people
25. A helpful exercise prior to rolling out a significant machine learning system is to ask: what would happen if it went really, really well? In other words, what if its predictive power was extremely high and its ability to influence behaviour was extremely significant? In that case, who would be most impacted?
FUNDAMENTALS OF VISION:
-MNIST contains images of handwritten digits collected by the National Institute of Standards and Technology and collated into a machine learning dataset by Yann LeCun and his colleagues
from fastai.vision.all import *
-MNIST_SAMPLE contains handwritten 3s and 7s
path = untar_data(URLs.MNIST_SAMPLE)
-See what the path contains
path.ls()
(path/'train').ls()
-Get lists of POSIX paths of the 3s and 7s
threes = (path/'train'/'3').ls()
sevens = (path/'train'/'7').ls()
-Open the path as PIL image and see it
im3_path = threes[0]
im3 = Image.open(im3_path)
im3
-See its sliced array or tensor format
array(im3)[4:10,4:10]
tensor(im3)[4:10,4:10]
-Make a dataframe and view the colored pixel values
im3_t = tensor(im3)
df = pd.DataFrame(im3_t[4:20,4:20])
df.style.background_gradient('Greys')
>First create a list of 2D tensors containing all the images in a directory, then stack them into a single 3D tensor
stacked_sevens = torch.stack([tensor(Image.open(o)) for o in sevens]).float()/255
stacked_threes = torch.stack([tensor(Image.open(o)) for o in threes]).float()/255
PyTorch:
-A PyTorch tensor is nearly the same as a NumPy array, with the additional restriction that it must use a single basic numeric type for all components
-PyTorch tensors have the additional capability that they can live on a GPU, in which case their computation is optimized for the GPU and can run much faster
Stochastic Gradient Descent:
-The pixel-similarity approach has no weight assignments, so we can't really improve it by modifying a set of parameters
-Steps required to turn into a machine learning classifier:
1. Initialize the weights
2. For each image use these weights to predict whether it appears to be a 3 or a 7
3. Based on these predictions calculate how good the model is (its loss)
4. Calculate the gradient which measures for each weight how changing that weight would change the loss
5. Step all the weights based on that calculation
6. Go back to step 2 and repeat the process
7. Iterate until you decide to stop the training process, for instance because the model is good enough or you don't want to wait any longer
>Many ways to do each of these seven steps
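The seven steps above can be sketched on a toy problem without any deep learning library; here the gradient of step 4 is worked out by hand (PyTorch automates this part), and the target function y = 2x is just an illustrative assumption:

```python
# Minimal sketch of the seven-step training loop: fit a single
# weight w so that w*x approximates y = 2*x on three data points.
def train_toy(epochs=50, lr=0.1):
    xs = [1.0, 2.0, 3.0]
    ys = [2.0, 4.0, 6.0]
    w = 0.0                                       # 1. initialize the weight
    for _ in range(epochs):                       # 6-7. repeat until we stop
        preds = [w * x for x in xs]               # 2. predict
        # 3. loss: mean squared error over the dataset
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
        # 4. gradient of the loss with respect to w, derived by hand
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
        w -= lr * grad                            # 5. step the weight
    return w
```

After 50 epochs the weight converges very close to the true value 2.0; the same skeleton underlies the PyTorch version below, with autograd replacing the hand-derived gradient.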
Gradients in PyTorch:
def f(x): return x**2
xt = torch.tensor(3.).requires_grad_()
-This tells PyTorch that we want to calculate gradients with respect to that variable at that value; it essentially tags the variable, so PyTorch will remember to keep track of how to compute gradients of the subsequent calculations on this tensor that you ask for
yt = f(xt)
yt.backward()
xt.grad
Learning rate:
-Too high may diverge or bounce around
-Since gradients differ from function to function, the optimal learning rate also differs; it depends on the function itself
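The effect of the learning rate can be sketched on the toy function f(x) = x**2, whose gradient is 2x; the specific rates 0.1 and 1.1 below are just illustrative picks:

```python
# Gradient descent on f(x) = x**2, starting from x = 3.0.
# The gradient of x**2 is 2*x, so each update is x -= lr * 2*x.
def gd(lr, steps=20, x=3.0):
    for _ in range(steps):
        x -= lr * 2 * x
    return x
```

With gd(0.1) the iterate shrinks steadily toward the minimum at 0; with gd(1.1) every step overshoots so badly that |x| grows each time, i.e. the updates diverge.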
>Make a train set by concatenating and reshaping the 3 and 7 tensors
train_x = torch.cat([stacked_threes, stacked_sevens]).view(-1, 28*28)
train_y = tensor([1]*len(threes) + [0]*len(sevens)).unsqueeze(1)
>Indexing a dataset in pytorch should return (x,y) tuple
dset = list(zip(train_x, train_y))
>Initialization of the parameters:
def init_params(size, std=1.0):
    return (torch.randn(size)*std).requires_grad_()
weights = init_params((28*28,1)) #mind the shape?
bias = init_params(1)
>Find prediction for one image:
(train_x[0]*weights.T).sum() + bias  #torch.dot needs 1-D tensors, and weights is 2-D
>To make prediction for multiple images use matrix multiplication:
def linear1(xb):
    return xb @ weights + bias
preds = linear1(train_x)
-(each batch is computed as batch @ weights + bias)
>Need a loss function to quantify how good our model is doing:
-0/1 loss: convert predictions into the actual classes (3 or 7) directly by thresholding, then take the mean of the correct classifications
>Need something that, when our weights result in slightly better predictions, gives us a slightly better loss
-Distance loss: first convert the raw outputs of the weights into probabilities so they have some meaning, then find the distance between them and the targets:
def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()
    return torch.where(targets==1, 1-predictions, predictions).mean()
>The key difference is that a metric is there to drive human understanding, while the loss is there to drive automated learning; the loss must have a meaningful derivative, so it can't have big flat sections and large jumps but must instead be reasonably smooth
>It is rather a compromise between our real goal and a function that can be optimized using its gradient
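The loss-versus-metric distinction can be illustrated in plain Python with two hypothetical helpers (no PyTorch): mnist_loss_py changes smoothly as predictions improve, while the thresholded accuracy metric is flat almost everywhere and only jumps when a prediction crosses 0.5:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mnist_loss_py(predictions, targets):
    # distance loss: sigmoid squashes raw scores into (0, 1),
    # then measure how far each probability is from its target
    probs = [sigmoid(p) for p in predictions]
    dists = [1 - p if t == 1 else p for p, t in zip(probs, targets)]
    return sum(dists) / len(dists)

def batch_accuracy_py(predictions, targets):
    # metric: threshold at 0.5 and count matches; fine for humans,
    # but its derivative is zero almost everywhere, so useless for SGD
    probs = [sigmoid(p) for p in predictions]
    correct = [(p > 0.5) == (t == 1) for p, t in zip(probs, targets)]
    return sum(correct) / len(correct)
```

For raw scores [2.0, -1.0, 0.5] with targets [1, 0, 0], nudging the third score downward lowers the loss smoothly, but the accuracy only changes once that score's sigmoid drops below 0.5.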
SGD and mini-batches:
-(so we have a target vector and a prediction vector and calculate the loss between them, but how many examples should we use at a time? all of them?)
-Calculating the gradient over the whole dataset would take a long time, while using a single data item would not use much information and would result in an imprecise and unstable gradient
-Make a compromise and choose a batch size
-Another good reason for using mini-batches is that we do our training on an accelerator such as a GPU, which only performs well when it has lots of work to do at a time; but if we give it too much data to work on at once, it runs out of memory
-We also get better generalization if we can vary things during training, which is handled through a class called DataLoader
-Need to randomly shuffle the dataset on each epoch
-A dataloader can take python collection and turn it into an iterator over many batches, like so:
#It reshuffles the data after each epoch (an epoch is one full pass through the dataset)
dl = DataLoader(dset, batch_size=256, shuffle=True)
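A minimal pure-Python sketch of what a dataloader does (fastai's DataLoader offers far more, but the core idea is just shuffling and slicing a collection into batches):

```python
import random

def simple_dataloader(dataset, batch_size, shuffle=True):
    # optionally shuffle the indices, then yield successive
    # fixed-size slices of the collection as batches
    idxs = list(range(len(dataset)))
    if shuffle:
        random.shuffle(idxs)
    for i in range(0, len(idxs), batch_size):
        yield [dataset[j] for j in idxs[i:i + batch_size]]

batches = list(simple_dataloader(list(range(10)), 4, shuffle=False))
# with shuffle=False: [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Note the last batch can be smaller than batch_size; real dataloaders expose a drop_last option for that case.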
>Training one epoch of the model (requires the model, i.e. its function, and its parameters, so we can get their gradients after computing the model's output on each dataloader batch)
def train_epoch(model, params, lr):
    for xb,yb in dl:
        preds = model(xb)
        loss = mnist_loss(preds, yb)
        loss.backward()
        for p in params:
            p.data -= p.grad*lr
            p.grad.zero_()
Summary of the steps:
1. Make train_x and train_y, the tensors of X's and Y's
2. Zip train_x and train_y together to make the dataset object, and also make a dataloader
-----
3. Define a linear model function that takes in multiple X's, and initialize the weight and bias parameters
-----
4. Define a loss function in the same batched way
5. Make a train_epoch that iterates the dataloader to update the params
>Find the loss and run the backward pass to get gradients
>Then update the parameters using the gradients
dataloader -> model -> initialize -> loss -> epoch
(steps 2+3 and 4+5 can more or less be merged, since each pair belongs to the same group)
Creating an optimizer:
-This is a good foundation, and PyTorch provides some useful classes to make it easier to implement
-Can replace steps (2+3) with the nn.Linear module and others; a module is an object of a class that inherits from the nn.Module class
linear_model = nn.Linear(28*28, 1)
-Some used methods are:
>linear_model.parameters()
>linear_model(train_x)
-Steps (4+5) can be eased by creating an optimizer whose main methods are init, step, and zero_grad:
>provide params and lr to init
>after calculating the loss and its gradients, call step
>then call zero_grad to clear the gradients
opt = torch.optim.SGD(linear_model.parameters(), lr)
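A minimal sketch of such an optimizer, mirroring the three methods just listed; the stand-in parameter object is only for illustration (real PyTorch parameters are tensors, and zero_grad there would call grad.zero_() in place):

```python
from types import SimpleNamespace

class BasicOptim:
    # SGD-style optimizer with the three methods described above
    def __init__(self, params, lr):
        self.params, self.lr = list(params), lr
    def step(self):
        # step each weight based on its gradient
        for p in self.params:
            p.data -= p.grad * self.lr
    def zero_grad(self):
        # clear the gradients so they do not accumulate across batches
        for p in self.params:
            p.grad = None

# demo with a stand-in "parameter" carrying .data and .grad
p = SimpleNamespace(data=1.0, grad=0.5)
opt = BasicOptim([p], lr=0.1)
opt.step()        # p.data becomes 1.0 - 0.1*0.5 = 0.95
opt.zero_grad()   # p.grad cleared
```

torch.optim.SGD does the same bookkeeping (plus momentum, weight decay, etc.) over real tensor parameters.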
-Now the loop can be simplified as
def train_epoch(model):
    for xb,yb in dl:
        preds = model(xb)
        loss = mnist_loss(preds, yb)
        loss.backward()
        opt.step()
        opt.zero_grad()
-fastai provides a Learner class which we can use instead of writing this whole train_epoch function every time; to create a Learner we first need to provide everything this function needs:
8. dataloaders (train and valid stacked together)
9. model
10. loss function
11. optimizer
12. optionally, metrics to see human-level progress
[here we have not created a valid_dl]
dls = DataLoaders(dl, valid_dl)
learn = Learner(dls, linear_model, opt_func=SGD, loss_func=mnist_loss, metrics=batch_accuracy)
GOING DEEPER:
-Data is usually provided in one of the following ways:
>A table of data in which each row is an item; it may include filenames providing a connection between the data in the table and data in other formats, such as text documents and images
>Individual files representing items of data, such as text documents or images, possibly organized into folders or with filenames representing information about those items
>Exceptions to these rules, particularly in domains such as genomics, where there can be binary database formats or even network streams
Note: when the data are individual files, their labels can be extracted from the folder they are in, or from their filenames using regular expressions
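As a sketch of the regular-expression case, using hypothetical filenames that follow the common classname_number.jpg convention:

```python
import re

# hypothetical filenames with the label encoded before a trailing number
fnames = ['great_pyrenees_173.jpg', 'beagle_32.jpg', 'german_shorthaired_12.jpg']

def label_from_fname(fname):
    # capture everything before the trailing _<digits>.jpg
    match = re.match(r'(.+)_\d+\.jpg$', fname)
    return match.group(1)

labels = [label_from_fname(f) for f in fnames]
# ['great_pyrenees', 'beagle', 'german_shorthaired']
```

fastai's data block API accepts exactly this kind of function (or a raw regex) as the labelling rule when building a dataset from files.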
..........
Stable Diffusion: -
\end{document}